Anomaly detection: choose supervised when you can

Most projects scoped as 'anomaly detection' should be classification problems. The category exists in the literature, so projects get scoped under that name even when the data fits supervised classification better. A bank running a fraud-detection project, for example, often has eighteen months of confirmed fraud labels - the right approach is supervised classification on those labels, not unsupervised novelty detection on undefined 'normality'.

The wrong framing is expensive. An unsupervised model trained on 'normal' transactions flags 4% of all transactions, including most of the new product categories the bank just launched, and the fraud team buries the alerts and goes back to manual review. The algorithm wasn't the problem; the framing was.
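
A rough calculation shows why alerts at that volume get buried. The sketch below takes the 4% flag rate from the example; the fraud base rate and daily volume are hypothetical figures chosen for illustration, not numbers from the bank. Even if the model caught every fraud, the share of alerts that are real cannot exceed the base rate divided by the flag rate.

```python
def precision_ceiling(flag_rate: float, base_rate: float) -> float:
    """Best-case alert precision, assuming every true bad case gets flagged."""
    if not 0 < flag_rate <= 1 or not 0 <= base_rate <= 1:
        raise ValueError("rates must be fractions between 0 and 1")
    return min(1.0, base_rate / flag_rate)


def daily_alerts(volume: int, flag_rate: float) -> int:
    """Number of alerts the review team receives per day."""
    return round(volume * flag_rate)


FLAG_RATE = 0.04        # from the example: 4% of transactions flagged
BASE_RATE = 0.001       # hypothetical: 1 in 1,000 transactions is fraud
DAILY_VOLUME = 100_000  # hypothetical daily transaction count

ceiling = precision_ceiling(FLAG_RATE, BASE_RATE)
alerts = daily_alerts(DAILY_VOLUME, FLAG_RATE)
print(f"{alerts} alerts/day, at most {ceiling:.1%} of them real fraud")
```

Under these assumed figures, analysts would open thousands of alerts a day knowing that nearly all of them are noise, which is consistent with the team going back to manual review.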

This pattern recurs in almost every anomaly detection project. Teams reach for unsupervised methods (isolation forest, autoencoders, statistical outlier tests) when supervised methods would work better, because the problem was framed as 'detect things that look unusual' instead of 'detect things in the same class as last quarter's known bad ones'.

Supervised classification generally beats unsupervised novelty detection whenever you have labels, even imperfect ones. The threshold for 'having labels' is low: a few hundred confirmed positives is enough to start.

Three diagnostic questions

Apply these three questions before reaching for unsupervised methods.

Have you seen the bad cases before? Is there any history of the thing you're trying to catch (fraud, equipment failure, churn, security incidents)? If yes, you have labels, even if they're sparse and noisy. Use them. A supervised model trained on 500 known frauds will usually outperform an unsupervised model trained on 5 million unlabelled transactions.

Are the bad cases similar to each other? Unsupervised anomaly detection assumes only that bad cases sit far from normal; it makes no use of the fact that they resemble each other. Real fraud, in real banks, looks like other real fraud: recurring patterns, methods, signatures. It doesn't look like 'everything weird'. The supervised model learns the patterns of fraud; the unsupervised model learns the patterns of normal and flags everything else.

Are the false positives expensive? Unsupervised models tend to generate too many false positives because they're not optimised for the cost asymmetry between false positives and false negatives. A transaction flagged as fraud that turns out to be legitimate wastes an analyst's time. A false positive on equipment monitoring causes a maintenance call-out at 2am. Supervised models can be tuned to the cost asymmetry; unsupervised models can only be threshold-tuned, which is a much blunter instrument.
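
One simple way to tune a supervised model to the cost asymmetry is to choose its decision threshold against explicit costs on labelled validation data. The sketch below is a minimal illustration, not a production recipe: the scores, labels, and cost figures are all hypothetical, and the function names are invented for this example.

```python
from typing import List, Sequence


def total_cost(scores: Sequence[float], labels: Sequence[int],
               threshold: float, cost_fp: float, cost_fn: float) -> float:
    """Cost of flagging every case whose score is >= threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 0:
            cost += cost_fp   # false positive: wasted review or call-out
        elif not flagged and label == 1:
            cost += cost_fn   # false negative: missed bad case
    return cost


def best_threshold(scores: Sequence[float], labels: Sequence[int],
                   cost_fp: float, cost_fn: float) -> float:
    """Pick the candidate threshold with the lowest total cost."""
    # Candidates: every observed score, plus "flag nothing".
    candidates: List[float] = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


# Hypothetical validation scores and confirmed labels (1 = bad case).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

# Missed fraud is expensive: the threshold drops to catch every positive.
fraud_threshold = best_threshold(scores, labels, cost_fp=1, cost_fn=20)
# A false alarm is expensive (the 2am call-out): the threshold rises.
callout_threshold = best_threshold(scores, labels, cost_fp=10, cost_fn=5)
print(fraud_threshold, callout_threshold)
```

The same scores give different operating points once the costs change. This only works because the validation set has labels to score against; without them, a threshold can be moved but there is nothing to measure its cost with.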

When unsupervised is the right tool

There are real cases where unsupervised is the right tool: when the failure mode is genuinely novel and you have no historical labels (a brand-new system going live, a new sensor type); when the data is too high-dimensional to label sensibly; and when the goal is exploration rather than production decision-making (finding unusual patterns for a human to investigate, not auto-flagging them).

A practical hybrid that often beats both pure approaches: use unsupervised methods to surface candidates for human review, then build a supervised model from the confirmed labels that come back. After three months of operation, the supervised model has enough labels to take over and the unsupervised model retires.
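
The hybrid loop can be sketched in a few lines. This is an illustration under stated assumptions rather than a real pipeline: a z-score ranking stands in for whatever unsupervised method is actually used, `LabelStore` and the function names are hypothetical, the analyst verdicts are invented, and the 50-positive handoff point is an assumed default.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Dict, List, Sequence


@dataclass
class LabelStore:
    """Confirmed review outcomes, keyed by record index (hypothetical helper)."""
    confirmed: Dict[int, bool] = field(default_factory=dict)

    def add(self, index: int, is_bad: bool) -> None:
        self.confirmed[index] = is_bad

    @property
    def n_pos(self) -> int:
        # True counts as 1, so this is the number of confirmed bad cases.
        return sum(self.confirmed.values())


def surface_candidates(values: Sequence[float], k: int) -> List[int]:
    """Unsupervised step: indices of the k values furthest from the mean."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero on constant data
    ranked = sorted(range(len(values)),
                    key=lambda i: abs(values[i] - mu) / sigma,
                    reverse=True)
    return ranked[:k]


def ready_for_supervised(store: LabelStore, min_positives: int = 50) -> bool:
    """Handoff rule: switch once enough confirmed positives exist."""
    return store.n_pos >= min_positives


amounts = [10, 11, 9, 10, 95, 12, 10, -40, 11, 10]  # hypothetical data
candidates = surface_candidates(amounts, k=2)

store = LabelStore()
store.add(candidates[0], True)    # invented verdict: confirmed bad
store.add(candidates[1], False)   # invented verdict: unusual but legitimate
handoff = ready_for_supervised(store)
print(candidates, store.n_pos, handoff)
```

The design point is that every reviewed candidate becomes a label, including the rejected ones, so the review queue doubles as the training set for the supervised model that eventually replaces it.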

'Anomaly detection' is over-used as a project label, and the cost of using the wrong framing is high. The team that reframes a fraud detection project as a classification problem in week one ships a working system. The team that builds a beautiful isolation forest in week one ships an alert generator that the operations team disables in week three. The technical work is similar; the framing decides whether the work matters.

// The artefact
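# detect/framing.py: pick the model from the labels, not the project name
from typing import Optional

import pandas as pd
from sklearn.ensemble import IsolationForest  # assumed: scikit-learn's implementation

# Detector, GradientBoostingDetector, Hybrid and class_balance are
# project-local helpers that are not shown here.

def pick_detector(history: pd.DataFrame, labels: Optional[pd.Series]) -> Detector:
    # labels: 1 = confirmed bad case, 0 = everything else; None if unlabelled.
    n_pos = int(labels.sum()) if labels is not None else 0
    if n_pos >= 500:
        # Supervised wins - labels are the signal.
        # Note: returned unfitted, unlike the branches below.
        return GradientBoostingDetector(scale_pos_weight=class_balance(labels))
    if n_pos >= 50:
        # Hybrid: supervised on what we've labelled, unsupervised as backup.
        return Hybrid(
            primary  = GradientBoostingDetector().fit(...),
            fallback = IsolationForest(contamination=0.01).fit(history),
        )
    # Truly unlabelled - surface candidates for human review, build labels.
    return IsolationForest(contamination=0.01).fit(history)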
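
The snippet calls `class_balance` without defining it. One plausible reading, offered here as an assumption rather than the author's implementation, is the ratio of negatives to positives, which is the value commonly suggested for a `scale_pos_weight` parameter in gradient-boosting libraries such as XGBoost. The label counts in the example are hypothetical.

```python
from typing import Iterable


def class_balance(labels: Iterable[int]) -> float:
    """Negatives per positive, for use as a positive-class weight.

    Assumes labels are 0/1 with 1 marking a confirmed bad case.
    """
    values = list(labels)
    n_pos = sum(1 for y in values if y == 1)
    n_neg = len(values) - n_pos
    if n_pos == 0:
        raise ValueError("no positive labels: nothing to weight")
    return n_neg / n_pos


# Hypothetical label set: 500 confirmed positives among 50,000 records.
example_labels = [1] * 500 + [0] * 49_500
weight = class_balance(example_labels)
print(weight)
```

With this weighting, each confirmed positive counts as much in training as the many negatives it is outnumbered by, which is what lets a few hundred labels carry the model.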

The right detector follows from the labels you have, not the project name. Most 'anomaly detection' projects should be supervised classification by week three.