
Drift detection you can set up in an afternoon

Most machine learning systems in production have no drift detection. Nobody plans to leave it out; it just never makes it into scope. The pilot works, the production build ships, the team moves on to the next project. Three months later the model starts making bad predictions. Nobody notices for another two months, by which point a customer has complained, a dashboard has been scrutinised, and somebody has been asked to write a post-mortem that will quietly never be written.

This is almost entirely avoidable. Basic drift detection is trivially cheap. You do not need a full MLOps platform, a managed feature store, or a PhD. An afternoon of work, revisited once a quarter, covers the eighty-twenty.

What to monitor

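Three things are worth monitoring: input feature distributions, prediction distributions, and upstream data quality. All three can be implemented in a single SQL query or pandas script run nightly, with the output dumped into a small table and graphed in Metabase, Grafana, or whatever the team already uses.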

First: input feature distributions. For each numeric feature that the model consumes, compute its mean, standard deviation, and a few percentiles (p5, p25, p50, p75, p95). Compare against the same statistics from the training set. A shift of more than two training-set standard deviations in any of these, over a rolling fortnight, is worth investigating. For categorical features, track the share of each category and the appearance of any new categories that were not in the training data. Most real-world drift shows up here first: a new postcode that did not exist in your training set, a payment method that has only just become popular, a spike in null rates because an upstream pipeline is broken. All of these are legible in the input distribution before they become visible in the output.
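A minimal pandas sketch of these input checks follows. The function names are illustrative, and reading "two standard deviations" as a shift of the mean or a percentile measured in training-set standard deviations is an assumption, not the only reasonable one.

```python
import pandas as pd

PERCENTILES = [0.05, 0.25, 0.50, 0.75, 0.95]


def numeric_profile(values: pd.Series) -> pd.Series:
    """Mean, standard deviation, and p5/p25/p50/p75/p95 for one numeric feature."""
    stats = {"mean": values.mean(), "std": values.std()}
    for q in PERCENTILES:
        stats[f"p{int(round(q * 100))}"] = values.quantile(q)
    return pd.Series(stats, name=values.name)


def shifted_stats(reference: pd.Series, current: pd.Series, max_z: float = 2.0) -> list[str]:
    """Names of the statistics that moved more than max_z reference std devs.

    Assumption: only the mean and the percentiles are compared; the standard
    deviation is used as the yardstick rather than checked itself.
    """
    ref, cur = numeric_profile(reference), numeric_profile(current)
    location = [name for name in ref.index if name != "std"]
    z = (cur[location] - ref[location]) / ref["std"]
    return list(z[z.abs() > max_z].index)


def categorical_profile(reference: pd.Series, current: pd.Series) -> dict:
    """Share of each category in both windows, plus categories unseen in training."""
    ref_share = reference.value_counts(normalize=True)
    cur_share = current.value_counts(normalize=True)
    # Aligning on the union of categories; a category missing on one side gets share 0.
    shares = pd.DataFrame({"reference": ref_share, "current": cur_share}).fillna(0.0)
    new_categories = sorted(set(cur_share.index) - set(ref_share.index))
    return {"shares": shares, "new_categories": new_categories}
```

Run nightly, numeric_profile gives the rows for the small table, and anything returned by shifted_stats or listed under new_categories is a candidate for a closer look.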

Second: prediction distributions. What fraction of predictions fall into each class for classification? What are the mean and variance of the predicted value for regression? This is cheap to compute and tells you things the input distribution will not. If the inputs look stable but the prediction distribution has shifted (say, the fraction of 'approve' predictions has dropped from seventy percent to forty percent over a week), the model is now operating differently on similar inputs. This almost always means either a deployment bug or concept drift that the input features are not capturing. Both need attention; both require investigation beyond the data.
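For a classifier, the comparison can be sketched as below. The function names and the ten-point max_change threshold are assumptions for illustration; the example data reproduces the 'approve' drop described above.

```python
import pandas as pd


def class_share_shift(baseline: pd.Series, recent: pd.Series) -> pd.DataFrame:
    """Fraction of predictions per class in each window, and the change between them."""
    shares = pd.DataFrame({
        "baseline": baseline.value_counts(normalize=True),
        "recent": recent.value_counts(normalize=True),
    }).fillna(0.0)
    shares["change"] = shares["recent"] - shares["baseline"]
    return shares


def flag_shifted_classes(shares: pd.DataFrame, max_change: float = 0.10) -> list[str]:
    """Classes whose share moved by more than max_change (absolute share points)."""
    moved = shares[shares["change"].abs() > max_change]
    return sorted(moved.index)


# 'approve' falls from seventy percent of predictions to forty percent.
baseline = pd.Series(["approve"] * 70 + ["decline"] * 30)
recent = pd.Series(["approve"] * 40 + ["decline"] * 60)
shares = class_share_shift(baseline, recent)
flagged = flag_shifted_classes(shares)
```

For regression, the same idea applies with the mean and variance of the predicted value in place of class shares.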

Third: upstream data quality. The most common cause of mysterious model regression is not the model at all; it is a silent failure in the pipeline feeding it. Track null rates per column, cardinality of categoricals, and total record counts. When a pipeline starts returning a constant value for a column that should be variable, the model's predictions stop responding to real signal. When null rates jump from three percent to thirty percent on a feature that matters, the model fills in a default that is not representative of the training distribution. Basic data-quality checks catch these before they show up in the model's predictions.
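A rough sketch of these data-quality checks in pandas is below. The function names, the alert wording, and the ten-point max_null_increase threshold are all assumptions; a real pipeline would tune the threshold per column.

```python
import pandas as pd


def quality_snapshot(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: null rate, distinct non-null values, constant flag, row count."""
    snapshot = pd.DataFrame({
        "null_rate": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })
    snapshot["is_constant"] = snapshot["n_unique"] <= 1
    snapshot["row_count"] = len(df)
    return snapshot


def quality_alerts(reference: pd.DataFrame, current: pd.DataFrame,
                   max_null_increase: float = 0.10) -> list[str]:
    """Compare two snapshots and describe columns that look broken upstream."""
    ref, cur = quality_snapshot(reference), quality_snapshot(current)
    alerts = []
    for column in ref.index.intersection(cur.index):
        increase = cur.loc[column, "null_rate"] - ref.loc[column, "null_rate"]
        if increase > max_null_increase:
            alerts.append(f"{column}: null rate up {increase:.0%}")
        # A column that used to vary and now holds a single value is suspicious.
        if cur.loc[column, "is_constant"] and not ref.loc[column, "is_constant"]:
            alerts.append(f"{column}: now constant")
    return alerts
```

The snapshot is what goes into the nightly table; the alerts list is the part worth wiring to a notification.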

What you do not need on day one: formal statistical drift tests. Kolmogorov-Smirnov, Population Stability Index, and Wasserstein distance are all useful, but none of them is the first thing to set up. Drift tests quantify how far apart two distributions are, or how unlikely the gap is to be chance; they do not tell you that a new category of loan application is appearing, or that one of your upstream systems has started returning stale data. Start with the basics, and add the tests once those are actually in place.
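For when that later stage arrives, here is one possible sketch of a Population Stability Index in NumPy. The choice of ten bins cut at reference quantiles and the eps clipping of empty bins are common conventions assumed here, not something the approach above depends on, and the inputs are assumed to contain no missing values.

```python
import numpy as np


def population_stability_index(reference, current, n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between two numeric samples, using bins cut at reference quantiles."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges from the reference sample; duplicate edges collapse for discrete data.
    edges = np.unique(np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1]))
    # searchsorted assigns every value a bin index between 0 and len(edges).
    ref_counts = np.bincount(np.searchsorted(edges, reference), minlength=len(edges) + 1)
    cur_counts = np.bincount(np.searchsorted(edges, current), minlength=len(edges) + 1)
    # Clip shares away from zero so an empty bin cannot produce log(0) or a division by zero.
    ref_share = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_share = np.clip(cur_counts / cur_counts.sum(), eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))
```

Identical samples give a PSI of zero, and the value grows as mass moves between bins. It is a single number per feature, which is exactly why it cannot say what changed.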

Three patterns you will see

Three patterns are worth naming, because you will see them recur. One: a feature's mean shifts by two standard deviations over two weeks. This is almost always an upstream data issue (a source system changing how it calculates something, a default value getting applied where nulls used to be, a logging schema update). Two: a formerly-rare category suddenly spikes. This is usually a business change (a new product, a new geography, a promotion that altered user behaviour) rather than a bug, but it is always worth knowing about before the model does anything expensive with it. Three: the prediction distribution stays stable but accuracy, where you have labels, drops. This is concept drift. The model still thinks the world looks the way it used to, but the world has changed. The fix is retraining, not debugging the pipeline.
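The third pattern only becomes visible if accuracy is tracked wherever labels arrive. A minimal sketch, assuming a scored table with hypothetical columns timestamp, prediction, and label (label left empty until the outcome is known), and an assumed five-point tolerance:

```python
import pandas as pd


def weekly_accuracy(scored: pd.DataFrame) -> pd.Series:
    """Accuracy per calendar week, using only rows whose label has arrived."""
    labelled = scored.dropna(subset=["label"])
    correct = (labelled["prediction"] == labelled["label"]).astype(float)
    return correct.groupby(labelled["timestamp"].dt.to_period("W")).mean()


def accuracy_dropped(accuracy: pd.Series, baseline: float, max_drop: float = 0.05) -> bool:
    """True if the most recent week sits more than max_drop below the baseline accuracy."""
    return bool(baseline - accuracy.iloc[-1] > max_drop)


scored = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-08", "2024-01-09", "2024-01-10"]
    ),
    "prediction": ["approve", "decline", "approve", "approve", "approve"],
    "label": ["approve", "decline", "approve", "decline", None],  # last outcome not yet known
})
accuracy = weekly_accuracy(scored)
needs_retraining_review = accuracy_dropped(accuracy, baseline=1.0)
```

The baseline here would typically be the accuracy measured at training time. If this flag fires while the prediction distribution looks unchanged, that points towards retraining rather than pipeline debugging.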

You do not have to do drift detection well to do it enormously better than 'we'll check the logs when a customer complains'. Do it badly, do it today. The team that spends an afternoon on this in month one will save weeks of incident response in month seven.

// The artefact
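# monitors/drift.py: the basic check, runnable in a nightly cron
import logging

import pandas as pd

alert = logging.warning  # placeholder: swap in whatever alerting hook the team already uses

def feature_drift(reference: pd.Series, current: pd.Series, max_z: float = 2.0) -> bool:
    # z is the shift in the current mean, measured in reference standard deviations
    z = (current.mean() - reference.mean()) / reference.std()
    if abs(z) > max_z:
        alert(f"{reference.name}: mean shift z={z:.2f}")
        return True
    return False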

No feature store, no platform. A z-score on each feature's mean, recomputed nightly over a rolling window, catches the most common production failures.