Boring beats brilliant in production ML

It's Wednesday morning. The fraud detection model that worked beautifully in last month's pilot is now flagging seventy percent of transactions as suspect. The ops team's phone is ringing off the hook. Nobody has the faintest idea why.

There's no rollback mechanism. No model registry. No drift dashboard. The data scientist who trained it has been on leave for three weeks and the parameters live in a Jupyter notebook that currently can't be found. The team is deciding, in real time and under pressure, whether to turn the model off entirely.

This is the scenario most ML projects are secretly one week away from. And the cause is almost never the model. It's the infrastructure the model was allowed to ship without.

A model you can monitor, retrain, and roll back is worth ten that almost work. In practice that means making a series of deliberately unglamorous choices at every decision point: simpler architectures where they're good enough, batch processing where real-time isn't strictly required, linear baselines you actually trust. The interesting failures in production ML rarely come from the model itself. They come from forgotten features, silent data drift, upstream pipelines that quietly started returning nulls, and, most often, the absence of any mechanism to detect these things before they become Wednesday mornings.

Three places where boring wins

Here are three places where the boring choice consistently outperforms the brilliant one.

First: gradient-boosted trees over deep learning, on tabular data. If your features sit in a relational database (customer attributes, transaction history, product metadata), XGBoost, LightGBM, or CatBoost will usually match or beat a deep neural network on accuracy, and almost always on operational cost. This has been measured repeatedly in the literature and confirmed by anyone who's actually had to deploy both. Tree models often train in minutes on a laptop, produce feature importances you can sanity-check, and fail in predictable ways when input distributions shift. A neural net typically takes longer to train, often needs a GPU to serve at useful latency, and has failure modes that look indistinguishable from correct behaviour until you check the calibration. The one-to-two percent accuracy bump you sometimes get from the network is rarely worth the several-fold increase in operational complexity.

Second: batch inference over real-time. Whenever a requirement lands that specifies 'real-time predictions', the first question to ask is whether the user's experience genuinely changes if the prediction arrives two hundred milliseconds later, or two hundred minutes later. For fraud detection on card-present transactions: milliseconds matter. For credit limit recommendations: hours are fine. For churn scoring that feeds a weekly campaign: days are fine. Most 'real-time' requirements are real-time-ish, and the payoff from turning a real-time service into a nightly batch job is usually an order-of-magnitude reduction in infrastructure, monitoring, and failure surface. The batch pipeline writes output to a table anyone can read with SQL. The real-time service needs a message bus, a low-latency inference server, a feature store, a cache invalidation strategy, and an on-call rotation. If you don't actually need the latency, the complexity is a tax you pay forever.
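As a rough illustration of the batch shape, the sketch below scores every customer in one pass and writes the results to a table that anyone can query with SQL. It uses SQLite from the standard library. The table names, the `score_customer` rule, and its coefficients are hypothetical stand-ins, not a real churn model.

```python
import sqlite3
from datetime import date


def score_customer(days_since_last_order: int, orders_last_90d: int) -> float:
    """Hypothetical churn score in [0, 1]; a stand-in for a real model."""
    raw = 0.01 * days_since_last_order - 0.1 * orders_last_90d
    return max(0.0, min(1.0, raw))


def run_nightly_batch(conn: sqlite3.Connection, run_date: date) -> int:
    """Score every customer once and write the results to a plain SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS churn_scores ("
        "run_date TEXT, customer_id INTEGER, score REAL, "
        "PRIMARY KEY (run_date, customer_id))"
    )
    rows = conn.execute(
        "SELECT customer_id, days_since_last_order, orders_last_90d FROM customers"
    ).fetchall()
    scored = [
        (run_date.isoformat(), customer_id, score_customer(days, orders))
        for customer_id, days, orders in rows
    ]
    # INSERT OR REPLACE keeps a rerun for the same date idempotent.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO churn_scores VALUES (?, ?, ?)", scored
        )
    return len(scored)
```

Because the job is keyed on the run date, rerunning a failed night overwrites that night's rows instead of duplicating them, which is most of what "retry" needs to mean for a batch system.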

Third: linear baselines you can actually explain. Before deploying anything sophisticated, build the dumbest thing that could possibly work. For credit risk: logistic regression on a dozen well-chosen features. For document classification: TF-IDF plus a linear SVM. For demand forecasting: seasonal naive. Then measure. Roughly half the time, this baseline is competitive enough with the state-of-the-art approach that the simpler model is obviously the right production choice. The other half of the time, you've learned exactly how much the complexity is buying you, in quantifiable units, and you have a legible comparison to show stakeholders. Boring baselines have a second property that's underrated: they survive being left alone. A logistic regression still goes stale when the world changes, but recovering is cheap: its coefficients are printed out and filed, and you can re-fit it in an afternoon if you have to. The fine-tuned transformer, meanwhile, is a black box that was trained on a specific snapshot of your data, using a specific library version, on a specific machine that may or may not still exist.
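For the forecasting case, the dumbest thing that could possibly work fits in a few lines. The sketch below implements a seasonal naive forecast (repeat the value observed one season ago) and a mean absolute error to measure it with, using only the standard library. The weekly season length is an illustrative assumption.

```python
from typing import Sequence


def seasonal_naive(history: Sequence[float], horizon: int, season: int = 7) -> list[float]:
    """Forecast each future step as the value observed one season earlier."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    last_season = list(history[-season:])
    # Cycle through the last observed season when the horizon exceeds it.
    return [last_season[i % season] for i in range(horizon)]


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute gap between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

Any fancier forecaster then has to beat `mean_absolute_error(actual, seasonal_naive(history, len(actual)))` on the same holdout window before it earns a place in production.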

The counter-argument is always the same: but the boring model is two percent worse. And the answer is always the same: a system that gives up two percent of accuracy but stays in production for three years will deliver far more than one that gains five percent and gets turned off in six months because nobody understands it anymore. That's not hyperbole. It's roughly what happens, in sequence, in most shops that over-index on leaderboard metrics and under-index on operational simplicity.

Accuracy is not a conserved quantity. It degrades silently the moment the world moves and nobody is watching. If the monitoring story is 'we'll check the logs when something goes wrong', the accuracy number on the slide deck is a snapshot, not a forecast.
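One modest way to make "somebody is watching" concrete is a scheduled check that compares a feature's live distribution against its training distribution. The sketch below computes the population stability index (PSI) over equal-width bins in plain Python. The bin count, the epsilon floor, and the 0.2 alert threshold are conventional rules of thumb assumed here, not values from this article.

```python
import math
from typing import Sequence

PSI_ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff; tune per feature


def _bin_fractions(
    values: Sequence[float], lo: float, hi: float, bins: int, eps: float
) -> list[float]:
    """Share of values in each equal-width bin over [lo, hi], floored at eps."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so live values outside the training range land in the edge bins.
        idx = max(0, min(bins - 1, int((v - lo) / width)))
        counts[idx] += 1
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10, eps: float = 1e-6
) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    if lo == hi:
        raise ValueError("training sample has no spread; PSI is undefined")
    e = _bin_fractions(expected, lo, hi, bins, eps)
    a = _bin_fractions(actual, lo, hi, bins, eps)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def has_drifted(expected: Sequence[float], actual: Sequence[float]) -> bool:
    """True when the live sample has moved past the assumed alert threshold."""
    return population_stability_index(expected, actual) > PSI_ALERT_THRESHOLD
```

Run per feature on a schedule, a check like this turns silent drift into an alert someone receives before the customer calls.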

Five questions worth asking before shipping. Is there a simpler model we could try first, and did we benchmark against it honestly? Can we retrain on a new week of data in under an hour, without the original trainer being present? If this model starts returning garbage tomorrow, how fast can we roll back, and what do we roll back to? Is the batch version of this problem actually unacceptable, or is it just less exciting? When the inputs drift, will we know, or will the customer tell us?

Any of those questions you can't answer comfortably marks a place where boring is probably the right answer.
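The rollback question in particular has a cheap answer. The sketch below is a hypothetical file-backed registry: each promoted version is stored as JSON next to a promotion history, so "what do we roll back to" is always the previous entry. The class name, file layout, and JSON-serialisable parameters are assumptions made for illustration; a real deployment would more likely use an existing registry tool.

```python
import json
from pathlib import Path


class ModelRegistry:
    """Hypothetical registry: versioned parameter files plus a promotion history."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self._history_file = self.root / "history.json"

    def _history(self) -> list[str]:
        if self._history_file.exists():
            return json.loads(self._history_file.read_text())
        return []

    def _artifact(self, version: str) -> Path:
        return self.root / f"model-{version}.json"

    def promote(self, version: str, params: dict) -> None:
        """Store the parameters and mark this version as live."""
        self._artifact(version).write_text(json.dumps(params))
        self._history_file.write_text(json.dumps(self._history() + [version]))

    def current(self) -> tuple[str, dict]:
        """Return the live version and its stored parameters."""
        history = self._history()
        if not history:
            raise LookupError("no model has been promoted yet")
        version = history[-1]
        return version, json.loads(self._artifact(version).read_text())

    def rollback(self) -> str:
        """Drop the live version from the history and return the one now live."""
        history = self._history()
        if len(history) < 2:
            raise LookupError("nothing to roll back to")
        self._history_file.write_text(json.dumps(history[:-1]))
        return history[-2]
```

The rolled-back artifact stays on disk, so the failing version can still be inspected after the fact.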

// The artefact
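# baselines/first.py: fit this before anything fancy; only beat it measurably
from typing import NamedTuple
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

class Baseline(NamedTuple):  # assumed container; the original snippet does not define it
    model: LogisticRegression
    score: float  # average precision on the holdout set

def fit_baseline(X_train, y_train, X_holdout, y_holdout) -> Baseline:
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    score = average_precision_score(y_holdout, model.predict_proba(X_holdout)[:, 1])
    print(f"Baseline AP = {score:.3f}. Beat this measurably or ship the baseline.")
    return Baseline(model=model, score=score)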

Logistic regression. Trains on a laptop in seconds. Half the time, it ships. The other half, you've measured the actual gain.