
Pick a metric that reflects what you actually want

The most important decision in a machine learning project is the first one: what metric will the model be evaluated on? Most projects that are eventually judged to have failed did not fail at modelling. They optimised the wrong metric and succeeded, by that metric, all the way to an unusable system.

Consider a fraud detection model with ninety-nine percent accuracy that missed every fraudulent transaction. By its own metric the model was doing well: fraud was only one percent of the data, so predicting 'not fraud' for every transaction yields ninety-nine percent accuracy by construction. The model was useless in practice. Everyone on the project knew the outcome was wrong. Nobody had planned in advance what 'right' would look like, beyond 'high accuracy'.
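The arithmetic is easy to reproduce. A minimal sketch, assuming a synthetic set of 10,000 transactions with a one percent fraud rate (sized only to match the example above) and a hypothetical `always_not_fraud` baseline that stands in for the model:

```python
# Synthetic labels: 1 = fraud, 0 = legitimate. The 1% rate is illustrative.
labels = [1] * 100 + [0] * 9_900

def always_not_fraud(_transaction) -> int:
    """Hypothetical baseline that never flags anything."""
    return 0

preds = [always_not_fraud(t) for t in labels]

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
frauds_caught = sum(p == 1 and y == 1 for p, y in zip(preds, labels))

print(f"accuracy: {accuracy:.2%}")        # accuracy: 99.00%
print(f"frauds caught: {frauds_caught}")  # frauds caught: 0
```

Any model that cannot beat this baseline on the metric you care about has not learned anything useful, whatever its accuracy says.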

This pattern is not rare. It is the default failure mode of ML projects in businesses that are not already rigorous about evaluation. Business questions (reduce fraud, improve recommendations, forecast demand) are usually fuzzier than the metrics they get translated into. When the project starts, 'reduce fraud' gets translated into 'maximise accuracy' because accuracy is easy to compute. Months later, the model has maximised accuracy and the business has the same amount of fraud.

Common metric failures

The first is accuracy on imbalanced data. If one class is rare, accuracy is dominated by the majority class and says almost nothing about performance on the class that matters. Precision (what fraction of predicted positives are actually positive), recall (what fraction of actual positives the model found), F1 (their harmonic mean), and AUC (ranking quality across all classification thresholds) all exist precisely for this situation. The model that optimises one of these is usually materially different from the model that optimises accuracy, and only one of the two is the model the business wants.
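To make the definitions concrete, here is a small sketch that computes them from raw labels. The function name `precision_recall_f1`, the toy data, and the choice to return 0.0 when a ratio is undefined are all assumptions made for illustration:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    # Assumption: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data: 4 positives among 20 examples; the model flags 5, 3 of them correctly.
y_true = [1, 1, 1, 1] + [0] * 16
y_pred = [1, 1, 1, 0] + [1, 1] + [0] * 14

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.85 precision=0.60 recall=0.75 f1=0.67
```

Accuracy looks comfortable at 0.85, yet two of every five flags are wrong and one positive in four is missed.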

The second is optimising a training-time loss that isn't the business metric. A credit-risk model trained to minimise log-loss is a reasonable starting point. But the business metric is dollars: specifically, dollars of loss avoided minus dollars of good business turned away by false positives. These are not the same thing. A model with slightly worse log-loss but better-calibrated probabilities near the decision threshold might be materially better for the business, and you'd never see that if you only looked at the training metric. The right move, at the end of a training run, is to simulate the model's decisions on a held-out set and report the business metric, in dollars, not the training metric.
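One way to run that simulation is to sweep the decision threshold and report net dollars at each setting. This is a sketch under stated assumptions: the function `net_dollars`, the parameters `loss_per_default` and `margin_per_good_loan`, and the scores and outcomes are all hypothetical, and a flat dollar amount per loan is a simplification.

```python
def net_dollars(scores, labels, threshold, loss_per_default, margin_per_good_loan):
    """Dollars of loss avoided minus dollars of good business declined.

    An applicant is declined when score >= threshold, where score is the
    predicted probability of default and label 1 means they did default.
    """
    declined = [(s, y) for s, y in zip(scores, labels) if s >= threshold]
    avoided = sum(loss_per_default for _, y in declined if y == 1)
    forgone = sum(margin_per_good_loan for _, y in declined if y == 0)
    return avoided - forgone

# Hypothetical held-out scores and outcomes, for illustration only.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    value = net_dollars(scores, labels, threshold,
                        loss_per_default=5_000, margin_per_good_loan=800)
    print(f"threshold={threshold}: net ${value:,}")
```

With these made-up costs a default is far more expensive than a declined good loan, so the lowest threshold wins. Change the cost ratio and the best threshold moves, which is exactly the information log-loss cannot give you.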

The third is collapsing a multi-objective problem into a single number. Most real business problems have genuine tradeoffs. A recommendation system trades engagement against diversity. A churn model trades precision against recall. A pricing model trades revenue per customer against total customer count. Reporting a single composite metric forces an implicit weighting of these tradeoffs, and whoever chose the weighting usually did not consult the business. Better practice: report the Pareto frontier, the set of operating points where no objective can be improved without giving up another. Show the business what the tradeoff looks like across a few of those points. Let them choose the point, explicitly.
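Filtering candidate operating points down to the frontier takes only a few lines. The sketch below assumes two objectives where higher is better (precision and recall, as in the churn example); the function name `pareto_frontier` and the candidate numbers are invented for illustration.

```python
def pareto_frontier(points):
    """Keep operating points that are not dominated on (precision, recall).

    A point is dominated if another point is at least as good on both
    metrics and strictly better on at least one.
    """
    frontier = []
    for name, p, r in points:
        dominated = any(
            (p2 >= p and r2 >= r) and (p2 > p or r2 > r)
            for _, p2, r2 in points
        )
        if not dominated:
            frontier.append((name, p, r))
    return frontier

# Hypothetical operating points: (label, precision, recall).
candidates = [
    ("threshold=0.2", 0.40, 0.95),
    ("threshold=0.4", 0.60, 0.85),
    ("threshold=0.5", 0.58, 0.80),  # worse than threshold=0.4 on both metrics
    ("threshold=0.6", 0.75, 0.70),
    ("threshold=0.8", 0.90, 0.45),
]

for name, p, r in pareto_frontier(candidates):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

The dominated point drops out and the remaining four are the choices worth putting in front of the business.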

Evaluation set composition matters at least as much as metric choice.

Randomly sampling the evaluation set is the default; it is also often wrong. If the rare cases are the cases you care about most, a random sample underweights them. A stratified sample, with explicit over-representation of the cases that drive business value, gives you a more useful picture, even if it's not a representative sample of production traffic. You can always weight the final number back to production proportions if you need to.
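Weighting back is a matter of scoring each stratum separately and then combining the scores by production share. A minimal sketch, assuming accuracy as the per-stratum metric and invented numbers throughout; `reweighted_accuracy` and `production_share` are hypothetical names:

```python
def reweighted_accuracy(results, production_share):
    """Weight per-stratum accuracy by each stratum's share of production traffic.

    results: {stratum: list of booleans, True where the prediction was correct}
    production_share: {stratum: fraction of production traffic}, summing to 1
    """
    per_stratum = {s: sum(r) / len(r) for s, r in results.items()}
    overall = sum(per_stratum[s] * production_share[s] for s in per_stratum)
    return per_stratum, overall

# Hypothetical evaluation set: fraud is 1% of production but half the eval set.
results = {
    "fraud": [True] * 60 + [False] * 40,  # 60% correct on 100 fraud cases
    "legit": [True] * 98 + [False] * 2,   # 98% correct on 100 legitimate cases
}
production_share = {"fraud": 0.01, "legit": 0.99}

per_stratum, overall = reweighted_accuracy(results, production_share)
print(per_stratum)                                  # {'fraud': 0.6, 'legit': 0.98}
print(f"production-weighted accuracy: {overall:.4f}")  # production-weighted accuracy: 0.9762
```

The per-stratum numbers are the ones to read first. The weighted figure is there for anyone who needs a production-level headline, and it hides the 60% on fraud almost completely.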

Time-aware splits beat random splits for anything that will run in production. Training on a random eighty percent of history and testing on the remaining twenty percent tells you how well the model would have done if it could see the future during training. That is not what it will do in production. Training on, say, everything up to January, and testing on February to April, tells you how well the model will do on actual new data. The numbers are almost always worse than the random split suggests. That's the point.
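A time-aware split needs nothing more than a cutoff date. The sketch below assumes each record carries a `date` field; the helper `time_split`, the field name, and the specific years are hypothetical.

```python
from datetime import date

def time_split(rows, train_end, test_end):
    """Train on rows dated up to train_end; test on rows after it, up to test_end."""
    train = [r for r in rows if r["date"] <= train_end]
    test = [r for r in rows if train_end < r["date"] <= test_end]
    return train, test

# Hypothetical records; only the "date" field matters for the split.
rows = [
    {"date": date(2023, 11, 15), "y": 0},
    {"date": date(2024, 1, 20), "y": 1},
    {"date": date(2024, 2, 3), "y": 0},
    {"date": date(2024, 4, 30), "y": 1},
    {"date": date(2024, 5, 2), "y": 0},
]

# Train on everything up to the end of January, test on February to April.
train, test = time_split(rows, train_end=date(2024, 1, 31), test_end=date(2024, 4, 30))

# No test row may predate any training row.
assert max(r["date"] for r in train) < min(r["date"] for r in test)
```

The same cutoff has to apply to feature computation too: any aggregate built from data after `train_end` leaks the future back into training.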

The model that wins on the right metric almost always loses on other metrics. That's fine, and is in fact a sign you chose a real metric. What goes wrong is when the metric is chosen by default, before anyone has thought about what the business is actually asking for. By the time the model exists, changing the metric means changing the project, which is rarely negotiable mid-flight, and by then you've already trained a model to optimise for the wrong thing.

Pick the metric deliberately, at the start. Whatever you pick becomes the thing the system optimises for. That's the deal. Choose accordingly.

// The artefact

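# eval/business_value.py: score in dollars, not log-loss
import pandas as pd

def evaluate(model, holdout: pd.DataFrame, fp_cost: float, fn_cost: float,
             threshold: float = 0.5) -> float:
    """Negative dollar cost of the model's errors on the holdout set (higher is better)."""
    proba = model.predict_proba(holdout.drop(columns="y"))[:, 1]  # P(y == 1)
    preds = proba > threshold  # 0.5 is only a default; tune it against the costs
    actual = holdout["y"].to_numpy() == 1
    fp = (preds & ~actual).sum()  # flagged, but actually negative
    fn = (~preds & actual).sum()  # missed positives
    return -float(fp * fp_cost + fn * fn_cost)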

Train on log-loss; evaluate on the metric the business pays in. They're rarely the same number.
