Labels are the moat: why your training data matters more than your model

A common belief in contemporary machine learning: the model is the hard part. Foundation models, fine-tuning, architecture search, and hyperparameter optimisation absorb the majority of technical conversations, conference talks, and engineering hires.

This is almost backwards.

In most production ML problems, the thing that most constrains your final accuracy is not the choice of model. It is the quality of the labels in your training data. Switching from a logistic regression to a transformer might buy you three to five percentage points of accuracy. Improving your label quality by twenty percent (not by labelling more data, just by making the existing labels more consistent) routinely buys you ten to fifteen percentage points. Most practitioners discover this the hard way, after six months of chasing model improvements that turn out to be dominated by label noise they didn't know they had.

Why labels dominate

Labels dominate for three reasons. The first is that models learn whatever signal is in the labels, including the noise. If two annotators disagree thirty percent of the time on which category a document belongs to, the model can only do as well as the annotators on the easy cases, and worse than either on the hard ones. The hard cases are where the business value lives. Your labels implicitly set a ceiling on accuracy, and most teams never measure that ceiling.
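
To put a rough number on that ceiling, here is a back-of-envelope sketch. It rests on strong assumptions that are mine, not measurements: a binary task, and two annotators who each make mistakes independently at the same rate e. Under that toy model the annotators disagree with probability 2e(1 - e), so an observed disagreement rate can be inverted to an implied per-annotator error rate. The function name is hypothetical.

```python
import math


def implied_annotator_error(disagreement_rate: float) -> float:
    """Invert d = 2e(1 - e) for e, the per-annotator error rate.

    Assumes a binary task and two annotators with equal, independent
    error rates. Real annotators rarely satisfy this, so treat the
    result as an order-of-magnitude estimate only.
    """
    if not 0.0 <= disagreement_rate <= 0.5:
        raise ValueError("under this model, disagreement must lie in [0, 0.5]")
    # Smaller root of e^2 - e + d/2 = 0.
    return (1.0 - math.sqrt(1.0 - 2.0 * disagreement_rate)) / 2.0


error = implied_annotator_error(0.30)
# A model that always predicted the true label would still only match
# one annotator's labels (1 - error) of the time.
measured_ceiling = 1.0 - error
print(f"implied error ~ {error:.2f}, measured ceiling ~ {measured_ceiling:.2f}")
```

Under these assumptions, thirty percent disagreement implies that each annotator is wrong on roughly 18 percent of items, so even a perfect model would score only about 82 percent when evaluated against one annotator's labels.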

The second is that label errors aren't random. Systematic biases in labelling produce systematic biases in predictions that look like model failures. An annotator who consistently mislabels applications from a particular demographic will produce a model that discriminates against that demographic, and the model will be blamed even though the root cause is human. Fairness audits often find this pattern; they rarely fix it, because fixing it means going back to the labels.

The third is that labels age. The definition of 'customer churn' in 2023 is subtly different from the definition in 2025, because the business's understanding of what counts as churn has evolved. The model trained on 2023 labels predicts 2023-style churn, which is not what the business cares about anymore. Keeping labels current is a business process, not a modelling task, and it's the one most teams never set up.

So what does this mean in practice?

Before hiring a third data scientist, measure inter-annotator agreement on your existing labelled data. Take a random sample of two hundred items. Have two people label them independently. Compare. If the agreement rate is below ninety percent, your label ceiling is probably the problem, not the model. Measure it first.
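
A minimal sketch of that measurement, using only the standard library. The function names, the fixed seed, and the example labels are illustrative assumptions. Note that a raw agreement rate does not correct for agreement that would happen by chance; Cohen's κ does.

```python
import random
from collections import Counter


def sample_for_double_annotation(item_ids, k=200, seed=0):
    """Draw a reproducible random sample of item ids to label twice."""
    ids = list(item_ids)
    rng = random.Random(seed)  # fixed seed so the sample can be re-drawn
    return rng.sample(ids, min(k, len(ids)))


def agreement_rate(labels_a, labels_b):
    """Fraction of items on which the two annotators gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("need at least one labelled item")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def disagreement_pairs(labels_a, labels_b):
    """Count which (label_a, label_b) confusions occur, most common first."""
    return Counter((a, b) for a, b in zip(labels_a, labels_b) if a != b)


# Hypothetical labels from two annotators on the same four items.
a = ["invoice", "complaint", "invoice", "complaint"]
b = ["invoice", "complaint", "complaint", "complaint"]
print(agreement_rate(a, b))      # 3 of 4 items match
print(disagreement_pairs(a, b))  # which categories get confused
```

The confusion counts are often more useful than the headline rate, because they show which category boundaries the guidelines need to spell out.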

Invest in annotation guidelines before annotation tools. The tool only matters if the humans using it agree on what they're labelling. A written one-page guide, with twenty worked examples and a section for edge cases, is the single highest-leverage artefact in most ML projects. It's also the one most often skipped.

Treat labels as versioned data. When the business definition of a concept changes, you don't update the labels in place; you version them. Train two models, compare their behaviour, and know exactly what changed. Most teams update labels in place and then wonder why their model's behaviour changed mysteriously in production.
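
One way to picture versioned labels is sketched below. The `LabelVersion` class, its fields, and the example definitions are hypothetical. The point is only that each version keeps its own definition and its own labels, so the difference between two versions can be inspected instead of guessed at.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelVersion:
    """An immutable snapshot of labels under one business definition."""
    version: str            # e.g. "churn-2023"
    definition: str         # the written definition annotators followed
    labels: dict[str, str]  # item id -> label


def diff_versions(old: LabelVersion, new: LabelVersion) -> dict[str, tuple[str, str]]:
    """Return items labelled in both versions whose label changed."""
    shared = old.labels.keys() & new.labels.keys()
    return {
        item: (old.labels[item], new.labels[item])
        for item in sorted(shared)
        if old.labels[item] != new.labels[item]
    }


# Hypothetical example: the churn definition is tightened between versions.
v2023 = LabelVersion("churn-2023", "cancelled subscription",
                     {"c1": "churn", "c2": "active", "c3": "active"})
v2025 = LabelVersion("churn-2025", "cancelled or inactive for 90 days",
                     {"c1": "churn", "c2": "churn", "c4": "active"})
print(diff_versions(v2023, v2025))
```

Training one model per version and comparing predictions on the changed items shows which behaviour shifts come from the definition and which come from the model.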

Build a disagreement process. When an annotator is unsure, they should escalate, not guess. An escalation queue, reviewed weekly by a subject matter expert, catches the boundary cases that would otherwise become silent label noise. This process produces better labels and, as a side effect, a running catalogue of the edge cases that most need attention in the model.
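
The escalation process can be sketched as a small data structure. Everything here (the class names, the `submit` and `resolve` methods, the example items) is an illustrative assumption, not a reference to any particular annotation tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Escalation:
    item_id: str
    annotator: str
    note: str


@dataclass
class LabelStore:
    labels: dict[str, str] = field(default_factory=dict)
    queue: list[Escalation] = field(default_factory=list)
    # Running catalogue of resolved boundary cases: (item, label, rationale).
    edge_cases: list[tuple[str, str, str]] = field(default_factory=list)

    def submit(self, item_id: str, annotator: str,
               label: Optional[str] = None, note: str = "") -> None:
        """Record a label, or escalate when the annotator is unsure."""
        if label is None:
            self.queue.append(Escalation(item_id, annotator, note))
        else:
            self.labels[item_id] = label

    def resolve(self, item_id: str, expert_label: str, rationale: str) -> None:
        """Expert review: label the item and log it as an edge case."""
        remaining = [e for e in self.queue if e.item_id != item_id]
        if len(remaining) == len(self.queue):
            raise KeyError(f"{item_id} is not in the escalation queue")
        self.queue = remaining
        self.labels[item_id] = expert_label
        self.edge_cases.append((item_id, expert_label, rationale))


store = LabelStore()
store.submit("doc-1", "ana", label="spam")
store.submit("doc-2", "ana", note="newsletter the user signed up for?")
store.resolve("doc-2", "not_spam", "opted-in newsletters are not spam")
```

The `edge_cases` list is the catalogue described above: it can feed the next revision of the annotation guidelines as well as a targeted evaluation set.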

Use active learning for the second round, not the first. Don't ask a model to tell you what to label until you have a first model you trust. Random sampling for the first batch of labels is underrated, because it exposes you to the actual distribution of the problem. Active learning on top of a badly calibrated first model amplifies the biases in that first model.
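
The two rounds can be sketched as follows. The second round uses uncertainty sampling for a binary task, which is one common active learning strategy and my choice for illustration. The function names and the example probabilities are assumptions.

```python
import random


def first_batch(item_ids, k, seed=0):
    """Round one: a plain random sample, so the labelled set mirrors
    the real distribution of the problem."""
    ids = list(item_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def second_batch(predicted_probs, k):
    """Round two: uncertainty sampling for a binary task.

    predicted_probs maps item id -> the first model's probability of the
    positive class. Items closest to 0.5 are the ones the model is least
    sure about. This is only sensible if those probabilities can be trusted.
    """
    ranked = sorted(predicted_probs, key=lambda i: abs(predicted_probs[i] - 0.5))
    return ranked[:k]


# Hypothetical scores from a first model trained on the random batch.
probs = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45}
print(second_batch(probs, 2))  # the two items nearest the decision boundary
```

If the first model's probabilities are badly calibrated, the ranking in `second_batch` is ordering items by the model's bias, which is the amplification the paragraph above warns about.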

Foundation models, fine-tuning, and architecture choices are real levers. But they are second-order levers. The first-order lever is whether the humans agree on what they're training the model to do, and whether you can detect when they stop agreeing. A team that spends forty percent of its time on the labels, twenty percent on the model, and forty percent on evaluation will routinely outperform a team that inverts those numbers.

The model is a commodity. The labels are the moat.

The artefact
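# labels/agreement.py: measure your label ceiling before measuring your model
from sklearn.metrics import cohen_kappa_score

class LabelQualityError(Exception):
    """Raised when inter-annotator agreement falls below the threshold."""

def check_agreement(annotator_a: list[str], annotator_b: list[str], min_kappa: float = 0.8) -> None:
    kappa = cohen_kappa_score(annotator_a, annotator_b)  # chance-corrected agreement
    if kappa < min_kappa:
        raise LabelQualityError(f"Inter-annotator κ={kappa:.2f} below {min_kappa}; fix labels first.")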

Cohen's κ on a 200-item double-annotated sample. Below 0.8, the labels are the bottleneck, not the model.