Eval sets are contracts, not afterthoughts

It's week four of a six-week pilot. The model has been trained, the integration is half-built, and someone asks the question that should have been asked in week one: how do we know it's good?

The team scrambles. They write fifteen test cases. The model passes most of them. Everyone declares victory. The model ships. Six months later it has a measured accuracy of 93 percent on those fifteen cases (14 of 15) and a measured accuracy of nobody-quite-knows on the remaining 95 percent of inputs the system actually sees.

The eval set was written last. It was written by the people who built the model. It was written under time pressure with the model's strengths in mind, not the user's needs. And it became the criterion against which the system was judged for the next three years. This is the most common quiet failure of ML projects, and it's almost entirely structural.

Why this happens

An eval set is a contract. It's the document that says: here is what 'good' means, written down, with concrete examples. The model is judged against the contract. The contract should be hard to satisfy and easy to read. It should be written by someone who understands the use case from the user's side, not from the model's side. And it should exist before any modelling starts, because otherwise the model will quietly grow up to fit whatever ad-hoc tests survive the rush of week five.

How to do it right

The right way is dull and slow. In week one, sit with the people who will judge whether the system is working (the ops team, the analysts, the subject-matter experts or SMEs, the customers) and write down 50 to 200 examples of inputs paired with the answers that should be returned. Each example has a rationale: why this answer, not another. The examples cover the boring 80 percent (so the system has somewhere to land) and the awkward 20 percent (so the system fails honestly). When two SMEs disagree about the right answer, that's not a problem with the eval set; it's a problem with the spec, and it's far cheaper to surface in week one than in week thirty.
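One way to surface those disagreements early is to have two SMEs label the same inputs independently and list the cases where their answers differ. The sketch below assumes answers are plain strings; `Label`, `find_disagreements`, and the refund examples are hypothetical names and data made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    case_id: str
    sme: str          # who gave this answer
    answer: str
    rationale: str    # why this answer, not another


def find_disagreements(labels: list[Label]) -> dict[str, list[Label]]:
    """Return case_id -> labels for every case where the SMEs gave different answers."""
    by_case: dict[str, list[Label]] = {}
    for label in labels:
        by_case.setdefault(label.case_id, []).append(label)
    return {
        case_id: group
        for case_id, group in by_case.items()
        if len({lab.answer for lab in group}) > 1
    }


# Illustrative data only.
labels = [
    Label("refund-001", "ops", "approve", "within the return window"),
    Label("refund-001", "finance", "approve", "within the return window"),
    Label("refund-002", "ops", "approve", "long-standing customer"),
    Label("refund-002", "finance", "escalate", "amount exceeds the ops limit"),
]

disputed = find_disagreements(labels)
for case_id, group in disputed.items():
    print(case_id, [(lab.sme, lab.answer) for lab in group])
```

Every case this returns is a question for the spec, not for the model: the two rationales side by side are usually enough to start that conversation.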

Once the eval set exists, three things follow. First, it gets versioned. Every change to the eval set is a deliberate decision recorded in source control with a commit message explaining what changed and why. The eval set you started with is the eval set you can compare to in six months, assuming you didn't quietly water it down to make the chart go up.
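Source control records the history; a content fingerprint stored next to each score makes it obvious when two runs were judged against different versions of the contract. This is a minimal sketch, assuming cases are JSON-serialisable dicts with a unique `id` field; `eval_set_fingerprint` is a hypothetical helper, not part of any library.

```python
import hashlib
import json


def eval_set_fingerprint(cases: list[dict]) -> str:
    """SHA-256 over the cases, independent of list order and of key order."""
    canonical = json.dumps(
        sorted(cases, key=lambda c: c["id"]),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative data only.
v1 = [
    {"id": "a", "input": {"text": "hello"}, "expected": {"label": "greeting"}},
    {"id": "b", "input": {"text": "bye"}, "expected": {"label": "farewell"}},
]
reordered = list(reversed(v1))   # same contract, different file order
watered_down = [v1[0]]           # a case quietly dropped

print(eval_set_fingerprint(v1) == eval_set_fingerprint(reordered))
print(eval_set_fingerprint(v1) == eval_set_fingerprint(watered_down))
```

Reordering the file leaves the fingerprint alone; dropping or editing a case changes it, so a chart that went up after the fingerprint changed deserves a second look.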

Second, it gets run on every change. This is the part teams skip: the eval set lives next to the code, runs in CI, blocks merges that regress on it. Skipping this turns the eval set into a quarterly exercise, which means the metric you cite on slide 11 of the QBR is a snapshot of three months ago.
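A merge gate can be small. The sketch below assumes per-category scores are already available as dicts for the baseline (say, the main branch) and for the current change; `regressions`, `gate`, the zero tolerance, and the numbers are all assumptions for illustration.

```python
import sys

TOLERANCE = 0.0  # assumed policy: any drop in any category blocks the merge


def regressions(baseline: dict[str, float], current: dict[str, float],
                tolerance: float = TOLERANCE) -> list[str]:
    """Describe every category that got worse or went missing."""
    problems = []
    for category, old in sorted(baseline.items()):
        new = current.get(category)
        if new is None:
            problems.append(f"{category}: missing from current run (was {old:.2f})")
        elif new < old - tolerance:
            problems.append(f"{category}: {old:.2f} -> {new:.2f}")
    return problems


def gate(baseline: dict[str, float], current: dict[str, float]) -> int:
    """Return a process exit code: 0 lets the merge through, 1 blocks it."""
    problems = regressions(baseline, current)
    for problem in problems:
        print(f"REGRESSION {problem}", file=sys.stderr)
    return 1 if problems else 0


# Made-up scores: the overall picture improved, one category got worse.
baseline = {"routine": 0.96, "adversarial": 0.70, "ambiguous": 0.55}
current = {"routine": 0.97, "adversarial": 0.64, "ambiguous": 0.55}
exit_code = gate(baseline, current)
```

In a CI job the last line would be `sys.exit(gate(baseline, current))`. Comparing category by category matters: a gain on the routine cases should not be allowed to pay for a loss on the adversarial ones.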

Third, it gets defended. When someone asks 'is the model working?', the answer is the eval set's score, broken down by category, across the most recent run. Not opinion, not anecdote, not a screenshot of a single great example. The eval set is the answer.

A few specifics that keep the discipline honest. Make sure no eval example appears in the training data; this is easy to forget when the same SME annotates both. Include adversarial cases on purpose: the malformed input, the edge of the supported domain, the genuinely ambiguous example where the right answer is 'I don't know'. Track refusal as a measured outcome, not a failure. And keep one or two examples in reserve, never shown to anyone, that you only use to gut-check the eval set itself.
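Two of these specifics are easy to automate. The sketch below assumes text inputs, treats two inputs as "the same" when they match after lower-casing and whitespace collapsing, and represents a refusal as the literal string "I don't know"; `leaked_cases`, `Outcome`, and `classify` are hypothetical names, and real systems will need a looser notion of both overlap and refusal.

```python
from collections import Counter
from enum import Enum

REFUSAL = "I don't know"  # assumed sentinel for a refusal


def normalise(text: str) -> str:
    """Assumed notion of 'same input': case- and whitespace-insensitive."""
    return " ".join(text.lower().split())


def leaked_cases(train_inputs: list[str], eval_inputs: dict[str, str]) -> list[str]:
    """Return ids of eval cases whose input also appears in the training data."""
    seen = {normalise(t) for t in train_inputs}
    return sorted(cid for cid, text in eval_inputs.items() if normalise(text) in seen)


class Outcome(Enum):
    CORRECT = "correct"
    WRONG = "wrong"
    CORRECT_REFUSAL = "correct_refusal"  # refused when refusal was expected
    MISSED_REFUSAL = "missed_refusal"    # answered when it should have refused
    OVER_REFUSAL = "over_refusal"        # refused when an answer was expected


def classify(predicted: str, expected: str) -> Outcome:
    """Sort one prediction into an outcome, keeping refusals separate from errors."""
    if expected == REFUSAL:
        return Outcome.CORRECT_REFUSAL if predicted == REFUSAL else Outcome.MISSED_REFUSAL
    if predicted == REFUSAL:
        return Outcome.OVER_REFUSAL
    return Outcome.CORRECT if predicted == expected else Outcome.WRONG


# Illustrative data only.
train = ["Reset my password", "Where is my order?"]
evals = {"e1": "reset  my PASSWORD", "e2": "Cancel my subscription"}
leaks = leaked_cases(train, evals)

pairs = [("approve", "approve"), (REFUSAL, REFUSAL), ("approve", REFUSAL), (REFUSAL, "deny")]
tally = Counter(classify(predicted, expected) for predicted, expected in pairs)
```

Reporting the five outcomes separately keeps a model that refuses everything from looking safe, and a model that never refuses from looking accurate.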

The version of this argument we'd write for a sceptical reader would say: a model with a small clear eval set is more useful than a model with a large unclear one. The eval set is what survives staff turnover, framework changes, model upgrades, and prompt revisions. Everything else in the system gets rebuilt. The contract is what stays.

The artefact

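# evals/contract.py: an eval set looks like data, behaves like a contract
from dataclasses import dataclass
from datetime import date
from statistics import mean

@dataclass
class EvalCase:
    id: str
    input: dict
    expected: dict       # what 'correct' looks like
    rationale: str       # why this is correct, not another value
    category: str        # for slicing the score
    sme: str             # who signed off on the answer
    added: date          # when, so we can compare runs over time

def score(predicted: dict, expected: dict) -> float:
    return float(predicted == expected)  # placeholder: exact match; swap in your own scorer

class EvalReport(dict):  # placeholder shape: maps category -> mean score
    @classmethod
    def from_results(cls, results):
        categories = sorted({case.category for case, _ in results})
        return cls({k: mean(s for c, s in results if c.category == k) for k in categories})

def run_eval(model, cases: list[EvalCase]) -> EvalReport:
    results = [(c, score(model.predict(c.input), c.expected)) for c in cases]
    return EvalReport.from_results(results)  # by category, never just one number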

An eval case is a row of structured data with provenance: who wrote it, why, when. It's source-controlled and it survives staff turnover.
