Prompt regression is real (and the only defence is a regression test)

What happens when you change a prompt to fix one specific case? Sometimes the fix lands. More often, a different category of input now breaks - a routine status question now classified as a refund request, a polite query now answered curtly - and the engineer who made the change doesn't see it for days. The original case is fixed; the new failures look like noise until they pile up.

This is prompt regression. Every prompt change has effects beyond the case it was intended to fix. The effects are non-deterministic, distributed across the input space, and invisible to the engineer making the change. The discipline that catches them is identical to the discipline of software regression testing, except almost no team actually runs it for prompts.

Prompts are code, but they behave differently from conventional code in three specific ways. Their outputs are non-deterministic, so identical inputs can produce different outputs across runs. Their failure modes are continuous, so a prompt can go from 92% correct to 88% correct with no exception thrown. And they depend implicitly on the underlying model, so the same prompt can behave differently after a model upgrade. None of these properties makes regression testing impossible. They make it more important.
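Because outputs vary between runs, one way to score a single case is to sample it several times and record the fraction that passes. This is a minimal sketch of that idea, not something the workflow here prescribes: `run_once` is a hypothetical callable that calls the model once and returns whether the output passed, and the default of 5 samples is arbitrary.

```python
from typing import Callable


def case_pass_rate(run_once: Callable[[], bool], samples: int = 5) -> float:
    """Run one regression case several times and return the fraction of passes.

    `run_once` is an assumed callable: it should call the model once for this
    case and return True if the output passed its checks.
    """
    if samples < 1:
        raise ValueError("samples must be at least 1")
    passes = sum(1 for _ in range(samples) if run_once())
    return passes / samples
```

A case that passes 3 times out of 4 is then recorded as 0.75 instead of a coin-flip pass or fail.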

The shape of a regression test

Prompt regression testing has four parts: a labelled set, and three triggers for running it.

The labelled set has 100-300 examples covering the input distribution. Each example has a known correct response (or, for open-ended outputs, a known correct property - refuses correctly, stays on topic, returns valid JSON, contains specific facts). The set is versioned in source control. It has examples from real production traffic, hand-labelled, with the boring 80% and the awkward 20%.
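A labelled set like this could be stored as one JSON object per line in a file kept under source control. The sketch below assumes that layout; the field names (`id`, `category`, `input`, `expected`, `properties`) are illustrative choices, not a fixed schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LabelledExample:
    id: str
    category: str
    input: dict
    expected: Optional[str] = None        # exact correct response, when one exists
    properties: tuple = ()                # properties to check for open-ended outputs


def parse_labelled_set(jsonl_text: str) -> list:
    """Parse JSON Lines text into examples, rejecting duplicates and unlabelled rows."""
    examples = []
    seen_ids = set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        if row["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {row['id']!r}")
        seen_ids.add(row["id"])
        example = LabelledExample(
            id=row["id"],
            category=row["category"],
            input=row["input"],
            expected=row.get("expected"),
            properties=tuple(row.get("properties", [])),
        )
        # Every example needs either a correct response or a property to check.
        if example.expected is None and not example.properties:
            raise ValueError(f"line {line_no}: example {example.id!r} has no label")
        examples.append(example)
    return examples
```

Validating at load time means a malformed or unlabelled example fails loudly before any model call is made.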

The test runs on every prompt change. CI runs the prompt against the labelled set. Failures over a threshold block the merge. Pass rates trend over time. The trend is the metric that matters; a single bad run is noise, three bad runs in a week is a signal.
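The merge gate can be a small pure function that CI calls after the run. This is a sketch under assumptions: the floor of 85% and the allowed drop of 3 points are placeholder thresholds, and `baseline` stands for the pass rates recorded on the main branch.

```python
def merge_blockers(
    pass_rates: dict,
    baseline: dict,
    min_rate: float = 0.85,   # assumed absolute floor per category
    max_drop: float = 0.03,   # assumed tolerated drop against the baseline
) -> list:
    """Return human-readable reasons to block the merge; an empty list means pass."""
    reasons = []
    for category, rate in sorted(pass_rates.items()):
        if rate < min_rate:
            reasons.append(f"{category}: {rate:.0%} is below the floor of {min_rate:.0%}")
        previous = baseline.get(category)
        if previous is not None and previous - rate > max_drop:
            reasons.append(f"{category}: dropped from {previous:.0%} to {rate:.0%}")
    return reasons
```

A CI step would print the reasons and exit non-zero when the list is non-empty, which is what blocks the merge.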

The test runs on every model upgrade. Provider models change. The prompt that worked on the previous version may behave subtly differently on the new one. The regression test is what catches the behaviour change before the user does.
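For a model upgrade, the useful comparison is case by case: which examples passed on the old version and fail on the new one. The sketch below assumes you already have two hypothetical dictionaries, `old` and `new`, mapping case ids to pass/fail for the same prompt under each model version.

```python
def behaviour_changes(old: dict, new: dict) -> dict:
    """Compare per-case pass/fail results from two model versions.

    Only cases present in both runs are compared.
    """
    shared = old.keys() & new.keys()
    return {
        "newly_failing": sorted(i for i in shared if old[i] and not new[i]),
        "newly_passing": sorted(i for i in shared if not old[i] and new[i]),
    }
```

The `newly_failing` list is the part to read before switching versions, even when the aggregate pass rate looks unchanged.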

The test runs on a schedule. Even without deliberate changes to the prompt or the model, outputs can drift slowly, for example when a provider updates the model behind an unversioned alias. Run the regression weekly against production. The trend over 90 days will show whether the prompt is ageing.
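The "one bad run is noise, three in a week is a signal" rule can be checked mechanically over the run history. A minimal sketch, assuming each run is stored as a `(date, pass_rate)` pair and that "bad" means below a threshold you choose:

```python
from datetime import date, timedelta


def is_signal(runs: list, threshold: float, window_days: int = 7, min_bad: int = 3) -> bool:
    """Return True if at least `min_bad` runs below `threshold` fall within `window_days`.

    `runs` is a list of (date, pass_rate) tuples in any order.
    """
    bad_days = sorted(day for day, rate in runs if rate < threshold)
    # Slide over every group of `min_bad` consecutive bad runs and check its span.
    for i in range(len(bad_days) - min_bad + 1):
        span = bad_days[i + min_bad - 1] - bad_days[i]
        if span < timedelta(days=window_days):
            return True
    return False
```

The defaults encode the three-in-a-week rule of thumb; the threshold itself is left to the caller.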

Three specifics that work

Three specifics that make the discipline practical.

Use LLM-as-judge for open-ended outputs, but with a labelled rubric. For each example, the judge is asked specific questions: did the response answer the user's question, did it cite the right policy, did it refuse appropriately. The rubric is more reliable than asking the judge 'is this answer good'.
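A rubric-driven judge might look like the sketch below: one narrow yes/no question per call instead of a single "is this good" prompt. The rubric keys, the prompt wording, and `ask_judge` (a callable that sends a prompt to the judge model and returns its text) are all assumptions for illustration.

```python
from typing import Callable

# Illustrative rubric; the three questions mirror the ones named above.
RUBRIC = {
    "answers_question": "Does the response answer the user's question?",
    "cites_right_policy": "Does the response cite the correct policy?",
    "refuses_appropriately": "Does the response refuse only when it should?",
}


def judge_with_rubric(
    user_input: str,
    response: str,
    ask_judge: Callable[[str], str],
    rubric: dict = RUBRIC,
) -> dict:
    """Ask the judge one specific yes/no question per rubric item."""
    verdicts = {}
    for key, question in rubric.items():
        prompt = (
            f"User input:\n{user_input}\n\n"
            f"Response:\n{response}\n\n"
            f"Question: {question}\n"
            "Answer with exactly YES or NO."
        )
        answer = ask_judge(prompt).strip().upper()
        # Anything other than a clear YES counts as a failed rubric item.
        verdicts[key] = answer.startswith("YES")
    return verdicts
```

Keeping each verdict separate also shows which rubric item failed, not just that something did.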

Track per-category pass rates, not just the overall number. The overall number can stay flat while one category silently regresses. If the 'refund inquiries' pass rate drops from 95% to 80% while 'status inquiries' rises from 88% to 95%, and status inquiries outnumber refund inquiries roughly two to one, the overall number barely moves and the deployment is still broken.
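The arithmetic is easy to check. The sketch below uses the two pass rates from the example and an assumed traffic mix of 70 refund inquiries to 150 status inquiries, chosen because it makes the two shifts cancel exactly.

```python
def overall_pass_rate(rates: dict, counts: dict) -> float:
    """Weighted average of per-category pass rates, weighted by example count."""
    total = sum(counts.values())
    return sum(rates[cat] * counts[cat] for cat in counts) / total


# Assumed mix of examples per category (not from any real dataset).
counts = {"refund inquiries": 70, "status inquiries": 150}
before = {"refund inquiries": 0.95, "status inquiries": 0.88}
after = {"refund inquiries": 0.80, "status inquiries": 0.95}

# Both lines print 90.2%: the headline number hides a 15-point drop.
print(f"{overall_pass_rate(before, counts):.1%}")
print(f"{overall_pass_rate(after, counts):.1%}")
```

With a different mix the overall number would move a little, but rarely by enough to reveal a 15-point drop in the smaller category.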

Snapshot the bad cases. When the regression detects a failure, save the full input/output/expected so the engineer can inspect it directly. Don't just report a number. The number tells you the prompt got worse. The snapshot tells you why.
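A snapshot can be as simple as one JSON line per failing case, written next to the run's results. This sketch assumes a small record type of its own; the field names are illustrative.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FailureSnapshot:
    case_id: str
    category: str
    input: dict
    output: str          # what the model actually returned
    expected: str        # the labelled answer or property description
    failed_checks: list  # which rubric items or assertions failed


def dump_snapshots(failures: list) -> str:
    """Serialise failures as JSON Lines so each case can be inspected or diffed."""
    return "\n".join(json.dumps(asdict(f), sort_keys=True) for f in failures)
```

Writing the returned string to a file that CI uploads as a build artefact gives the engineer the failing inputs and outputs without re-running anything.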

A sanity check worth running: every prompt change PR should include the regression test results in the PR description. If the PR shows 'regression: 91 → 89 (-2 pts)', the reviewer can decide whether the targeted improvement is worth the regression. If the PR shows nothing, the reviewer is approving a change with unknown effects.
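The PR summary can be generated rather than typed by hand. A minimal sketch, assuming `before` and `after` are dictionaries of pass rates keyed by a name such as "overall" or a category; the output format is one possible choice.

```python
def pr_report(before: dict, after: dict) -> str:
    """Render one line per key showing old rate, new rate and the change in points."""
    lines = []
    for name in sorted(after):
        new = after[name]
        old = before.get(name)
        if old is None:
            lines.append(f"{name}: {new:.0%} (no baseline)")
            continue
        delta = round((new - old) * 100)  # change in percentage points
        lines.append(f"{name}: {old:.0%} → {new:.0%} ({delta:+d} pts)")
    return "\n".join(lines)
```

Reporting the change in percentage points avoids the ambiguity of a relative percentage, and listing each category alongside the overall line keeps a segment-level drop visible to the reviewer.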

Prompts without regression tests are unmonitored production code. They might be working. They might have been silently regressing for months. Without the test, you don't know which, and you find out only when a customer escalates.

// The artefact
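# prompts/regression.py: every prompt PR runs against the labelled set
from collections import defaultdict
from dataclasses import dataclass

# call_llm(prompt, input) and judge(output, rubric_item) are project-specific
# helpers that are not shown here.

@dataclass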
class RegressionCase:
    id: str
    input: dict
    category: str
    must_pass: list[str]   # rubric items checked by an LLM-as-judge

def run_regression(prompt: str, cases: list[RegressionCase]) -> dict[str, float]:
    by_cat = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for c in cases:
        out = call_llm(prompt, c.input)
        passed = all(judge(out, item) for item in c.must_pass)
        by_cat[c.category][0] += int(passed)
        by_cat[c.category][1] += 1
    # Pass rate per category - the overall number hides regressions in one segment.
    return {cat: hits / total for cat, (hits, total) in by_cat.items()}

Per-category pass rates, not one headline number. The overall score can stay flat while one segment silently breaks.