Migrating models to v2 without breaking your users

A team trains a new fraud model. Eval scores improve from 0.84 to 0.91 average precision, a real lift. The team deploys it on Tuesday. By Friday, customer complaints have spiked, the operations team can't keep up with manual review, and someone discovers that the new model is flagging twice as many transactions as the old one despite scoring better on the held-out test. By Monday, v2 is rolled back, and the team is asking what went wrong.

What went wrong is that the eval set measured one thing (ranking quality on a labelled holdout, summarised as average precision) and the production environment measured another: the operational impact of the model's decisions. Average precision depends only on how the model orders transactions, not on the absolute scores it assigns. The new model's confidence scores were distributed differently, so the downstream system, with its threshold tuned for v1's distribution, triggered twice as often on v2 even though v2 ranked fraud better.
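The effect is easy to reproduce on a handful of scores. The sketch below is illustrative only: the scores and the 0.70 threshold are made up, and `flag_rate` is a hypothetical helper, not something from the team's codebase.

```python
from typing import Sequence


def flag_rate(scores: Sequence[float], threshold: float) -> float:
    """Fraction of scores at or above the threshold, i.e. sent to manual review."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s >= threshold for s in scores) / len(scores)


# Made-up scores from each model for the same ten transactions.
v1_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.72, 0.90]
v2_scores = [0.10, 0.20, 0.25, 0.35, 0.45, 0.55, 0.71, 0.75, 0.88, 0.97]

THRESHOLD = 0.70  # hypothetical cut-off, tuned against v1's score distribution

v1_rate = flag_rate(v1_scores, THRESHOLD)  # 2 of 10 flagged
v2_rate = flag_rate(v2_scores, THRESHOLD)  # 4 of 10 flagged
print(f"v1 flags {v1_rate:.0%}, v2 flags {v2_rate:.0%}")
```

Same transactions, same threshold, twice the review queue. Nothing in an average-precision number would show this.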

This is the standard model-migration failure. The eval said 'ship'. The production environment said 'this changes things'. The eval set didn't capture what the production environment actually cared about, and there was no intermediate step between training and full traffic.

A model that beats v1 on the eval set is a candidate for v2, not a replacement. The migration from candidate to replacement is its own project. Skipping it is what causes Friday afternoons like the one above.

Four stages of safe migration

The shape of a safe model migration has four stages, each with its own gate.

Stage one: shadow deploy. The new model serves alongside the old one but doesn't influence any decisions. Both models score every request; only the old model's score is used. The new model's predictions are logged for offline comparison. This stage runs for a defined period (usually 1-2 weeks) to surface any production-only failure modes: input distributions the eval set didn't represent, edge cases, infrastructure issues. The gate to advance: zero production-side issues, accuracy on the labelled production sample in line with the eval, and an output distribution that matches what the downstream thresholds and queues were tuned for.
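One way to turn the shadow logs into a gate is sketched below. It assumes the logs can be reduced to (v1 score, v2 score) pairs per request and that decisions come from a single threshold. `ShadowReport`, `summarize_shadow`, `shadow_gate_passes` and both default limits are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ShadowReport:
    n: int
    v1_flag_rate: float
    v2_flag_rate: float
    disagreement_rate: float


def summarize_shadow(pairs: Iterable[Tuple[float, float]], threshold: float) -> ShadowReport:
    """Summarise logged (v1_score, v2_score) pairs, one pair per request."""
    n = v1_flags = v2_flags = disagreements = 0
    for v1_score, v2_score in pairs:
        v1_flag, v2_flag = v1_score >= threshold, v2_score >= threshold
        n += 1
        v1_flags += v1_flag
        v2_flags += v2_flag
        disagreements += v1_flag != v2_flag
    if n == 0:
        raise ValueError("no shadow pairs logged")
    return ShadowReport(n, v1_flags / n, v2_flags / n, disagreements / n)


def shadow_gate_passes(report: ShadowReport,
                       max_flag_ratio: float = 1.2,
                       max_disagreement: float = 0.05) -> bool:
    """Hypothetical gate: v2 may not flag much more than v1, nor disagree with it too often."""
    if report.v1_flag_rate == 0:
        return report.v2_flag_rate == 0
    ratio = report.v2_flag_rate / report.v1_flag_rate
    return ratio <= max_flag_ratio and report.disagreement_rate <= max_disagreement


# Four made-up logged pairs; v2 flags twice as many requests as v1.
report = summarize_shadow([(0.1, 0.2), (0.8, 0.9), (0.3, 0.75), (0.2, 0.1)], threshold=0.7)
gate_ok = shadow_gate_passes(report)  # False: flag ratio is 2.0
```

The limits are placeholders. The point is that the gate is written down and checked against numbers, not judged by eye.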

Stage two: canary. The new model serves a small percentage of real traffic (start at 1%, ramp to 5%, then 10%). The decisions are real; users see the new model's behaviour. Downstream impact is monitored: operations team workload, customer complaints, business metrics that don't appear in the eval. The gate: business metrics flat or better, no operational alarms, customer-side metrics stable.

Stage three: blue-green ramp. Once the canary is clean, ramp to 50%. Then 100%. Each step has a hold time long enough to detect downstream effects (usually 24-48 hours). The gate: stability across the ramp.
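The canary and ramp steps can be written as a small schedule. This is a minimal sketch: the step fractions come from the stages above, while `next_traffic_pct` and the 24-hour default hold are assumptions made for illustration (the text only gives a hold time for the ramp).

```python
from typing import Sequence

# Traffic fractions named in the canary and ramp stages.
RAMP_STEPS = (0.01, 0.05, 0.10, 0.50, 1.00)


def next_traffic_pct(current: float,
                     gate_passed: bool,
                     hours_held: float,
                     min_hold_hours: float = 24.0,
                     steps: Sequence[float] = RAMP_STEPS) -> float:
    """Return the traffic fraction the candidate should receive next."""
    if not gate_passed:
        return 0.0  # any failed gate means rollback, not a pause
    if hours_held < min_hold_hours:
        return current  # keep holding until downstream effects can show up
    for step in steps:
        if step > current:
            return step
    return current  # already at full traffic


print(next_traffic_pct(0.10, gate_passed=True, hours_held=30))   # advances to 0.5
print(next_traffic_pct(0.10, gate_passed=True, hours_held=2))    # stays at 0.1
print(next_traffic_pct(0.50, gate_passed=False, hours_held=48))  # rolls back to 0.0
```

A failed gate goes straight to zero here instead of stepping down one level. That is a design choice; it keeps the response to trouble as simple as the rollback itself.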

Stage four: deprecation. The old model no longer serves any traffic but stays deployed so that rollback remains instant. Only after several weeks of clean production on v2 is v1 actually decommissioned.

The thing that makes this work is the rollback path. At every stage, reverting to the old model is a single config change. If something breaks at canary, the canary gets turned off, not investigated for an hour while the on-call engineer figures out the deploy pipeline. The rollback being a one-liner is what makes the stages safe to take.
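What "a single config change" can look like is sketched below, assuming the canary fraction lives in a small JSON file that the serving process re-reads. The file layout, the `traffic_pct` key and `read_traffic_pct` are assumptions for illustration; a real setup might use a feature-flag service or a config store instead.

```python
import json
import os
import tempfile


def read_traffic_pct(path: str) -> float:
    """Read the canary fraction from a JSON config file.

    Falls back to 0.0 (all traffic on the old model) if the file is missing,
    malformed, or holds an out-of-range value, so a bad config fails safe.
    """
    try:
        with open(path, encoding="utf-8") as f:
            value = float(json.load(f)["traffic_pct"])
    except (OSError, ValueError, KeyError, TypeError):
        return 0.0
    return value if 0.0 <= value <= 1.0 else 0.0


# Demo with a throwaway file standing in for the real config.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "canary.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump({"traffic_pct": 0.05}, f)
    during_canary = read_traffic_pct(config_path)

    with open(config_path, "w", encoding="utf-8") as f:
        json.dump({"traffic_pct": 0.0}, f)  # the rollback: one value changes
    after_rollback = read_traffic_pct(config_path)

    missing_config = read_traffic_pct(os.path.join(tmp, "absent.json"))
```

The fallback matters as much as the happy path: if the config cannot be read, the safe answer is the old model.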

What production catches that offline evals don't

Distribution shift between training and production. The eval set is a snapshot. Production input distributions drift weekly. A model trained six months ago and evaluated on a six-month-old test set might be optimal for a world that no longer exists.
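A common way to put a number on input drift is the population stability index (PSI), which compares how a feature is binned at training time with how it is binned in production. The text does not name a specific metric, so treat this as one possible choice; the bin edges and sample values below are made up.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population stability index of `actual` against `expected`.

    `edges` are sorted bin boundaries; values below the first edge fall in the
    first bin and values at or above the last edge fall in the last bin.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # number of edges passed = bin index
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


training_amounts = [10, 20, 30, 40, 50, 60, 70, 80]      # made-up training snapshot
production_amounts = [v + 30 for v in training_amounts]  # same shape, shifted upward
edges = [25, 50, 75]

no_drift = psi(training_amounts, training_amounts, edges)  # identical samples give 0.0
drift = psi(training_amounts, production_amounts, edges)   # positive, grows with the shift
```

Run per feature on a schedule, a check like this shows the snapshot going stale before the accuracy numbers do.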

Downstream coupling. Models live inside systems that have thresholds, queues, and human reviewers tuned to the old model's behaviour. A new model with different output statistics breaks those couplings, even when its accuracy is higher.
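One way to keep a threshold coupling intact is to re-derive the threshold for the new model so that it flags about the same share of traffic as the old one. The sketch below assumes paired scores from the same sample of requests; `matched_threshold` is a hypothetical helper and the scores are made up.

```python
import math
from typing import Sequence


def matched_threshold(v1_scores: Sequence[float],
                      v2_scores: Sequence[float],
                      v1_threshold: float) -> float:
    """Pick a v2 threshold that flags about as many requests as v1 does at v1_threshold."""
    if not v1_scores or not v2_scores:
        raise ValueError("both score samples must be non-empty")
    target_rate = sum(s >= v1_threshold for s in v1_scores) / len(v1_scores)
    k = round(target_rate * len(v2_scores))  # how many v2 scores should end up flagged
    if k == 0:
        return math.inf  # v1 flags nothing on this sample, so v2 should flag nothing either
    # The k-th highest v2 score; ties at that value can flag slightly more than k.
    return sorted(v2_scores, reverse=True)[k - 1]


v1_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.72, 0.90]
v2_scores = [0.10, 0.20, 0.25, 0.35, 0.45, 0.55, 0.71, 0.75, 0.88, 0.97]

new_threshold = matched_threshold(v1_scores, v2_scores, v1_threshold=0.70)
flagged_by_v2 = sum(s >= new_threshold for s in v2_scores)
```

Matching the flag rate keeps the review queue the size the operations team is staffed for. Whether that is the right operating point for the new model is a separate decision, and one to make deliberately.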

Feedback-loop effects. Models that influence the data they later get retrained on (recommender systems, fraud models that determine which transactions get reviewed) can compound problems in ways that don't appear in offline evaluation.
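The mechanism is simple to show on toy data. In the sketch below, only flagged transactions are reviewed, so only they ever receive a label. The transactions, the `is_fraud` ground truth and `split_by_review` are made up for illustration.

```python
from typing import Dict, List, Tuple

Transaction = Dict[str, object]


def split_by_review(transactions: List[Transaction],
                    threshold: float) -> Tuple[List[Transaction], List[Transaction]]:
    """Split transactions into reviewed (labelled) and unreviewed (never labelled)."""
    reviewed = [t for t in transactions if t["score"] >= threshold]
    unreviewed = [t for t in transactions if t["score"] < threshold]
    return reviewed, unreviewed


transactions = [
    {"id": "t1", "score": 0.95, "is_fraud": True},
    {"id": "t2", "score": 0.80, "is_fraud": False},
    {"id": "t3", "score": 0.40, "is_fraud": True},   # fraud the model scored low
    {"id": "t4", "score": 0.10, "is_fraud": False},
]

reviewed, unreviewed = split_by_review(transactions, threshold=0.70)

# The next training set contains only reviewed rows ...
retraining_ids = [t["id"] for t in reviewed]
# ... so fraud the model missed never shows up as a labelled example.
missed_fraud_ids = [t["id"] for t in unreviewed if t["is_fraud"]]
```

The retrained model learns from t1 and t2 and never sees t3, so the blind spot carries into the next version. An offline eval built from the same reviewed data shares the blind spot.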

Model migrations are operational projects, not modelling projects. The team that ships v2 reliably has a migration playbook, a rollback plan, and a phased deploy. The team without them ships v2 on a Tuesday and rolls back on a Friday, sometimes more than once. The accuracy lift in the slide deck is real. The rest of the work is what determines whether it stays in production.

// The artefact
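# deploy/canary.py: deterministic routing per request id, both predictions logged
# Model, Prediction and log_pair are assumed to come from the surrounding codebase.
import hashlib

def stable_hash(request_id: str) -> int:
    # One possible stable_hash. Python's built-in hash() is salted per process
    # for strings, so use a digest that is identical across processes and hosts.
    return int.from_bytes(hashlib.sha256(request_id.encode("utf-8")).digest()[:8], "big")

class CanaryRouter:
    def __init__(self, primary: Model, candidate: Model, traffic_pct: float):
        self.primary, self.candidate = primary, candidate
        self.traffic_pct = traffic_pct  # fraction of traffic on the candidate, 0.0 → 1.0

    def predict(self, x, request_id: str) -> Prediction:
        # The same request id always lands in the same bucket (1% granularity).
        on_canary = stable_hash(request_id) % 100 < self.traffic_pct * 100
        chosen, other = (self.candidate, self.primary) if on_canary else (self.primary, self.candidate)
        result = chosen.predict(x)
        # Always score with both, log the comparison for offline analysis.
        try:
            shadow = other.predict(x)
        except Exception:
            shadow = None  # a failure in the non-serving model must not fail the request
        log_pair(chosen.name, request_id, result, shadow)
        return result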

# Rollback = set traffic_pct = 0. One config change, no deploy.

Deterministic routing per request, both models scored on every request, rollback is a single config flip. The rollback path being one line is what makes the canary safe.