Time-series cross-validation: rolling, not random

Random k-fold cross-validation is the wrong validation strategy for time series, and most production forecasting failures that get blamed on 'the data changed' are actually 'the validation was wrong all along'. The diagnostic shape: a cross-validation MAPE of 4 percent, then a production MAPE of 12 percent and drifting. The CV gave the model permission to look at the future during training; production didn't.

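The mechanism is well-documented. Scikit-learn's KFold assigns rows to folds without regard to time order: with shuffle=True the partition is random, and even with the default shuffle=False every fold except the last trains on rows that come after its test rows. For tabular data with independent rows this is fine. For time series it is a mistake that quietly invalidates everything you measured.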

Random k-fold mixes past and future. A fold that 'tests' on rows from January 2024 has been 'trained' on rows from February through December 2024. Every model in that setup has access to information that, in production, won't exist when it's making a prediction. The reported accuracy is the accuracy of a model that can see the future. Production models can't. Production accuracy is therefore lower, sometimes catastrophically.
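The leak is easy to measure. The sketch below is a standard-library stand-in for a shuffled k-fold split (the helper names `shuffled_kfold` and `future_leak_fraction` are illustrative, not from any library). For each fold it reports the share of training rows that come after the earliest test row.

```python
import random

def shuffled_kfold(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) for a shuffled k-fold split over row positions."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield sorted(train), sorted(test)

def future_leak_fraction(train_idx, test_idx):
    """Share of training rows that come after the earliest test row."""
    first_test = min(test_idx)
    return sum(1 for j in train_idx if j > first_test) / len(train_idx)

# 120 rows in time order, e.g. ten years of monthly observations (assumed sizes)
leaks = [future_leak_fraction(tr, te) for tr, te in shuffled_kfold(120, k=5)]
print([round(x, 2) for x in leaks])
```

Whichever fold happens to contain the first row of the series is trained entirely on rows from its future, so at least one value in `leaks` is always 1.0.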

The answer is rolling-origin cross-validation, sometimes called walk-forward validation or expanding-window CV. Train on data up to time t. Test on data from t+1 to t+horizon. Roll t forward by one period. Repeat. Average the errors.
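The procedure can be sketched as an index generator. This is a minimal version under stated assumptions: an expanding training window, integer positions standing in for time periods, and an upper bound chosen so that the final origin (whose test window ends on the last observation) is included. The name `rolling_origin_splits` is illustrative.

```python
def rolling_origin_splits(n_obs, horizon, min_train, step=1):
    """Yield (train_idx, test_idx) with every training index before every test index."""
    for t in range(min_train, n_obs - horizon + 1, step):
        yield list(range(0, t)), list(range(t, t + horizon))

for train_idx, test_idx in rolling_origin_splits(n_obs=10, horizon=2, min_train=5):
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
# train 0..4  test 5..6
# train 0..5  test 6..7
# train 0..6  test 7..8
# train 0..7  test 8..9
```

Scikit-learn ships a related splitter, TimeSeriesSplit, which also keeps training rows ahead of test rows; the hand-rolled generator is shown here only because it makes the origin arithmetic explicit.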

This reproduces the deployment pattern: the model is making predictions about a future it hasn't seen yet. Every fold respects causality. Every fold is a small simulation of what the model will actually face in production.

Three traps random k-fold hides

Rolling CV exposes three specific traps that random k-fold hides.

The trend trap. If the underlying series has a trend (sales growing 5% a year), random k-fold will train on rows from years that are similar in level to the test rows. The model never has to extrapolate. Production demands extrapolation; random k-fold validates a problem that's easier than the production problem. The result: cross-validation MAPE looks good, production MAPE drifts.
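A toy illustration of the trend trap, under loud assumptions: the series is synthetic and noise-free, the "model" is a nearest-neighbour lookup on the time index (chosen because it cannot extrapolate), an interleaved 5-fold split is a deterministic stand-in for shuffled k-fold, and the error is mean absolute error rather than MAPE to keep the arithmetic obvious.

```python
def nearest_neighbour_forecast(train_t, train_y, query_t):
    """Predict with the value of the closest training time step (cannot extrapolate)."""
    nearest = min(range(len(train_t)), key=lambda i: abs(train_t[i] - query_t))
    return train_y[nearest]

def mean_abs_error(train_idx, test_idx, y):
    """Mean absolute error of the nearest-neighbour model on one train/test split."""
    train_y = [y[i] for i in train_idx]
    preds = [nearest_neighbour_forecast(train_idx, train_y, t) for t in test_idx]
    return sum(abs(y[t] - p) for t, p in zip(test_idx, preds)) / len(test_idx)

# Synthetic upward trend: 100 time steps, +2 units per step
y = [100 + 2 * t for t in range(100)]

# Interleaved 5-fold: every test row has a training neighbour one step away
kfold_errors = []
for fold in range(5):
    test_idx = list(range(fold, 100, 5))
    train_idx = [t for t in range(100) if t % 5 != fold]
    kfold_errors.append(mean_abs_error(train_idx, test_idx, y))

# Rolling origin: train on steps 0..t-1, test on the next 12 steps
horizon = 12
rolling_errors = []
for t in range(40, 100 - horizon + 1, horizon):
    rolling_errors.append(mean_abs_error(list(range(t)), list(range(t, t + horizon)), y))

kfold_mae = sum(kfold_errors) / len(kfold_errors)
rolling_mae = sum(rolling_errors) / len(rolling_errors)
print(kfold_mae, rolling_mae)  # 2.0 13.0
```

The same model on the same data scores 2.0 when it is allowed to interpolate and 13.0 when it has to forecast 12 steps ahead. Only the second number describes the production task.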

The seasonality trap. Under random k-fold, the training rows for any test row include the same season from other years, and often neighbouring rows from the same season of the same year. A model evaluated on a held-out future quarter has only the seasons the training set already covered to go on. The first measurement answers a question production never asks; the second is the one that matters.

The regime-change trap. Take a 2019-2023 dataset with the COVID shock in 2020. Random k-fold's '2020 fold' is trained on 2019, 2021, 2022, and 2023, which includes the later data that lets the model interpolate around 2020. Rolling CV treats 2020 as out-of-sample, which is what production saw at the time, and reveals how badly the model handles regime changes.

Two practical specifics. First, the test horizon should match the production decision horizon. If you forecast 4 weeks out and act on it, the test fold is 4 weeks. Don't validate on 1-week errors and ship a 4-week-out decision system. Second, have a final out-of-sample test set that nothing else has seen, not even the rolling CV. The rolling CV is the validation; the final holdout is the integrity check before deployment.
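Both specifics fit in a few lines. The sketch below assumes weekly data, a 4-week decision horizon, a 12-week final holdout, and a 52-week minimum training window; all of these numbers, the placeholder series, and the helper name `split_with_final_holdout` are illustrative.

```python
def split_with_final_holdout(series, holdout_size):
    """Reserve the last `holdout_size` observations; nothing upstream should touch them."""
    if not 0 < holdout_size < len(series):
        raise ValueError("holdout_size must be between 1 and len(series) - 1")
    return series[:-holdout_size], series[-holdout_size:]

weekly_sales = list(range(1, 105))  # placeholder values: 104 weeks of data
decision_horizon = 4                # forecasts are acted on 4 weeks out

development, final_holdout = split_with_final_holdout(weekly_sales, holdout_size=12)

# Rolling CV origins live only inside `development`;
# each test window has the same length as the decision horizon
origins = range(52, len(development) - decision_horizon + 1)
folds = [(development[:t], development[t:t + decision_horizon]) for t in origins]

print(len(folds), len(final_holdout))  # 37 12
```

Model selection and tuning use only `folds`. The `final_holdout` slice is scored once, immediately before deployment.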

A diagnostic worth running: every forecasting project should report at least three error numbers - in-sample (training), rolling CV, and final holdout. The gap between rolling CV and final holdout should be small. The gap between in-sample and rolling CV is the model's optimism. The teams that report only one number are usually reporting the most flattering of the three.
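One way to keep all three numbers side by side is a small report helper. This is a sketch: the function name, the dictionary keys, and the example values are illustrative, not measured results.

```python
def error_report(in_sample, rolling_cv, holdout):
    """Summarise the three error numbers and the two gaps between them (same units as inputs)."""
    return {
        "in_sample": in_sample,
        "rolling_cv": rolling_cv,
        "holdout": holdout,
        "optimism": rolling_cv - in_sample,      # how much the training fit flatters the model
        "cv_holdout_gap": holdout - rolling_cv,  # should be small if the rolling CV is honest
    }

# Illustrative MAPE values in percent
report = error_report(in_sample=3.0, rolling_cv=7.5, holdout=8.0)
print(report["optimism"], report["cv_holdout_gap"])  # 4.5 0.5
```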

Random k-fold for time series is a category error. The fix is mechanical and well-known. The reason it doesn't get done is that random k-fold is the default and rolling CV requires extra code.

// The artefact
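# forecast/cv.py: rolling-origin CV - every fold respects causality
import numpy as np

def mape(actual, pred):
    """Mean absolute percentage error, in percent (assumed helper; requires non-zero actuals)."""
    actual, pred = np.asarray(actual, dtype=float), np.asarray(pred, dtype=float)
    return float(np.mean(np.abs((actual - pred) / actual)) * 100)

def rolling_origin_cv(series, model_class, horizon, min_train, step=1):
    """Walk forward through the series, never letting the model see the future."""
    errors = []
    # + 1 so the final origin, whose test window ends on the last observation, is included
    for t in range(min_train, len(series) - horizon + 1, step):
        train = series[:t]               # everything strictly before the origin
        test  = series[t : t + horizon]  # the next `horizon` observations
        model = model_class().fit(train)
        pred  = model.predict(horizon=horizon)
        errors.append(mape(test, pred))
    return np.mean(errors), np.std(errors)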

# Returns the mean and standard deviation of the per-fold MAPE: errors that correspond to what production sees.

Walk-forward through the series. Every train/test split is one production deployment in miniature.
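The artefact expects `model_class` to be constructible with no arguments, to return the fitted model from `fit(train)`, and to accept `predict(horizon=...)`. A minimal class that satisfies that interface is sketched below; the seasonal-naive baseline and the name `SeasonalNaive` are illustrative choices, not part of the original code.

```python
class SeasonalNaive:
    """Forecast by repeating the last full season of the training data."""

    def __init__(self, season_length=12):
        self.season_length = season_length
        self.last_season_ = None

    def fit(self, train):
        if len(train) < self.season_length:
            raise ValueError("need at least one full season of training data")
        self.last_season_ = list(train[-self.season_length:])
        return self  # returning self allows model_class().fit(train) chaining

    def predict(self, horizon):
        return [self.last_season_[i % self.season_length] for i in range(horizon)]

model = SeasonalNaive(season_length=4).fit([10, 20, 30, 40, 11, 21, 31, 41])
print(model.predict(horizon=6))  # [11, 21, 31, 41, 11, 21]
```

Because the CV loop constructs the model with no arguments, it would use the default `season_length=12`, so `min_train` needs to be at least 12 for this particular baseline.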