
Forecasting at the wrong granularity is the silent killer

A retailer asks for a sales forecast. The data scientist trains an excellent model on weekly aggregate data, achieves a 4 percent MAPE on holdout, and presents the result. The buyers ignore it. They keep ordering off intuition.

The forecast is technically correct. It is also useless. The buying decisions are made daily, per SKU, per store. A weekly aggregate forecast at the regional level cannot tell you whether to ship 80 units of an SKU to Store 14 on Wednesday. The model is solving a different problem from the one its users are trying to solve.

This is the most common silent failure of forecasting projects, and it almost never shows up in the validation metrics. The granularity at which the forecast is made decides whether the forecast is decision-grade. The granularity at which the model performs well is rarely the granularity at which the decision is made. Bridging that gap is most of the work.

There is a useful framing. Every forecasting problem comes with two granularities you do not get to choose: the decision granularity (the unit at which the action is taken) and the data granularity (the unit at which the signal is collected). They are almost never the same. The decision granularity is set by the business: per-SKU per-day, per-route per-week, per-customer per-month. The data granularity is set by what is logged. The third, the forecast granularity, is your choice, and that choice has consequences.

If you forecast at the decision granularity directly, you face a signal-to-noise problem. Daily SKU-level demand is dominated by zeros, weekend effects, promotional bursts, and randomness. Most models fit poorly, and you get a high MAPE. The temptation is to aggregate up to weekly category level, where the model is happier. But that is the wrong thing to forecast.
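A toy illustration of that trade-off, using made-up numbers, not real sales: two weeks of daily unit sales for one hypothetical SKU, scored against a flat forecast of two units a day. The `mape` helper here skips zero-sales days, where percentage error is undefined; that is a choice made for this sketch, not a standard.

```python
# Made-up daily unit sales for one hypothetical SKU, two weeks (Mon..Sun).
daily_actuals = [0, 3, 1, 0, 2, 6, 2,
                 1, 0, 4, 2, 0, 5, 3]
daily_forecast = [2.0] * len(daily_actuals)  # flat forecast: 2 units a day


def mape(actuals, forecasts):
    """Mean absolute percentage error over periods with non-zero actuals."""
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    return sum(abs(a - f) / a for a, f in pairs) / len(pairs)


def to_weekly(values):
    """Sum consecutive blocks of seven daily values."""
    return [sum(values[i:i + 7]) for i in range(0, len(values), 7)]


daily_mape = mape(daily_actuals, daily_forecast)
weekly_mape = mape(to_weekly(daily_actuals), to_weekly(daily_forecast))
zero_days = sum(1 for a in daily_actuals if a == 0)

print(f"daily MAPE:  {daily_mape:.1%} ({zero_days} zero-sales days skipped)")
print(f"weekly MAPE: {weekly_mape:.1%}")
```

Same forecast, same data: roughly 44 percent error by day and about 3 percent by week. The weekly score looks far better only because the day-to-day noise has been summed away, and the buyer still has to order by day.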

If you forecast at whatever granularity is natural for the model (usually a coarser one, such as weekly category level), you face a translation problem. Weekly category forecasts have to be disaggregated down to daily SKU level using historical proportions, which embeds last quarter's mix into next quarter's forecast. When the mix changes (a new product, a discontinued line, a category shift), the disaggregation is wrong, even though the top-line forecast was right.
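A minimal sketch of that failure, with made-up numbers and hypothetical helper names (`historical_shares`, `disaggregate`). It is simplified to one quarter and SKU level only: the category forecast is assumed to be exactly right, and a new SKU "D" launches and takes share from the existing three.

```python
def historical_shares(history):
    """Each SKU's share of category units in the reference period."""
    total = sum(history.values())
    return {sku: units / total for sku, units in history.items()}


def disaggregate(category_forecast, shares):
    """Top-down split of a category forecast using fixed shares."""
    return {sku: category_forecast * share for sku, share in shares.items()}


# Made-up numbers: SKU "D" launches next quarter and takes share from the rest.
last_quarter = {"A": 500, "B": 300, "C": 200}
next_quarter_actual = {"A": 400, "B": 250, "C": 150, "D": 200}

category_forecast = 1000.0  # assume the top-line forecast is exactly right
sku_forecast = disaggregate(category_forecast, historical_shares(last_quarter))

top_line_error = category_forecast - sum(next_quarter_actual.values())
sku_errors = {
    sku: sku_forecast.get(sku, 0.0) - actual
    for sku, actual in next_quarter_actual.items()
}

print("top-line error:", top_line_error)
for sku, err in sku_errors.items():
    print(f"SKU {sku}: forecast minus actual = {err:+.0f}")
```

The category total is off by zero units, yet every SKU is wrong: the three existing SKUs are over-forecast and the new one gets no forecast at all. Nothing in the top-line metric would reveal that.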

The honest answer

The answer is hierarchical reconciliation: forecast at multiple levels, then reconcile so the forecasts agree. The reconciliation pass surfaces disagreement. When the per-SKU forecasts sum to a number that disagrees with the per-category forecast, that's a signal - usually a signal that one of the levels has a bias the other doesn't, and your job is to figure out which to trust. There are well-documented methods (top-down, bottom-up, OLS reconciliation, MinT). The point is to do it deliberately, not to forecast at one level and disaggregate by hand.
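As a rough sketch of what a reconciliation pass does, here is the unweighted (OLS) variant for the simplest possible hierarchy: one category total and its SKUs. The function name `ols_reconcile` and the numbers are illustrative assumptions. MinT differs in that it weights the adjustment by an estimate of the forecast-error covariance, which is not shown here.

```python
def ols_reconcile(total_forecast, child_forecasts):
    """OLS reconciliation for a two-level hierarchy (one total, n children).

    Finds the coherent forecasts closest, in unweighted least squares, to the
    base forecasts. For this simple hierarchy that has a closed form: the gap
    between the total and the sum of the children is spread equally over all
    n + 1 series.
    """
    n = len(child_forecasts)
    gap = total_forecast - sum(child_forecasts.values())
    share = gap / (n + 1)
    children = {name: value + share for name, value in child_forecasts.items()}
    return total_forecast - share, children, gap


# Made-up base forecasts: the category model says 100, the SKU models sum to 90.
category_base = 100.0
sku_base = {"A": 40.0, "B": 30.0, "C": 20.0}

category_rec, sku_rec, gap = ols_reconcile(category_base, sku_base)

print("disagreement before reconciliation:", gap)
print("reconciled category:", category_rec)
print("reconciled SKUs:", sku_rec)
```

The gap of 10 units is the disagreement worth investigating. The equal split is simply what unweighted least squares does; it is not a judgement about which level deserves more trust.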

Three forms of the granularity trap

There are three concrete forms of the granularity trap to look for.

First, the holiday smear. A forecast trained on data that includes Black Friday will average that bump across all of November unless you explicitly handle the holiday as a feature or as a separate model. The MAPE on the rest of the month looks fine. The order quantities for that one week are catastrophically wrong.

Second, the sparse-SKU collapse. Long-tail SKUs that move three units a month are unforecastable per-SKU, no matter the model. The right move is usually to forecast at category level and accept manual override for the long tail, not to keep training a hierarchical model that's hallucinating.

Third, the cross-aggregation paradox. Sometimes the forecast is excellent at SKU-day level and bad at category-week level (because individual errors are correlated and add up instead of cancelling). Sometimes it's the reverse. You only know which by computing both.

The validation discipline that catches this is straightforward and almost never done. For every forecast, compute accuracy at three granularities: the model's natural one, the decision granularity, and a level above the decision granularity. Report all three. The user should know which one to read. If the answer to 'is it accurate?' is 'at what level?', you're being honest. If the answer is one number, you're hiding which level the model actually fits.

Forecasts are not just numbers; they are tools that fit a decision. Match the granularity to the decision and the conversation about whether the forecast is good becomes a conversation about whether the decision is right. That is the conversation worth having.

// The artefact

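# forecast/eval.py: report accuracy at every level the decision touches
import pandas as pd

def mape(actual: pd.Series, forecast: pd.Series) -> float:
    """Mean absolute percentage error; zero actuals are skipped (MAPE is undefined there)."""
    actual, forecast = actual.align(forecast, join="inner")  # match rows by group key
    nonzero = actual != 0
    return float(((actual - forecast).abs()[nonzero] / actual[nonzero].abs()).mean())

def multi_level_mape(actuals: pd.DataFrame, forecasts: pd.DataFrame) -> dict:
    # Both frames need sku, category, date and iso_week columns.
    levels = {
        "sku-day":      ["sku", "date"],          # decision granularity
        "category-day": ["category", "date"],     # one level up
        "category-week":["category", "iso_week"], # the model's happy place
    }
    return {
        name: mape(actuals.groupby(keys).y.sum(), forecasts.groupby(keys).y_hat.sum())
        for name, keys in levels.items()
    }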

MAPE at three levels. If the headline number is one of them, the user should know which one - and which they actually need.
