Synthetic data: when it works and when it traps

Synthetic data has real uses and real failure modes, and the difference between the two determines whether the project succeeds. The most expensive trap: a team with 5,000 labelled documents and a target of 50,000 spends three weeks generating LLM-augmented variants, trains a model on the result, watches the eval scores look great, ships, and discovers production accuracy is worse than a baseline trained on the original 5,000.

This is one of the standard outcomes of naive synthetic data generation, and it's almost never anticipated. The model trained on synthetic data learned the LLM's biases, not the original task. The eval set, also generated synthetically, validated those biases. Production data didn't share them.

Synthetic data is a real tool, with failure modes that are easy to walk into. Put plainly: it relieves some kinds of label shortage and makes others worse.

When synthetic data helps

Three contexts where synthetic data clearly helps.

Augmenting rare classes. If the real data has 100 fraud cases and 100,000 non-fraud cases, generating plausible synthetic fraud examples (or oversampling the real ones with light perturbation) is a known-good technique. The synthetic data lives next to the real data; the model sees both. The eval set is real data only.
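As a minimal sketch of the "oversampling with light perturbation" option, assuming numeric feature rows: the function name, the `noise_scale` parameter, and the Gaussian jitter are illustrative choices, not a prescribed method. It copies real minority rows, adds small noise, and returns a flag per row so synthetic rows can be kept out of the eval set.

```python
import random


def oversample_with_jitter(rows, labels, minority_label, target_count,
                           noise_scale=0.01, seed=0):
    """Pad the minority class up to target_count with jittered copies.

    rows: list of lists of floats. Returns (rows, labels, is_synthetic),
    where is_synthetic[i] is True for every generated row.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    if not minority:
        raise ValueError("no real minority examples to perturb")

    out_rows = [list(r) for r in rows]
    out_labels = list(labels)
    is_synthetic = [False] * len(out_rows)

    for _ in range(max(0, target_count - len(minority))):
        base = rng.choice(minority)
        # Noise is scaled to each feature's magnitude (1.0 for zero values).
        jittered = [x + rng.gauss(0.0, noise_scale * (abs(x) or 1.0))
                    for x in base]
        out_rows.append(jittered)
        out_labels.append(minority_label)
        is_synthetic.append(True)
    return out_rows, out_labels, is_synthetic
```

The `is_synthetic` flags are the part that matters most: they let the training set mix both kinds of row while the eval split is filtered down to real rows only.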

Privacy-preserving training. When the original data can't leave a regulated environment, a generator trained inside it to match the data's distribution can produce synthetic records that can be exported. There's a real literature here (PATE, differential privacy, GAN-based generation). The trade-off is that any model trained on the exported data is downstream of the generator's fidelity to the real data, and you have to track that.

Edge case generation for testing. Hard-to-find cases (the unusual document layout, the rare query, the adversarial input) can be generated to fill out the test set. These cases live in the eval set, not the training set, and their job is to surface failures, not to train the model.

When synthetic data traps you

Three contexts where synthetic data traps you.

Substituting for real labels in the central use case. If the production task is 'classify customer support emails' and the labels are scarce, generating 50,000 synthetic emails with an LLM teaches the model to be good at LLM-style emails, not customer ones. The distribution shift is real, large, and silent.
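One way to make that shift less silent is to compare the synthetic text against the real text before training. The sketch below uses Jensen-Shannon divergence over whitespace tokens. That technique is chosen here purely for illustration, and the function names are hypothetical. It is a crude check: a high value is a warning, but a low value only says the word frequencies look alike, not that the distributions match.

```python
import math
from collections import Counter


def token_distribution(texts):
    """Relative frequency of lowercased whitespace tokens across texts."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = disjoint."""
    vocab = set(p) | set(q)
    # m is the midpoint distribution between p and q.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

A reasonable reference point, under the same assumptions, is the divergence between two random halves of the real data: if real-versus-synthetic is far above real-versus-real, the generator is writing in its own register.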

Drawing the eval set from the same source as the training set. If you generate both training and eval data from the same LLM, the eval is contaminated. The model passes the eval and fails on real data. The fix: the eval set is always real data, no exceptions.
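The "eval is always real" rule is easiest to keep when it is enforced in the split itself. A minimal sketch, assuming every example carries a provenance tag: the `Example` class, its `source` field, and `split_with_real_eval` are hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    label: str
    source: str  # assumed provenance tag: "real" or "synthetic"


def split_with_real_eval(examples, eval_fraction=0.2, seed=0):
    """Hold out a share of the real examples for eval.

    Synthetic examples are only ever placed in the training set.
    """
    real = [e for e in examples if e.source == "real"]
    synthetic = [e for e in examples if e.source != "real"]

    random.Random(seed).shuffle(real)
    n_eval = int(len(real) * eval_fraction)
    if n_eval == 0:
        raise ValueError("not enough real examples to build an eval set")

    eval_set = real[:n_eval]
    train_set = real[n_eval:] + synthetic
    return train_set, eval_set
```

If the synthetic examples are variants of real documents, this sketch is not enough on its own: the held-out real documents should be chosen before generation so that no variant of an eval document ends up in training.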

Closing the loop on the LLM. If the LLM that generates synthetic data is also part of the production system (RAG with the same model, agent with the same model), you've created a feedback loop where the model is being fine-tuned on its own output. Quality drift is rapid and hard to detect.

Another way to think about it: synthetic data should expand the variance of the training set, not substitute for the centre of mass. The centre of mass is real labels. The variance is where synthetic helps: edge cases, rare combinations, distribution coverage. Generate synthetic data to cover the corners; don't generate it to fill the middle.

A second useful framing: every synthetic-data project should have a real-data control. Train one model on real-only, one on real-plus-synthetic, and compare on real eval data. If the synthetic version doesn't beat the real-only baseline, you've spent compute generating noise. Half the time, this control is the surprise of the project.

Synthetic data is a complement to real data, not a substitute. The teams that use it well treat it as augmentation: a way to expose the model to variation it would have eventually seen anyway. The teams that use it badly treat it as label-printer: a way to skip the unglamorous work of getting real labels. The first works. The second usually doesn't, and it fails late, in production, where the diagnosis is hardest.

// The artefact
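# data/synthetic_control.py: train both, compare on real eval - or you don't know
from typing import Any, Callable, Sequence


def evaluate_synthetic_lift(
    real_train: Sequence[Any],
    synth_train: Sequence[Any],
    real_eval: Sequence[Any],
    train: Callable[[list], Any],             # project-supplied: examples -> model
    score: Callable[[Any, Sequence], float],  # project-supplied: higher is better
) -> dict:
    # Same training routine for both arms; only the data differs.
    real_only = train(list(real_train))
    real_plus_synth = train(list(real_train) + list(synth_train))
    # Score each model once, on real eval data only.
    real_only_score = score(real_only, real_eval)
    real_plus_synth_score = score(real_plus_synth, real_eval)
    return {
        "real_only_score": real_only_score,
        "real_plus_synth": real_plus_synth_score,
        "lift": real_plus_synth_score - real_only_score,
    }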

# If the lift is negative or near-zero, the synthetic data isn't helping.
# Stop generating more of it; spend the compute on real labels.
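Whether a lift counts as "near-zero" depends on how much the score moves from run to run. One way to estimate that, sketched under assumptions: repeat the control across several seeds and look at the spread. Here `train(examples, seed)` and `score(model, eval_examples)` are hypothetical project-supplied callables, and the choice of five seeds is arbitrary.

```python
import statistics


def lift_across_seeds(real_train, synth_train, real_eval, train, score,
                      seeds=(0, 1, 2, 3, 4)):
    """Run the real-only control once per seed and summarise the lift."""
    lifts = []
    for seed in seeds:
        real_only = train(list(real_train), seed)
        real_plus_synth = train(list(real_train) + list(synth_train), seed)
        lifts.append(score(real_plus_synth, real_eval)
                     - score(real_only, real_eval))
    return {
        "mean_lift": statistics.mean(lifts),
        # stdev needs at least two runs; report 0.0 for a single seed.
        "stdev_lift": statistics.stdev(lifts) if len(lifts) > 1 else 0.0,
        "lifts": lifts,
    }
```

A mean lift that is small compared with its spread across seeds is hard to distinguish from noise, which is the same verdict as a near-zero lift.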

A real-only control next to every synthetic-augmented model. Half the time the synthetic data turns out to be noise dressed up as signal.