Before you fine-tune, three things to try first

Fine-tuning is the wrong reflex when a feature on top of an LLM produces the wrong answer too often. The instinct is to retrain on the failure cases; the upstream interventions that would solve the problem more cheaply tend to get skipped.

Fine-tuning is expensive, introduces versioning complexity, and locks you into a specific base model at a specific point in time. A year from now, when the underlying model has improved markedly, you can't trivially take advantage of that improvement; you have to redo the fine-tune on the new base. You've also added an evaluation burden, because you now have to prove, in every regression run, that the fine-tuned version beats the base model on every task you care about. None of this is insurmountable. All of it is cost that didn't exist before you fine-tuned.

What to try first

Most of the time, the right answer is to look upstream. Three interventions, in order.

Prompting. Specifically, careful prompting with few-shot examples, chain-of-thought instructions, and explicit output format specifications. A five-hundred-word system prompt with three worked examples consistently outperforms a fine-tune on the same problem for most tasks that don't require specialised knowledge. The iteration loop is fifteen minutes per change, not hours. The cost per experiment is near zero. And the result generalises across base models: a well-crafted prompt works on the current-generation model and on the next one, too.

The common objection: 'we tried prompting, it didn't work.' Usually what this means is that the team tried three prompts over an afternoon. Serious prompt engineering involves twenty to fifty iterations, a held-out evaluation set, and explicit reasoning about which inputs produce which kinds of errors. If that work hasn't been done, 'prompting didn't work' is not yet true.
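The evaluation loop described above can be sketched in a few lines. This is a minimal illustration, assuming a classification task: `EvalCase`, `evaluate`, and the `category` tag are hypothetical names, and `stub_classifier` stands in for a real prompted model call.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    text: str
    expected: str
    category: str  # hypothetical tag for slicing errors, e.g. "negation"


def evaluate(classify: Callable[[str], str], cases: list[EvalCase]) -> tuple[float, Counter]:
    """Return overall accuracy and a count of failures per input category."""
    failures: Counter = Counter()
    correct = 0
    for case in cases:
        if classify(case.text).strip().lower() == case.expected.lower():
            correct += 1
        else:
            failures[case.category] += 1
    accuracy = correct / len(cases) if cases else 0.0
    return accuracy, failures


# Stub standing in for a prompted model call; replace with a real client.
def stub_classifier(text: str) -> str:
    return "negative" if "bad" in text else "positive"


# Illustrative held-out cases, not real data.
held_out = [
    EvalCase("great product", "positive", "plain"),
    EvalCase("bad service", "negative", "plain"),
    EvalCase("not bad at all", "positive", "negation"),
    EvalCase("not great", "negative", "negation"),
]

accuracy, failures = evaluate(stub_classifier, held_out)
print(f"accuracy={accuracy:.2f}", dict(failures))
```

Breaking failures down by category is what turns 'the prompt is wrong half the time' into 'the prompt fails on negation', which is something the next iteration can target.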

Retrieval. If the model is producing answers that are factually wrong, the problem is usually not that the model lacks the capability to reason about the domain; it's that the model doesn't have access to the specific facts it needs. Fine-tuning to teach the model facts is both inefficient (you're using compute to memorise what a database could store) and stale (the facts go out of date immediately). Retrieval-augmented generation solves this by pulling relevant context into the prompt at inference time. The model doesn't need to know your company's policy on returns; it needs to have the policy in its context window when answering the question.

Retrieval done well requires investment: a decent embedding model, a well-structured corpus, and a competent query-to-retrieval pipeline. Retrieval done poorly is worse than no retrieval, because the model uses wrong context confidently. But well-done retrieval is the single highest-leverage intervention for most enterprise LLM use cases, and almost always beats fine-tuning on the same task.
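To make the shape of the pipeline concrete, here is a toy sketch. It ranks passages by plain word overlap, which is a deliberate stand-in for an embedding model and not a recommendation; `retrieve`, `build_prompt`, and the three-line corpus are all hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by word overlap with the query (a toy stand-in for embedding search)."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(p)), p) for p in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Drop passages with no overlap so unrelated text is not passed to the model as context.
    return [passage for score, passage in ranked[:k] if score > 0]


def build_prompt(question: str, passages: list[str]) -> str:
    context = "\n".join(f"- {p}" for p in passages) or "- (no relevant passage found)"
    return (
        "Answer using only the context below. If the context does not contain "
        "the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical corpus for illustration only.
corpus = [
    "Returns are accepted within 30 days of delivery with a receipt.",
    "Standard shipping takes three to five business days.",
    "Gift cards cannot be exchanged for cash.",
]

question = "What is the policy on returns?"
passages = retrieve(question, corpus, k=1)
print(build_prompt(question, passages))
```

Filtering out zero-score passages is one small guard against the failure mode above: handing the model irrelevant context that it will then use confidently.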

Decomposition. If the task you're asking the model to do is complex, break it into smaller steps. A chain of three small prompts ('extract the entities in this text', 'for each entity, classify the sentiment', 'summarise the aggregate sentiment') will routinely outperform one complex prompt that tries to do all three at once. The model's failures at each individual step are correctable and inspectable. The model's failures on the composite step are a mystery.

This matters because the natural instinct when prompting doesn't work is to make the prompt more detailed. Detail does sometimes help. But more often, detail is a proxy for a task that should have been decomposed. The thirty-line prompt trying to do seven things at once is almost always worse than seven small prompts doing one thing each.
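The three-step chain from the example above might look like the following. It is a sketch under assumptions: `generate` is any function from prompt string to completion string, the prompt wording is illustrative, and `fake_generate` returns canned answers so the code runs without a model.

```python
from typing import Callable

# `generate` is any function mapping a prompt string to a completion string;
# it is an assumption standing in for a real model client.
Generate = Callable[[str], str]


def extract_entities(text: str, generate: Generate) -> list[str]:
    raw = generate(f"List the entities in this text, one per line.\n\nText: {text}")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def classify_sentiment(text: str, entity: str, generate: Generate) -> str:
    prompt = (
        "Classify the sentiment towards the entity as positive, negative, or neutral.\n\n"
        f"Text: {text}\nEntity: {entity}\nSentiment:"
    )
    return generate(prompt).strip().lower()


def summarise(sentiments: dict[str, str], generate: Generate) -> str:
    lines = "\n".join(f"{entity}: {label}" for entity, label in sentiments.items())
    return generate(f"Summarise the aggregate sentiment in one sentence.\n\n{lines}").strip()


def analyse(text: str, generate: Generate) -> tuple[dict[str, str], str]:
    """Run the three steps in sequence, keeping the intermediate results for inspection."""
    entities = extract_entities(text, generate)
    sentiments = {e: classify_sentiment(text, e, generate) for e in entities}
    return sentiments, summarise(sentiments, generate)


# Canned stub so the sketch runs without a model; replace with a real call.
def fake_generate(prompt: str) -> str:
    if prompt.startswith("List the entities"):
        return "battery\nscreen"
    if prompt.startswith("Classify the sentiment"):
        return "negative" if "Entity: battery" in prompt else "positive"
    return "Mixed: the screen is liked, the battery is not."


sentiments, summary = analyse("The screen is lovely but the battery dies fast.", fake_generate)
print(sentiments)
print(summary)
```

Because each step returns a plain value, a wrong summary can be traced to a missed entity or a misclassified sentiment instead of being debugged as one opaque output.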

When fine-tuning is actually right

Three signals indicate that fine-tuning is the right answer. You need output in a very specific format, consistently, and prompting produces the format ninety percent of the time but not ninety-nine. You need to reduce inference cost or latency for a workload you've already made work with prompting, and the cost of a smaller fine-tuned model is less than the cost of calling a larger model repeatedly. Or you have a lot of high-quality labelled examples of the exact task, and the base model's performance on that task is demonstrably limited by capability rather than context or instructions.
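The first signal is only a signal if you have measured it. A minimal sketch of that measurement, assuming the target format is a JSON object with a fixed set of keys; `is_valid`, `compliance_rate`, and the sample outputs are hypothetical and not real model results.

```python
import json


def is_valid(output: str, required_keys: set[str]) -> bool:
    """True if the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def compliance_rate(outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of outputs that satisfy the format check."""
    if not outputs:
        return 0.0
    return sum(is_valid(o, required_keys) for o in outputs) / len(outputs)


# Illustrative outputs, not real model results.
sample_outputs = [
    '{"label": "positive", "confidence": 0.9}',
    '{"label": "negative", "confidence": 0.7}',
    'Sure! Here is the JSON: {"label": "positive"}',
    '{"label": "neutral"}',
]

rate = compliance_rate(sample_outputs, {"label", "confidence"})
print(f"format compliance: {rate:.0%}")
```

Run over a few hundred held-out inputs, a number like this shows whether prompting has plateaued short of the consistency you need, which is the point at which the first signal applies.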

Most teams reach for fine-tuning before any of these signals are present. The result is a substantial engineering investment in solving a problem that could have been solved more cheaply by spending two more weeks on prompting and retrieval.

Fine-tuning has its place. It is not the first place.

// The artefact
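# prompt/fewshot.py: try this for a week before fine-tuning anything
# Assumes Example (with .input and .label), INSTRUCTIONS, and an llm client are defined elsewhere.
def classify(text: str, examples: list[Example]) -> str:
    # Render each worked example in the same Input/Label layout the model is asked to complete.
    shots = "\n\n".join(f"Input: {e.input}\nLabel: {e.label}" for e in examples)
    # End on an open "Label:" so the completion is just the label for the new input.
    prompt = f"{INSTRUCTIONS}\n\n{shots}\n\nInput: {text}\nLabel:"
    return llm.generate(prompt).strip()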

Twenty iterations of few-shot prompting beat most fine-tunes, at near-zero cost per experiment.