
RAG is mostly retrieval

Most RAG quality is a search problem with an LLM-shaped front end. The choice of LLM matters at the margin (better instruction-following, fewer hallucinations on ambiguous chunks). The choice of retriever, the chunking strategy, the embedding model, and the indexing approach matter at the centre.

The diagnostic to run on a stuck RAG system: hand-pick the perfect chunk for each of ten test questions and feed it to the model. If the model answers all ten correctly, the LLM is fine and retrieval is the bottleneck. Most stuck projects get exactly this result and then spend months tuning the LLM anyway.

This is the most common misallocation of effort in RAG. Teams pour months into prompt engineering, model swaps, fine-tuning, and chain-of-thought experiments, while the actual ceiling is set by whether the right chunk was retrieved before the LLM ever ran. The LLM can polish a good chunk into a great answer. It cannot reconstruct information that wasn't in its context.

A sequence to test stuck systems

Here is a sequence worth running on a stuck RAG system. First, run an oracle test: manually find the right chunk for each of 30 questions, then check the LLM's answers when it is given that chunk. If they're all good, retrieval is the bottleneck. Don't touch the LLM until retrieval is fixed.
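The oracle test can be sketched in a few lines. This is a minimal illustration, not a prescribed harness: `OracleCase`, `answer_fn`, and `is_correct` are hypothetical names. `answer_fn` stands in for whatever call sends a question plus one chunk to your LLM, and `is_correct` for whatever grading you trust (exact match, a rubric, a human).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OracleCase:
    question: str
    gold_chunk: str  # hand-picked chunk that contains the answer
    expected: str    # reference answer used for grading


def oracle_accuracy(
    cases: list[OracleCase],
    answer_fn: Callable[[str, str], str],    # (question, chunk) -> answer; assumed LLM wrapper
    is_correct: Callable[[str, str], bool],  # (answer, expected) -> verdict; assumed grader
) -> float:
    """Fraction of cases answered correctly when the model is handed the gold chunk."""
    if not cases:
        raise ValueError("need at least one oracle case")
    correct = 0
    for case in cases:
        answer = answer_fn(case.question, case.gold_chunk)
        if is_correct(answer, case.expected):
            correct += 1
    return correct / len(cases)
```

If this number is near 1.0 while end-to-end accuracy is poor, the gap is retrieval, since the only thing that changed between the two runs is which chunk reached the model.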

Second, instrument retrieval. Top-k recall against a labelled set is the metric that matters. Are the gold chunks in the top 5? In the top 20? If not, no LLM choice will save you.
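A rough sketch of that instrumentation, assuming each query has already been run and you hold its ranked list of retrieved chunk ids next to its set of gold ids. The function names and the `(ranked_ids, gold_ids)` pair layout are illustrative choices, not part of any library.

```python
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Share of a query's gold chunks that appear in the first k retrieved ids."""
    if not gold_ids:
        raise ValueError("query has no gold chunks")
    found = gold_ids.intersection(retrieved_ids[:k])
    return len(found) / len(gold_ids)


def mean_recall_by_cutoff(
    runs: list[tuple[list[str], set[str]]],  # one (ranked_ids, gold_ids) pair per query
    cutoffs: tuple[int, ...] = (5, 20),
) -> dict[int, float]:
    """Average recall@k over all queries, reported separately for each cutoff k."""
    if not runs:
        raise ValueError("need at least one query")
    return {
        k: sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)
        for k in cutoffs
    }
```

Reporting both cutoffs side by side separates two different problems: gold chunks that sit at rank 6 to 20 point to a ranking problem, while gold chunks missing from the top 20 entirely point to chunking, embedding, or indexing.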

Third, fix the chunking. Most RAG systems chunk naively (every 500 tokens, fixed window) and assume the LLM will figure it out. It usually can't. Chunks should match the structure of the document (sections, paragraphs, with overlap) and carry metadata (source, section header, version) into the LLM context.
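One way structure-aware chunking might look, as a sketch under loud assumptions: the document is plain text where blank lines separate blocks and section headings start with `#`. The `Chunk` fields, the word-count budget, and `render_for_prompt` are all hypothetical choices made for illustration. Paragraphs are never split, so a single oversized paragraph becomes its own chunk.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str
    section: str
    version: str


def split_sections(doc: str) -> list[tuple[str, list[str]]]:
    """Group paragraphs under the most recent '#'-style heading."""
    sections: list[tuple[str, list[str]]] = [("", [])]
    for block in doc.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            sections.append((block.lstrip("#").strip(), []))
        else:
            sections[-1][1].append(block)
    return [(title, paras) for title, paras in sections if paras]


def chunk_by_structure(
    doc: str, source: str, version: str, max_words: int = 200, overlap: int = 1
) -> list[Chunk]:
    """Pack whole paragraphs into chunks that never cross a section boundary."""
    chunks: list[Chunk] = []
    for title, paras in split_sections(doc):
        window: list[str] = []
        for para in paras:
            size = sum(len(p.split()) for p in window) + len(para.split())
            if window and size > max_words:
                chunks.append(Chunk("\n\n".join(window), source, title, version))
                # Carry the last `overlap` paragraphs into the next chunk.
                window = window[-overlap:] if overlap > 0 else []
            window.append(para)
        chunks.append(Chunk("\n\n".join(window), source, title, version))
    return chunks


def render_for_prompt(chunk: Chunk) -> str:
    """Prefix the chunk with its metadata so the LLM sees where the text came from."""
    header = f"[source: {chunk.source} | section: {chunk.section} | version: {chunk.version}]"
    return f"{header}\n{chunk.text}"
```

The overlap is counted in paragraphs rather than tokens so that a chunk boundary never lands mid-sentence; a token-based budget would work the same way with a tokenizer in place of `split()`.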

Fourth, consider hybrid search. Pure vector search works well for paraphrased queries; pure keyword search works well for jargon-heavy enterprise content. Most production-grade RAG uses both and re-ranks. BM25 plus vector plus a cross-encoder re-ranker on the top 50 is a recipe that beats single-method approaches consistently.
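The text above does not say how the BM25 and vector result lists get merged before re-ranking. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the recipe's required method. It needs only the ranked chunk ids from each retriever, not their scores. The constant `k = 60` is the value usually quoted for RRF; the cross-encoder step is not shown and would take the fused top 50 as its input.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]],  # one ranked list of chunk ids per retriever
    k: int = 60,                # damping constant; larger values flatten rank differences
    top_n: int = 50,            # how many fused candidates to pass to a re-ranker
) -> list[str]:
    """Fuse several rankings by summing 1 / (k + rank) for each chunk id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]
```

Because RRF works on ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales and cannot be added directly.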

Fifth, only after retrieval is solid: tune the LLM. The order matters. A prompt-tuned model on bad retrieval has a low ceiling. Good retrieval with a moderate LLM consistently beats a brilliant LLM with bad retrieval.

The other failure mode is over-confidence in vector search. Vector search is genuinely useful when the query and the answer are paraphrases of each other (a customer asking 'how do I return something?' and the docs saying 'product return process'). It's much weaker when the answer requires reasoning across multiple chunks, when the query uses specific identifiers (case numbers, product codes), or when the corpus has near-duplicate documents and the wrong version gets retrieved. Each of these is a real failure mode that no choice of LLM fixes.

A sanity check worth running: every RAG project should have a labelled retrieval eval set, separate from the final-answer eval set. The retrieval set says, for each query, which chunks should be retrieved. You measure recall and rank, not just final answer correctness. This is the equivalent of unit-testing the search before integration-testing the whole pipeline.
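For the rank half of "recall and rank", mean reciprocal rank (MRR) is one simple option. It is sketched here as an assumption about what the rank metric could be, using the same hypothetical `(ranked_ids, gold_ids)` pair per query.

```python
def reciprocal_rank(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    """1 / rank of the first gold chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in gold_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over all queries in the labelled retrieval set."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)
```

An MRR that rises while recall stays flat means gold chunks are moving up the list, which matters when only the top few results fit into the LLM context.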

The right time to think about which LLM to use is after retrieval returns the right chunks. The wrong time is week one. Most teams do it the wrong way around because the LLM is the visible part of RAG. The retrieval is the boring part. The boring part is where the quality lives.

// The artefact
# rag/retrieval_eval.py: unit-test the search before integration-testing the pipeline
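from dataclasses import dataclass

@dataclass
class Query:
    text: str
    gold_chunk_ids: set[str]  # ids of the chunks a correct retrieval must surface

def hit_rate_at_k(queries: list[Query], retriever, k: int = 10) -> float:
    """Fraction of queries where at least one gold chunk appears in the top-k retrieved set.
    A hit-rate form of recall - sometimes called Hit@k - simpler than per-query strict recall
    when most queries have one or two relevant chunks.
    Assumes `retriever.search(text, k=k)` returns chunks that each expose an `id`."""
    if not queries:
        raise ValueError("queries must not be empty")  # avoid dividing by zero
    hits = 0
    for q in queries:
        retrieved = retriever.search(q.text, k=k)
        retrieved_ids = {chunk.id for chunk in retrieved}
        if any(gold in retrieved_ids for gold in q.gold_chunk_ids):
            hits += 1
    return hits / len(queries)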

# If hit-rate@10 is below ~0.85, no amount of LLM tuning will fix the answer quality.

Hit-rate@k against a labelled retrieval set. Measure this first. The model can't quote what it never retrieved.