Retrieval-Augmented Generation (RAG)
RAG grounds an LLM in your data: embed → retrieve → augment prompt → generate. Quality is dominated by retrieval, not the model.
Deep dive
Pipeline
- Ingest: chunk documents (semantic chunking beats fixed-size for prose).
- Embed: choose a model that matches your domain and language; store vectors.
- Retrieve: hybrid (dense + BM25) outperforms dense-only in most enterprise corpora.
- Rerank: a cross-encoder rerank of top-50 dense hits dramatically improves precision.
- Generate: structured prompt with explicit "answer only from context" and citations.
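The Retrieve and Generate steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes reciprocal rank fusion (RRF) as the way to merge dense and BM25 rankings, which the text does not prescribe, and the names `reciprocal_rank_fusion`, `build_prompt`, `dense_hits`, and `bm25_hits` are hypothetical. The "say so if the answer is missing" instruction is likewise an assumed addition to the "answer only from context" rule.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def build_prompt(question, passages):
    """Format retrieved passages as numbered context with a grounding instruction."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer only from the context below. "
        "Cite passages by their bracketed number. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical ranked ids from a dense retriever and from BM25.
dense_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Documents that rank well in both lists (here `d1` and `d3`) rise to the top of `fused`, which is the candidate set a cross-encoder reranker would then reorder.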
Evaluation
You can't improve what you don't measure. Build a golden set of question/expected-answer pairs. Track retrieval recall@k and generation faithfulness independently — they fail differently.
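As a rough sketch of the retrieval half of that measurement, recall@k can be computed as below. It assumes each golden question is also labeled with the ids of the documents that should be retrieved, which is an extra annotation beyond question/expected-answer pairs; `retrieve`, `golden`, and `fake_index` are hypothetical stand-ins.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant docs that appear in the top-k retrieved ids."""
    if not relevant_ids:
        raise ValueError("each golden question needs at least one relevant id")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(set(relevant_ids))


def mean_recall_at_k(golden_set, retrieve, k=5):
    """Average recall@k over (question, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in golden_set]
    return sum(scores) / len(scores)


# Hypothetical golden set and a stub retriever standing in for the real one.
golden = [("q1", ["a", "b"]), ("q2", ["c"])]
fake_index = {"q1": ["a", "x", "y"], "q2": ["z", "c", "w"]}
score = mean_recall_at_k(golden, lambda q: fake_index[q], k=2)
```

Faithfulness needs a separate scorer (human or LLM-as-judge), so it is deliberately left out of this function.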
Cost levers
Three levers: cache embeddings, cache top-k results per question template, and summarize long contexts before sending them to the LLM.
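One possible shape for the embedding cache is a lookup keyed by a hash of the text, so unchanged chunks are never re-embedded. This is an in-memory sketch under stated assumptions: `EmbeddingCache` and `embed_fn` are hypothetical names, and a production system would more likely back the store with a database or key-value service.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by content hash so unchanged text is never re-embedded."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # hypothetical callable: str -> list[float]
        self._store = {}
        self.misses = 0  # number of times the underlying embedder was called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


# Stub embedder for illustration only; a real one would call a model.
cache = EmbeddingCache(lambda text: [float(len(text))])
cache.get("termination clause")
cache.get("termination clause")  # second call is served from the cache
```

The same keying idea extends to caching top-k results, with the normalized question template as the key instead of the chunk text.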
Real-world example
From production
A legal-doc Q&A bot launched with naive RAG: dense-only retrieval, 1000-token chunks, no rerank. Users reported "confidently wrong" answers. The team switched to hybrid retrieval + cross-encoder rerank + semantic chunking; faithfulness on the golden set rose from 64% to 91%. Same model, same prompt.
Interview questions
2 senior-level
Q1. Why does RAG often fail?
Retrieval is the bottleneck, not generation. The usual causes are bad chunking, dense-only retrieval that misses keyword queries, no reranking, and no eval harness. Most teams blame the LLM and tune prompts when the fix is in retrieval.
Q2. How do you evaluate a RAG system?
Two stages, separately. Retrieval: recall@k on a labeled query set. Generation: faithfulness (does the answer follow from context?) and answer relevance, scored by humans or an LLM-as-judge with spot checks.
Common mistakes
- Fixed-size chunking that splits semantic units.
- Skipping rerank to 'save cost'; it is usually the highest-ROI step.
- No eval harness: flying blind on quality.
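To make the chunking mistake concrete, the sketch below contrasts a fixed-size split with a paragraph-aware one. Packing whole paragraphs is only a simple stand-in for semantic chunking, chosen here for illustration; `chunk_by_paragraph`, `max_chars`, and the sample `doc` are hypothetical, and sizes are in characters rather than tokens to keep the example dependency-free.

```python
def chunk_by_paragraph(text, max_chars=500):
    """Pack whole paragraphs into chunks of at most max_chars where possible.

    A paragraph longer than max_chars becomes its own oversized chunk
    rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = "Clause 1. The tenant pays rent monthly.\n\nClause 2. The landlord handles repairs."
fixed = [doc[i:i + 30] for i in range(0, len(doc), 30)]  # cuts mid-sentence
by_para = chunk_by_paragraph(doc, max_chars=60)
```

Here `fixed[0]` ends partway through the first clause, while `by_para` keeps each clause intact as its own chunk.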
Trade-offs
- Bigger context windows reduce the need for tight retrieval but cost more and increase latency.
- Hybrid retrieval adds infra (BM25 store) but markedly improves quality.