AI Integration

Retrieval-Augmented Generation (RAG)

RAG grounds an LLM in your data: embed → retrieve → augment prompt → generate. Quality is dominated by retrieval, not the model.

rag, llm, embeddings

Deep dive

Pipeline

  1. Ingest: chunk documents (semantic chunking beats fixed-size for prose).
  2. Embed: choose a model that matches your domain and language; store vectors.
  3. Retrieve: hybrid (dense + BM25) outperforms dense-only in most enterprise corpora.
  4. Rerank: a cross-encoder rerank of the top-50 retrieved hits typically improves precision substantially.
  5. Generate: structured prompt with explicit "answer only from context" and citations.
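
To make step 3 concrete, here is a minimal sketch of one common way to merge dense and BM25 result lists: reciprocal rank fusion (RRF). The source does not name a fusion method, so RRF is an assumption here, as are the function name, the constant `k=60`, and the document ids, which are purely illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists; in practice these come from a vector index
# and a BM25 index queried with the same question.
dense_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Documents that both retrievers rank well rise to the top, and the fused list is what would be passed on to the reranker in step 4.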

Evaluation

You can't improve what you don't measure. Build a golden set of question/expected-answer pairs. Track retrieval recall@k and generation faithfulness independently — they fail differently.
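
As a rough illustration of the retrieval half of that measurement, the sketch below computes recall@k over a small golden set. It assumes each question is labeled with the ids of the chunks that contain the answer, which goes slightly beyond plain question/expected-answer pairs; the chunk ids, the `fake_index` lookup, and the function names are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_recall_at_k(golden_set, retrieve, k):
    """Average recall@k over (question, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in golden_set]
    return sum(scores) / len(scores)


# Hypothetical golden set: question -> ids of chunks that contain the answer.
golden_set = [
    ("termination notice period", {"c12", "c40"}),
    ("governing law", {"c7"}),
]

# Stand-in for a real retriever: a fixed ranking per question.
fake_index = {
    "termination notice period": ["c12", "c3", "c40", "c9"],
    "governing law": ["c2", "c5", "c7"],
}

score_at_2 = mean_recall_at_k(golden_set, fake_index.__getitem__, k=2)
score_at_3 = mean_recall_at_k(golden_set, fake_index.__getitem__, k=3)
print(score_at_2, score_at_3)
```

Generation faithfulness needs its own scorer (human or LLM-as-judge) and is deliberately left out of this sketch, so the two metrics stay independent.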

Cost levers

Cache embeddings, cache top-k results per question template, and summarize long contexts before sending them to the LLM.
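
The first lever could look something like the sketch below: an in-memory cache keyed by a hash of the text, so identical inputs are embedded only once. The class name, the `fake_embed` stand-in, and the dictionary store are all assumptions for illustration; a production setup would more likely use a persistent store and a real embedding call.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by content hash so repeated text is embedded once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # any callable: str -> list of floats
        self._store = {}           # in-memory only; swap for a persistent store
        self.misses = 0            # number of times embed_fn was actually called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text):
    # Hypothetical stand-in for a real embedding model call.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
first = cache.get("force majeure clause")
second = cache.get("force majeure clause")  # served from the cache
print(cache.misses)
```

The same keying idea extends to caching top-k results, with the normalized question template as the key instead of the raw text.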

Real-world example

From production

Legal-doc Q&A bot launched with naive RAG: dense-only, 1000-token chunks, no rerank. Users reported "confidently wrong" answers. Switched to hybrid retrieval + cross-encoder rerank + semantic chunking; faithfulness on the golden set rose from 64% to 91%. Same model, same prompt.

Interview questions

2 senior-level
Q1. Why does RAG often fail?

Retrieval is the bottleneck, not generation. The usual causes are bad chunking, dense-only retrieval that misses keyword-heavy queries, no reranking, and no eval harness. Most teams blame the LLM and tune prompts when the fix is in retrieval.

Q2. How do you evaluate a RAG system?

Two stages, separately. Retrieval: recall@k on a labeled query set. Generation: faithfulness (does the answer follow from context?) and answer relevance, scored by humans or an LLM-as-judge with spot checks.

Common mistakes

  • Fixed-size chunking that splits semantic units.

  • Skipping rerank to 'save cost' — usually the highest-ROI step.

  • No eval harness — flying blind on quality.
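
For the first mistake above, a minimal sketch of a chunker that never cuts inside a paragraph may help. It uses blank-line paragraph boundaries as a simple proxy for semantic units, which is an assumption and much cruder than embedding-based semantic chunking; the function name, the `max_chars` budget, and the sample text are hypothetical.

```python
def chunk_by_paragraph(text, max_chars=1000):
    """Pack whole paragraphs into chunks of at most max_chars where possible.

    A single paragraph longer than max_chars becomes its own oversized chunk
    rather than being split in the middle.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a new chunk.
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk first.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


doc = "Clause 1. Term.\n\nClause 2. Fees are due monthly.\n\nClause 3. Notice."
chunks = chunk_by_paragraph(doc, max_chars=50)
for c in chunks:
    print(repr(c))
```

With a 50-character budget the first two clauses share a chunk and the third starts a new one, so no clause is ever split across chunks.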

Trade-offs

  • Bigger context windows reduce the need for tight retrieval but cost more and increase latency.

  • Hybrid retrieval adds infra (BM25 store) but markedly improves quality.
