Retrieval-Augmented Generation (RAG) is one of the most widely deployed LLM patterns in production. Understanding when to use RAG versus fine-tuning — and what each one actually does well — is a core expectation in AI/ML system design interviews in 2026.
What the Interviewer Is Testing
Can you reason about the trade-offs between RAG and fine-tuning given a specific use case? Do you know the components of a RAG pipeline well enough to design one? Can you identify when both are needed, when neither is enough, and when plain prompt engineering suffices?
The Problem RAG Solves
LLMs have two fundamental limitations for enterprise use:
- Knowledge cutoff: The model only knows what was in its training data. Events after the cutoff, proprietary company knowledge, internal documentation — none of it is accessible.
- Context window: Even a 128K token context can’t hold an entire knowledge base of millions of documents.
Fine-tuning addresses neither cleanly. You can fine-tune on internal documents, but the model will “memorize” that knowledge imperfectly, may hallucinate by blending learned facts incorrectly, and requires retraining every time documents change. RAG is different: it retrieves relevant documents at inference time and provides them to the model as context — fresh, accurate, and auditable.
RAG Architecture
Offline (indexing):
```
Documents → Chunker → Embedding Model → Vector DB
```
Online (inference):
```
User Query
  → Embedding Model (same model as indexing)
  → Vector DB (ANN search: find top-K relevant chunks)
  → Retrieved chunks + original query → LLM
  → Response (with optional citations)
```
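The online path above can be sketched end to end in a few lines. This is a toy, not a production pipeline: bag-of-words counts stand in for a real embedding model, and an in-memory list stands in for the vector DB (the chunk texts and query are illustrative):

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts. A real system calls an
    # embedding model here -- the *same* one for indexing and queries.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Offline: chunk and index the documents.
chunks = [
    "The API rate limit is 100 requests per minute.",
    "Refunds are processed within 5 business days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Online: embed the query, retrieve top-K, assemble the prompt.
query = "What is the API rate limit?"
q_vec = embed(query)
top_k = sorted(index, key=lambda c: cosine(q_vec, c[1]), reverse=True)[:1]
prompt = "Context:\n" + "\n".join(c for c, _ in top_k) + f"\n\nQuestion: {query}"
```

The retrieved chunk plus the original query become the LLM prompt; swapping in a real embedding model and vector DB changes the components but not the shape of the flow.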
Chunking Strategies
Documents must be split into chunks before embedding. The chunk is the retrieval unit — the model receives the top-K chunks as context. Chunk size is a fundamental trade-off:
- Too small (50 tokens): Each chunk lacks context. “The limit is 100” — 100 what? Retrieved chunks are confusing without surrounding text.
- Too large (2,000 tokens): The embedding averages over too many concepts. Retrieval recall drops because the embedding isn’t specific enough.
- Sweet spot: 200–500 tokens, with a 50–100 token overlap between adjacent chunks so ideas aren’t cut off mid-sentence at chunk boundaries.
Chunking strategies:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,       # measured in characters by default; pass a token-based length_function for token counts
    chunk_overlap=80,
    separators=["\n\n", "\n", ". ", " ", ""],  # prefer paragraph breaks
)
chunks = splitter.split_documents(documents)
```
Semantic chunking: Split at natural topic boundaries rather than fixed token counts — detect when sentence embeddings shift significantly. Higher quality, more complex to implement. Worth it for long, heterogeneous documents.
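A minimal sketch of the semantic-chunking idea: compute an embedding per sentence and start a new chunk wherever similarity between adjacent sentences drops. Bag-of-words vectors stand in for real sentence embeddings here, and the threshold is illustrative (a real implementation would use a sentence-embedding model and tune the cutoff):

```python
from collections import Counter
from math import sqrt

def embed(sentence: str) -> Counter:
    # Stand-in for a sentence-embedding model.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    # Assumes a non-empty sentence list. Start a new chunk whenever the
    # next sentence is dissimilar from the previous one (a topic shift).
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(nxt)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(nxt)
    chunks.append(" ".join(current))
    return chunks

sections = semantic_chunks([
    "the cat sat on the mat",
    "the cat likes the mat",
    "quarterly revenue grew fast",
    "revenue grew in the quarter",
])
# -> two chunks: one about the cat, one about revenue
```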
Embedding Models
The embedding model converts text to a dense vector. The same model must be used for both indexing and query embedding — otherwise the vector spaces won’t align.
| Model | Dimensions | Cost | Best for |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (truncatable) | $0.13/M tokens (API) | Quality-first, already on OpenAI stack |
| OpenAI text-embedding-3-small | 1536 | $0.02/M tokens | Cost-sensitive, near text-ada quality |
| BGE-M3 (BAAI) | 1024 | Free (self-hosted) | Multilingual, best open-source option |
| E5-large-v2 | 1024 | Free (self-hosted) | Strong English performance, lightweight |
| Cohere Embed v3 | 1024 | $0.10/M tokens | Strong on domain-specific retrieval |
Vector Databases
The vector DB stores embeddings and supports approximate nearest neighbor (ANN) search — finding the K vectors most similar to the query vector by cosine or dot product similarity.
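As a reference point for what ANN indexes approximate, exact top-K search is just a scored scan of every stored vector. A minimal sketch with toy 2-D vectors (assumes no zero-length vectors):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def exact_top_k(query: list[float], vectors: list[list[float]], k: int = 2) -> list[int]:
    # Exact (brute-force) search: score every vector, keep the k best.
    # ANN indexes like HNSW approximate this ranking without scanning
    # the whole collection.
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:k]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
nearest = exact_top_k([1.0, 0.0], vectors, k=2)  # indices 0 and 2
```

Brute force is O(N) per query; that is exactly the cost the ANN indexes below avoid.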
| Database | Deployment | Best for |
|---|---|---|
| Pinecone | Managed cloud | Production with minimal ops overhead |
| Weaviate | Self-hosted / cloud | Hybrid search (vector + BM25 keyword) |
| Qdrant | Self-hosted / cloud | High-performance, Rust-based, filtering |
| pgvector | PostgreSQL extension | Already on Postgres, smaller scale (<1M vectors) |
| Chroma | Embedded / self-hosted | Prototyping, local development |
| FAISS | Library (no server) | In-process ANN, full control |
Most vector DBs use HNSW (Hierarchical Navigable Small World) as the ANN index — a graph-based index that trades a small recall penalty (<1% with tuned parameters) for orders-of-magnitude faster search than exact brute force.
Hybrid Search
Pure vector search misses exact keyword matches. “What is our API rate limit for endpoint /v2/users?” — vector search might retrieve semantically similar docs but miss the one with the exact number. BM25 keyword search catches exact matches but misses semantic variations.
Hybrid search combines both, fusing scores with Reciprocal Rank Fusion (RRF) or a weighted combination. Weaviate and Elasticsearch/OpenSearch support hybrid search natively. In most production RAG systems, hybrid search outperforms pure vector search by 5–20% on retrieval recall.
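RRF itself is only a few lines: each document earns a score of 1/(k + rank) from every ranking it appears in, with k = 60 as the conventional constant. A minimal sketch with illustrative document IDs:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: sum 1 / (k + rank) across all rankings.
    # k dampens the influence of top positions; 60 is the usual default.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic ranking
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # keyword ranking
fused = rrf([vector_hits, bm25_hits])
```

Documents appearing high in both lists (here doc_a) float to the top, without ever comparing raw vector and BM25 scores, which live on incompatible scales.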
RAG vs Fine-tuning Decision Framework
| Scenario | Recommended approach |
|---|---|
| Knowledge changes frequently (docs updated daily) | RAG — just re-index new documents |
| Need citations / attribution (“source: page 12 of policy doc”) | RAG — chunks carry source metadata |
| Large knowledge base (>10K documents) | RAG — can’t fit in context or fine-tune on everything |
| Need consistent output format or tone | Fine-tuning — RAG doesn’t change model behavior |
| Proprietary domain language / jargon | Fine-tuning or continued pretraining |
| Latency-critical (<100ms) — can’t afford retrieval | Fine-tuning — knowledge in weights, no retrieval hop |
| Best of both | RAG + LoRA fine-tuning (on style/format) simultaneously |
RAG Evaluation: RAGAS
Measuring RAG quality requires evaluating two separate components:
- Retrieval quality: Did the retrieval step find relevant chunks? Metrics: context precision (retrieved chunks are relevant), context recall (all relevant information was retrieved).
- Generation quality: Given the retrieved context, did the LLM generate a faithful answer? Metrics: faithfulness (answer is grounded in context — no hallucination), answer relevance (answer actually addresses the question).
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# eval_dataset columns: question, answer, contexts, ground_truth
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# faithfulness: 0.87, answer_relevancy: 0.91, context_precision: 0.79, context_recall: 0.83
```
Low faithfulness → the LLM is hallucinating beyond the retrieved context. Low context recall → retrieval is missing relevant chunks (tune chunk size, embedding model, or number of retrieved chunks). Low context precision → too much irrelevant context is being included (tighten the similarity threshold).
Common RAG Failures
- Chunking at the wrong granularity: Splitting a 10-page product spec into 1,000-character chunks loses context. Splitting a FAQ into 2,000-token chunks makes retrieval imprecise.
- Embedding model mismatch: Using OpenAI embeddings at index time but switching to a different model later — the vector space changes, retrieval breaks completely.
- Not re-ranking: Top-K retrieval by cosine similarity isn’t always the most relevant. Add a cross-encoder re-ranker (Cohere Rerank, BGE-Reranker) to re-score the top-20 candidates and return the top-5 — significant precision improvement.
- Ignoring metadata filtering: If your knowledge base has documents from multiple clients, a query from Client A must not retrieve Client B’s documents. Filter by metadata at query time, not post-retrieval.
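The re-ranking pattern from the third bullet is simple to express: let a cheap retriever over-fetch candidates, then re-score them with a more expensive model and keep the best few. In this sketch, `toy_cross_encoder` is a keyword-overlap stand-in; a real system would call a cross-encoder model such as Cohere Rerank or BGE-Reranker at that point:

```python
def toy_cross_encoder(query: str, chunk: str) -> float:
    # Stand-in scorer: real cross-encoders jointly encode the
    # (query, chunk) pair and output a learned relevance score.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Re-score the retriever's over-fetched candidates (e.g. top-20)
    # and return only the best top_n for the LLM context.
    return sorted(candidates,
                  key=lambda c: toy_cross_encoder(query, c),
                  reverse=True)[:top_n]

candidates = [
    "rate limits apply per api key",
    "our office is in berlin",
    "the api rate limit is 100",
]
best = rerank("api rate limit", candidates, top_n=2)
```

The two-stage split is the key design choice: the bi-encoder retrieval stays fast over millions of chunks, while the expensive pairwise scoring only ever sees a handful of candidates.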
Interview Follow-ups
- How do you handle a question that requires synthesizing information from 10 different documents — more than fits in the LLM’s context?
- Your RAG system hallucinates 15% of the time on legal questions. How do you diagnose and fix this?
- How would you add multi-hop retrieval — a question that requires chaining answers: “What is the name of the CEO of the company that acquired Figma?”
- How do you handle confidential documents — ensuring a user never sees content they don’t have permission to access, even if their query’s embedding is similar to restricted content?
Related ML Topics
- Embeddings and Vector Databases — the retrieval layer of RAG: embedding models, HNSW ANN search, and vector DB comparison
- Fine-tuning LLMs vs Training from Scratch — the decision framework for when RAG alone isn’t enough and fine-tuning is needed
- How Transformer Models Work — the LLM generation component of RAG; understanding attention helps diagnose faithfulness failures
- Classification Metrics — RAGAS context precision and context recall are direct analogues of classification precision and recall; the same evaluation framework applies
See also: How to Evaluate an LLM — RAGAS metrics (faithfulness, context precision, context recall) provide automated evaluation specifically for RAG pipelines.