Retrieval-Augmented Generation (RAG) is one of the most widely deployed LLM patterns in production. Understanding when to use RAG versus fine-tuning — and what each one actually does well — is a core expectation in AI/ML system design interviews in 2026.
What the Interviewer Is Testing
Can you reason about the trade-offs between RAG and fine-tuning given a specific use case? Do you know the components of a RAG pipeline well enough to design one? Can you identify when both are needed, when neither is enough, and when plain prompt engineering suffices?
The Problem RAG Solves
LLMs have two fundamental limitations for enterprise use:
- Knowledge cutoff: The model only knows what was in its training data. Events after the cutoff, proprietary company knowledge, internal documentation — none of it is accessible.
- Context window: Even a 128K token context can’t hold an entire knowledge base of millions of documents.
Fine-tuning addresses neither cleanly. You can fine-tune on internal documents, but the model will “memorize” that knowledge imperfectly, may hallucinate by blending learned facts incorrectly, and requires retraining every time documents change. RAG is different: it retrieves relevant documents at inference time and provides them to the model as context — fresh, accurate, and auditable.
RAG Architecture
Offline (indexing):
```
Documents → Chunker → Embedding Model → Vector DB
```
Online (inference):
```
User Query
  → Embedding Model (same model as indexing)
  → Vector DB (ANN search: find top-K relevant chunks)
  → Retrieved chunks + original query → LLM
  → Response (with optional citations)
```
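The online path above can be sketched end to end in a few lines. This is a toy, not a production pipeline: bag-of-words counts stand in for a real embedding model, and an in-memory list stands in for the vector DB (the chunk texts and query are illustrative):

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts. A real system calls an
    # embedding model here -- the *same* one for indexing and queries.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Offline: chunk and index the documents.
chunks = [
    "The API rate limit is 100 requests per minute.",
    "Refunds are processed within 5 business days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Online: embed the query, retrieve top-K, assemble the prompt.
query = "What is the API rate limit?"
q_vec = embed(query)
top_k = sorted(index, key=lambda c: cosine(q_vec, c[1]), reverse=True)[:1]
prompt = "Context:\n" + "\n".join(c for c, _ in top_k) + f"\n\nQuestion: {query}"
```

The retrieved chunk plus the original query become the LLM prompt; swapping in a real embedding model and vector DB changes the components but not the shape of the flow.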
Chunking Strategies
Documents must be split into chunks before embedding. The chunk is the retrieval unit — the model receives the top-K chunks as context. Chunk size is a fundamental trade-off:
- Too small (50 tokens): Each chunk lacks context. “The limit is 100” — 100 what? Retrieved chunks are confusing without surrounding text.
- Too large (2,000 tokens): The embedding averages over too many concepts. Retrieval recall drops because the embedding isn’t specific enough.
- Sweet spot: 200–500 tokens, with a 50–100 token overlap between adjacent chunks so ideas aren’t cut off mid-sentence at chunk boundaries.
Chunking strategies:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,       # measured in characters by default; pass a token-based length_function for token counts
    chunk_overlap=80,
    separators=["\n\n", "\n", ". ", " ", ""],  # prefer paragraph breaks
)
chunks = splitter.split_documents(documents)
```
Semantic chunking: Split at natural topic boundaries rather than fixed token counts — detect when sentence embeddings shift significantly. Higher quality, more complex to implement. Worth it for long, heterogeneous documents.
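A minimal sketch of the semantic-chunking idea: compute an embedding per sentence and start a new chunk wherever similarity between adjacent sentences drops. Bag-of-words vectors stand in for real sentence embeddings here, and the threshold is illustrative (a real implementation would use a sentence-embedding model and tune the cutoff):

```python
from collections import Counter
from math import sqrt

def embed(sentence: str) -> Counter:
    # Stand-in for a sentence-embedding model.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    # Assumes a non-empty sentence list. Start a new chunk whenever the
    # next sentence is dissimilar from the previous one (a topic shift).
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(nxt)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(nxt)
    chunks.append(" ".join(current))
    return chunks

sections = semantic_chunks([
    "the cat sat on the mat",
    "the cat likes the mat",
    "quarterly revenue grew fast",
    "revenue grew in the quarter",
])
# -> two chunks: one about the cat, one about revenue
```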
Embedding Models
The embedding model converts text to a dense vector. The same model must be used for both indexing and query embedding — otherwise the vector spaces won’t align.
| Model | Dimensions | Cost | Best for |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (truncatable) | $0.13/M tokens (API) | Quality-first, already on OpenAI stack |
| OpenAI text-embedding-3-small | 1536 | $0.02/M tokens | Cost-sensitive, near text-ada quality |
| BGE-M3 (BAAI) | 1024 | Free (self-hosted) | Multilingual, best open-source option |
| E5-large-v2 | 1024 | Free (self-hosted) | Strong English performance, lightweight |
| Cohere Embed v3 | 1024 | $0.10/M tokens | Strong on domain-specific retrieval |
Vector Databases
The vector DB stores embeddings and supports approximate nearest neighbor (ANN) search — finding the K vectors most similar to the query vector by cosine or dot product similarity.
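As a reference point for what ANN indexes approximate, exact top-K search is just a scored scan of every stored vector. A minimal sketch with toy 2-D vectors (assumes no zero-length vectors):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def exact_top_k(query: list[float], vectors: list[list[float]], k: int = 2) -> list[int]:
    # Exact (brute-force) search: score every vector, keep the k best.
    # ANN indexes like HNSW approximate this ranking without scanning
    # the whole collection.
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:k]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
nearest = exact_top_k([1.0, 0.0], vectors, k=2)  # indices 0 and 2
```

Brute force is O(N) per query; that is exactly the cost the ANN indexes below avoid.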
| Database | Deployment | Best for |
|---|---|---|
| Pinecone | Managed cloud | Production with minimal ops overhead |
| Weaviate | Self-hosted / cloud | Hybrid search (vector + BM25 keyword) |
| Qdrant | Self-hosted / cloud | High-performance, Rust-based, filtering |
| pgvector | PostgreSQL extension | Already on Postgres, smaller scale (<1M vectors) |
| Chroma | Embedded / self-hosted | Prototyping, local development |
| FAISS | Library (no server) | In-process ANN, full control |
Most vector DBs use HNSW (Hierarchical Navigable Small World) as the ANN index — a graph-based index that trades a small recall penalty (<1% with tuned parameters) for orders-of-magnitude faster search than exact brute force.
Hybrid Search
Pure vector search misses exact keyword matches. “What is our API rate limit for endpoint /v2/users?” — vector search might retrieve semantically similar docs but miss the one with the exact number. BM25 keyword search catches exact matches but misses semantic variations.
Hybrid search combines both, fusing scores with Reciprocal Rank Fusion (RRF) or a weighted combination. Weaviate and Elasticsearch/OpenSearch support hybrid search natively. In most production RAG systems, hybrid search outperforms pure vector search by 5–20% on retrieval recall.
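RRF itself is only a few lines: each document earns a score of 1/(k + rank) from every ranking it appears in, with k = 60 as the conventional constant. A minimal sketch with illustrative document IDs:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: sum 1 / (k + rank) across all rankings.
    # k dampens the influence of top positions; 60 is the usual default.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic ranking
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # keyword ranking
fused = rrf([vector_hits, bm25_hits])
```

Documents appearing high in both lists (here doc_a) float to the top, without ever comparing raw vector and BM25 scores, which live on incompatible scales.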
RAG vs Fine-tuning Decision Framework
| Scenario | Recommended approach |
|---|---|
| Knowledge changes frequently (docs updated daily) | RAG — just re-index new documents |
| Need citations / attribution (“source: page 12 of policy doc”) | RAG — chunks carry source metadata |
| Large knowledge base (>10K documents) | RAG — can’t fit in context or fine-tune on everything |
| Need consistent output format or tone | Fine-tuning — RAG doesn’t change model behavior |
| Proprietary domain language / jargon | Fine-tuning or continued pretraining |
| Latency-critical (<100ms) — can’t afford retrieval | Fine-tuning — knowledge in weights, no retrieval hop |
| Best of both | RAG + LoRA fine-tuning (on style/format) simultaneously |
RAG Evaluation: RAGAS
Measuring RAG quality requires evaluating two separate components:
- Retrieval quality: Did the retrieval step find relevant chunks? Metrics: context precision (retrieved chunks are relevant), context recall (all relevant information was retrieved).
- Generation quality: Given the retrieved context, did the LLM generate a faithful answer? Metrics: faithfulness (answer is grounded in context — no hallucination), answer relevance (answer actually addresses the question).
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# eval_dataset columns: question, answer, contexts, ground_truth
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# faithfulness: 0.87, answer_relevancy: 0.91, context_precision: 0.79, context_recall: 0.83
```
Low faithfulness → the LLM is hallucinating beyond the retrieved context. Low context recall → retrieval is missing relevant chunks (tune chunk size, embedding model, or number of retrieved chunks). Low context precision → too much irrelevant context is being included (tighten the similarity threshold).
Common RAG Failures
- Chunking at the wrong granularity: Splitting a 10-page product spec into 1,000-character chunks loses context. Splitting a FAQ into 2,000-token chunks makes retrieval imprecise.
- Embedding model mismatch: Using OpenAI embeddings at index time but switching to a different model later — the vector space changes, retrieval breaks completely.
- Not re-ranking: Top-K retrieval by cosine similarity isn’t always the most relevant. Add a cross-encoder re-ranker (Cohere Rerank, BGE-Reranker) to re-score the top-20 candidates and return the top-5 — significant precision improvement.
- Ignoring metadata filtering: If your knowledge base has documents from multiple clients, a query from Client A must not retrieve Client B’s documents. Filter by metadata at query time, not post-retrieval.
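The re-ranking pattern from the third bullet is simple to express: let a cheap retriever over-fetch candidates, then re-score them with a more expensive model and keep the best few. In this sketch, `toy_cross_encoder` is a keyword-overlap stand-in; a real system would call a cross-encoder model such as Cohere Rerank or BGE-Reranker at that point:

```python
def toy_cross_encoder(query: str, chunk: str) -> float:
    # Stand-in scorer: real cross-encoders jointly encode the
    # (query, chunk) pair and output a learned relevance score.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Re-score the retriever's over-fetched candidates (e.g. top-20)
    # and return only the best top_n for the LLM context.
    return sorted(candidates,
                  key=lambda c: toy_cross_encoder(query, c),
                  reverse=True)[:top_n]

candidates = [
    "rate limits apply per api key",
    "our office is in berlin",
    "the api rate limit is 100",
]
best = rerank("api rate limit", candidates, top_n=2)
```

The two-stage split is the key design choice: the bi-encoder retrieval stays fast over millions of chunks, while the expensive pairwise scoring only ever sees a handful of candidates.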
Interview Follow-ups
- How do you handle a question that requires synthesizing information from 10 different documents — more than fits in the LLM’s context?
- Your RAG system hallucinates 15% of the time on legal questions. How do you diagnose and fix this?
- How would you add multi-hop retrieval — a question that requires chaining answers: “What is the name of the CEO of the company that acquired Figma?”
- How do you handle confidential documents — ensuring a user never sees content they don’t have permission to access, even if their query’s embedding is similar to restricted content?
Related ML Topics
- Embeddings and Vector Databases — the retrieval layer of RAG: embedding models, HNSW ANN search, and vector DB comparison
- Fine-tuning LLMs vs Training from Scratch — the decision framework for when RAG alone isn’t enough and fine-tuning is needed
- How Transformer Models Work — the LLM generation component of RAG; understanding attention helps diagnose faithfulness failures
- Classification Metrics — RAGAS context precision and context recall are direct analogues of classification precision and recall; the same evaluation framework applies
See also: How to Evaluate an LLM — RAGAS metrics (faithfulness, context precision, context recall) provide automated evaluation specifically for RAG pipelines.