Introduction
RAG (Retrieval-Augmented Generation) grounds LLM responses in external knowledge retrieved at query time. By injecting relevant retrieved context into the prompt, RAG reduces hallucination and gives the model access to up-to-date information without retraining.
Document Ingestion Pipeline
The pipeline ingests documents (PDFs, HTML, markdown, databases), extracts text, chunks it, generates embeddings, and stores them in a vector database. It is triggered by document upload or a scheduled crawl. A Document record holds: doc_id, source_url, content_type, raw_text, created_at, and chunk_count.
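The Document record described above can be sketched as a Python dataclass; the field names match those listed, but the class itself is illustrative:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Document:
    """Record created for each ingested source."""
    doc_id: str
    source_url: str
    content_type: str      # e.g. "application/pdf", "text/html"
    raw_text: str          # extracted plain text
    created_at: datetime
    chunk_count: int = 0   # filled in after chunking
```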
Chunking Strategy
Documents are split into chunks small enough for the embedding model’s input limit and for assembly into the LLM’s prompt. Fixed-size chunking uses 512 tokens with a 50-token overlap; the overlap preserves context across chunk boundaries. Semantic chunking splits at paragraph or section boundaries rather than at fixed sizes. Recursive character splitting tries double newlines first, then single newlines, then sentence boundaries, until every chunk is small enough. Each chunk carries metadata: doc_id, chunk_index, page_number, and section_header.
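The recursive splitting strategy can be sketched as follows, using a character budget as a stand-in for the 512-token limit (the function name, budget, and separator list are illustrative):

```python
def recursive_split(text, max_chars=2000, separators=("\n\n", "\n", ". ")):
    """Split on the coarsest separator first; recurse into finer
    separators only for pieces that are still too large."""
    if len(text) <= max_chars or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_chars:
            current = candidate          # keep packing this chunk
            continue
        if current:
            chunks.append(current)       # flush the full chunk
        if len(part) > max_chars:
            chunks.extend(recursive_split(part, max_chars, rest))
            current = ""
        else:
            chunks.append(part)
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

A production splitter would also attach the overlap and per-chunk metadata described above.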
Embedding Generation
Each chunk is converted to a dense vector using an embedding model. OpenAI’s text-embedding-3-small produces 1536-dimensional vectors; local models like E5-large produce 1024-dimensional vectors. Batching processes 100 chunks per API call for cost efficiency. Embeddings are stored in the vector database alongside chunk text and metadata. When the embedding model changes, a full reindex is required.
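Batching can be sketched independently of any particular provider; here `embed_fn` is a hypothetical wrapper around whatever embedding API is in use, mapping a list of strings to a list of vectors:

```python
def embed_in_batches(chunks, embed_fn, batch_size=100):
    """Call embed_fn on successive slices of `chunks` so each API
    call carries up to batch_size inputs; returns one vector per chunk."""
    vectors = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_fn(chunks[i:i + batch_size]))
    return vectors
```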
Vector Retrieval
The user query is embedded using the same embedding model. An ANN search in the vector database returns the top-K most similar chunks (typically K=5 to 20). Cosine similarity (dot product of normalized vectors) is the standard similarity metric. Hybrid retrieval combines dense ANN results with sparse BM25 keyword results for better recall. A cross-encoder reranking model rescores the top-K candidates for improved precision.
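Because cosine similarity over normalized vectors reduces to a dot product, scoring can be illustrated with an exact scan standing in for the ANN search a real vector database would perform (all names here are illustrative):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query_vec, index, k=5):
    """index: list of (chunk_id, unit_vector). Cosine similarity of
    unit vectors is their dot product; a vector database replaces this
    exhaustive scan with an approximate nearest-neighbor search."""
    q = normalize(query_vec)
    scored = [(sum(a * b for a, b in zip(q, v)), cid) for cid, v in index]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]
```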
Context Assembly
Retrieved chunks are assembled into a context string. Ordering matters: chunks are placed in relevance order, or re-sorted into document order so that surrounding context reads coherently. Deduplication removes near-identical chunks using SimHash. Context truncation keeps the total within the LLM’s context window (e.g., 128K tokens). Citation metadata records which chunk each piece of context came from, enabling attribution in the response.
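A simplified sketch of assembly, in which exact-match deduplication stands in for SimHash and a character budget stands in for the token-based context limit (the `[doc#index]` citation tag format is an assumption):

```python
def assemble_context(chunks, max_chars=8000):
    """chunks: list of (doc_id, chunk_index, text), already sorted by
    relevance. Skip duplicates, stop at the budget, and tag each piece
    with its source for later citation."""
    seen, parts, used = set(), [], 0
    for doc_id, idx, text in chunks:
        if text in seen:              # exact-match stand-in for SimHash
            continue
        seen.add(text)
        piece = f"[{doc_id}#{idx}] {text}"
        if used + len(piece) > max_chars:
            break                     # context budget exhausted
        parts.append(piece)
        used += len(piece)
    return "\n\n".join(parts)
```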
LLM Prompt Construction
The system prompt defines the assistant persona and instructions. Retrieved context is injected as a user or system turn, and the user query is appended last. Example pattern: system = “You are a helpful assistant. Answer based on the provided context.” | context = [retrieved chunks] | user = [user question]. Temperature is set to 0 to favor deterministic, factual answers. Structured output prompting is used when parseable responses are required.
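The prompt pattern above, written as a chat-style message list; the system wording comes from the example, while the context formatting is an assumption:

```python
def build_messages(context_chunks, question):
    """Assemble chat messages: system instructions first, then the
    retrieved context and the user question in the final user turn."""
    context = "\n\n".join(context_chunks)
    return [
        {"role": "system",
         "content": ("You are a helpful assistant. "
                     "Answer based on the provided context.")},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

The message list would then be sent to the chat API with temperature 0.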
Evaluation
Retrieval metrics include recall@K (fraction of relevant chunks in the top-K results) and MRR (mean reciprocal rank). Generation metrics include faithfulness (answer grounded in retrieved context), answer relevance (answers the actual question), and context precision. The RAGAS framework automates RAG evaluation. A/B testing compares chunking strategies and retrieval K values. Hallucination rate is monitored via a fact-checking pipeline.
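Recall@K and MRR follow directly from their definitions; a minimal sketch:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk ids found in the top-k retrieved."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    """queries: list of (retrieved_ids, relevant_ids) pairs. Reciprocal
    rank is 1/rank of the first relevant hit (0 if none appears); MRR
    averages it over all queries."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, cid in enumerate(retrieved, start=1):
            if cid in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```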