What is a RAG pipeline and what are its main components?

Retrieval-Augmented Generation (RAG) is a pattern that grounds LLM responses in external knowledge by retrieving relevant documents at query time and including them in the prompt context. The main components are an ingestion pipeline that chunks and embeds documents into a vector store, a retrieval layer that searches for the top-k most relevant chunks given a query embedding, and a generation layer that passes the retrieved context along with the user question to the LLM.

How does document chunking work for optimal retrieval in a RAG system?

Documents are split into overlapping fixed-size chunks or split along semantic boundaries such as paragraphs and sections. Overlap between adjacent chunks prevents losing context that straddles a boundary. Each chunk is embedded using a dense encoder model to produce a vector representation stored in the vector database alongside metadata like source URL, section title, and creation date that can be used for filtering.

How does the retrieval step find the most relevant chunks for a query?

The user query is embedded with the same encoder model used during ingestion. An approximate nearest-neighbor search such as HNSW retrieves the top-k chunks by cosine similarity in milliseconds. Optional metadata filters narrow the search space before the ANN step. A reranker model can then re-score the candidates and select a smaller final set to include in the prompt, improving precision.

How does a RAG pipeline prevent hallucinations and improve answer accuracy?

Supplying retrieved passages as grounding context steers the LLM toward synthesizing answers from source material rather than relying solely on parametric memory. This reduces hallucinations but does not eliminate them. Prompts can instruct the model to cite the passage it is drawing from and to state when the answer is not found in the context. Post-generation faithfulness checks can compare the answer against retrieved chunks and flag or block responses that introduce unsupported claims.
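
As a rough illustration of a post-generation faithfulness check, the sketch below flags answer sentences whose words mostly do not appear in the retrieved chunks. Word overlap is only a crude stand-in chosen to keep the example self-contained; the function name and the 0.5 threshold are assumptions, and a real check would more likely use an entailment model or an LLM judge.

```python
import re


def unsupported_sentences(answer: str, chunks: list[str], threshold: float = 0.5) -> list[str]:
    """Return answer sentences whose words are mostly absent from the chunks."""

    def words(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    # Vocabulary of everything the retriever supplied as context.
    context_words: set[str] = set()
    for chunk in chunks:
        context_words |= words(chunk)

    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        sentence_words = words(sentence)
        if not sentence_words:
            continue
        overlap = len(sentence_words & context_words) / len(sentence_words)
        if overlap < threshold:  # hypothetical cut-off, tune on real data
            flagged.append(sentence)
    return flagged
```

A caller could block the response, or strip the flagged sentences, whenever the returned list is non-empty.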

Low Level Design: RAG Pipeline Service

What Is a RAG Pipeline?

Retrieval-Augmented Generation (RAG) is an architecture that improves LLM output quality by retrieving relevant documents from a knowledge base at query time and injecting them into the prompt context. A RAG pipeline service manages the full lifecycle: ingesting and indexing documents, retrieving relevant chunks at query time, and assembling the final prompt for the LLM.

RAG pipeline low level design is a high-frequency interview question for ML platform and AI infrastructure roles. Candidates are expected to walk through each stage with concrete data models and design tradeoffs.

Requirements

Functional

Document ingestion: accept documents in multiple formats (PDF, HTML, plain text, Markdown).
Chunking: split documents into semantically coherent chunks.
Embedding generation: produce dense vector embeddings for each chunk.
Vector store indexing: store and index embeddings for approximate nearest neighbor (ANN) search.
Retrieval: given a query, return the top-k most relevant chunks.
LLM prompt assembly: format retrieved context and user query into a prompt.
Incremental updates: support adding, updating, and deleting documents without full re-index.

Non-Functional

Retrieval latency < 200 ms p99.
Ingestion pipeline throughput: process thousands of documents per hour.
Embedding consistency: all chunks for the same document use the same embedding model version.
Scalable to billions of chunks across large knowledge bases.

High-Level Architecture

[Ingestion API]
  |
  v
[Document Parser]
  |
  v
[Chunker]
  |
  v
[Embedding Service]
  |
  v
[Vector Store] <------[Retrieval Service]<---[Query API]
                                |
                                v
                        [Prompt Assembler]
                                |
                                v
                          [LLM Gateway]

Stage 1: Document Parsing

The parser converts raw document bytes into clean plain text with metadata extraction.

PDF: use a PDF extraction library (pdfminer, PyMuPDF) to extract text, preserving section headers.
HTML: strip tags, extract main content area (Readability algorithm), preserve heading structure.
Markdown / plain text: minimal transformation.

Extracted metadata (title, source URL, author, creation date, document_id) is stored in a relational metadata store (Postgres) alongside a reference to the raw document in object storage (S3).
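
A minimal sketch of the HTML branch, using only Python's standard-library html.parser. It strips tags, skips script/style/nav/footer content, and keeps heading structure by prefixing headings with "#" markers. The class name, the skipped-tag list, and the "#" convention are assumptions made for illustration; this is not the Readability algorithm, which scores content blocks to find the main article body.

```python
from html.parser import HTMLParser


class HeadingTextExtractor(HTMLParser):
    SKIP = {"script", "style", "nav", "footer"}          # assumed non-content tags
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    BLOCKS = {"p", "li", "div"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0
        self._heading = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._flush()            # emit any text collected before the heading
            self._heading = tag

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.HEADINGS or tag in self.BLOCKS:
            self._flush()
            if tag in self.HEADINGS:
                self._heading = None

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def _flush(self):
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        self._buf = []
        if text:
            prefix = "#" * int(self._heading[1]) + " " if self._heading else ""
            self.lines.append(prefix + text)


def html_to_text(html: str) -> str:
    parser = HeadingTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()                  # emit trailing text not closed by a block tag
    return "\n".join(parser.lines)
```

Keeping heading markers in the plain text gives a structural chunker something to split on later.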

Stage 2: Chunking

Chunking splits documents into passages that fit within the embedding model’s context window (512 tokens for many BERT-style encoders; some newer models accept several thousand) while preserving semantic coherence.

Chunking Strategies

Fixed-size with overlap: split every N tokens with a K-token overlap between adjacent chunks. Simple and predictable; good default.
Sentence boundary: split at sentence ends, grouping up to N tokens per chunk. Preserves sentence integrity.
Semantic chunking: detect topic shifts using embedding similarity between consecutive sentences; split at low-similarity boundaries. Higher quality, higher compute cost.
Structural chunking: split at document section headers. Best for well-structured documents.

A Chunk record is produced for each segment:

{
  chunk_id: UUID,
  document_id: UUID,
  chunk_index: int,
  text: string,
  token_count: int,
  start_char: int,
  end_char: int,
  metadata: map    // inherits document metadata
}

Stage 3: Embedding Generation

Each chunk is encoded into a dense vector by an embedding model (e.g., text-embedding-3-large, BGE-M3, E5-large).

Chunks are batched (e.g., 64 chunks per API call) to maximize throughput.
Embedding calls are made to an embedding service (OpenAI Embeddings API, a self-hosted model, or the LLM Gateway).
The embedding model version and dimension are stored alongside each vector to support model migration.
Failed chunks are queued for retry; poison-pill chunks (e.g., empty after parsing) are logged and skipped.

{
  chunk_id: UUID,
  embedding_model: string,
  embedding_version: string,
  dimension: int,            // e.g. 1536; depends on the model
  vector: float[dimension],
  created_at: timestamp
}
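
A minimal sketch of the batching, retry, and skip behavior described above. The embedding call is passed in as embed_fn, a hypothetical callable that maps a list of texts to a list of vectors; no specific provider API is assumed. The retry count and delays are illustrative.

```python
import time
from typing import Callable


def embed_in_batches(
    chunks: list[tuple[str, str]],                      # (chunk_id, text) pairs
    embed_fn: Callable[[list[str]], list[list[float]]],  # hypothetical embedding call
    batch_size: int = 64,
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
):
    """Return (vectors by chunk_id, skipped chunk_ids, failed chunk_ids)."""
    # Poison-pill chunks (empty after parsing) are skipped, never sent.
    skipped = [cid for cid, text in chunks if not text.strip()]
    usable = [(cid, text) for cid, text in chunks if text.strip()]

    vectors: dict[str, list[float]] = {}
    failed: list[str] = []
    for i in range(0, len(usable), batch_size):
        batch = usable[i:i + batch_size]
        texts = [text for _, text in batch]
        result = None
        for attempt in range(max_retries + 1):
            try:
                result = embed_fn(texts)
                break
            except Exception:
                if attempt < max_retries:
                    sleep(base_delay * 2 ** attempt)    # exponential backoff
        if result is None:
            failed.extend(cid for cid, _ in batch)      # hand off to a retry queue
            continue
        for (cid, _), vector in zip(batch, result):
            vectors[cid] = vector
    return vectors, skipped, failed
```

The caller would log the skipped IDs and re-enqueue the failed ones.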

Stage 4: Vector Store Indexing

Vectors are inserted into a vector store that supports ANN search. Common choices:

Pinecone: managed, serverless, easy ops.
Weaviate / Qdrant: open source, self-hostable, with hybrid search (dense + sparse).
pgvector: Postgres extension; good for moderate scale with SQL join capability.
FAISS: in-process library; requires custom persistence and serving.

Each index entry stores the vector plus the chunk_id as payload, so the retrieval service can join back to full chunk text and metadata from Postgres.

For incremental updates:

New document: chunk, embed, and upsert all chunks.
Updated document: delete old chunks by document_id filter, then re-chunk, re-embed, and upsert the new chunks.
Deleted document: delete all chunks by document_id filter from both vector store and Postgres.
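
The three update cases can be sketched against a toy in-memory index. The class and method names are hypothetical; a real vector store would expose its own upsert and delete-by-filter calls, and the Postgres rows would be updated alongside.

```python
class InMemoryIndex:
    """Toy stand-in for a vector store keyed by chunk_id."""

    def __init__(self):
        self._entries = {}   # chunk_id -> (document_id, vector)

    def upsert(self, document_id: str, chunk_vectors: dict[str, list[float]]) -> None:
        # New document: insert every chunk; existing chunk_ids are overwritten.
        for chunk_id, vector in chunk_vectors.items():
            self._entries[chunk_id] = (document_id, vector)

    def delete_document(self, document_id: str) -> int:
        # Deleted document: remove all chunks matching the document_id filter.
        stale = [cid for cid, (doc, _) in self._entries.items() if doc == document_id]
        for cid in stale:
            del self._entries[cid]
        return len(stale)

    def replace_document(self, document_id: str, chunk_vectors: dict[str, list[float]]) -> int:
        # Updated document: drop the old chunks, then upsert the re-embedded ones.
        removed = self.delete_document(document_id)
        self.upsert(document_id, chunk_vectors)
        return removed

    def chunk_ids(self, document_id: str) -> list[str]:
        return sorted(cid for cid, (doc, _) in self._entries.items() if doc == document_id)
```

Deleting by document_id rather than by chunk_id matters because re-chunking an edited document usually changes the number and boundaries of its chunks.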

Stage 5: Retrieval

At query time, the query string is embedded using the same model as the index, then an ANN search returns the top-k nearest chunks by cosine similarity.
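
To make the scoring concrete, here is a brute-force version of top-k retrieval by cosine similarity. It is an exact search over every vector, shown only to illustrate what an ANN index approximates; the function names and the dict-based index are assumptions.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0   # treat zero vectors as unrelated


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3):
    """Return the k (chunk_id, score) pairs with the highest cosine similarity."""
    scored = ((cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items())
    return [(chunk_id, score) for score, chunk_id in heapq.nlargest(k, scored)]
```

This scan costs time linear in the number of chunks per query, which is why large indexes use an ANN structure such as HNSW instead.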

Retrieval Techniques

Dense retrieval: pure vector similarity. Fast; misses exact keyword matches.
Sparse retrieval (BM25): keyword-based scoring. Good for specific terms, IDs, code.
Hybrid retrieval: fuse the dense and sparse result rankings, for example with Reciprocal Rank Fusion (RRF), which works on ranks rather than raw scores. Often gives the best recall.
Re-ranking: pass top-k candidates through a cross-encoder re-ranker for higher precision before prompt assembly.
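
Reciprocal Rank Fusion is short enough to sketch in full. Each chunk earns 1 / (k + rank) from every ranking it appears in, and the sums decide the fused order. The constant k = 60 is the value commonly used with RRF; the function name and the alphabetical tie-break are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk_ids into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk_id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

Because only ranks are used, the dense and sparse scores never need to be normalized onto a common scale.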

Retrieval parameters:

{
  query: string,
  top_k: int,           // e.g. 10 for retrieval, 3-5 after re-ranking
  filters: map,         // metadata filters (source, date range, doc type)
  min_score: float,     // discard low-relevance chunks
  retrieval_strategy: enum('dense', 'sparse', 'hybrid')
}
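
One way these parameters might be applied to a candidate list is sketched below. It handles exact-match filters only (a date range would need a richer filter type) and filters after scoring for simplicity, whereas a vector store would normally apply filters inside the search. The candidate dict layout is an assumption.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class RetrievalRequest:
    query: str
    top_k: int = 10
    filters: dict = field(default_factory=dict)
    min_score: float = 0.0
    retrieval_strategy: Literal["dense", "sparse", "hybrid"] = "hybrid"


def select_candidates(request: RetrievalRequest, candidates: list[dict]) -> list[dict]:
    """candidates: dicts with 'chunk_id', 'score', and 'metadata' keys (assumed shape)."""
    kept = [
        c for c in candidates
        if c["score"] >= request.min_score               # discard low-relevance chunks
        and all(c["metadata"].get(key) == value          # exact-match metadata filters
                for key, value in request.filters.items())
    ]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept[:request.top_k]
```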

Stage 6: Prompt Assembly

Retrieved chunks are formatted into a structured prompt for the LLM.

System: You are a helpful assistant. Answer using only the provided context.

Context:
[1] {chunk_1_text} (source: {url_1})
[2] {chunk_2_text} (source: {url_2})
...

Question: {user_query}
Answer:

Prompt assembly considerations:

Fit retrieved context within the LLM context window minus reserved space for the answer.
Order chunks by relevance score (highest first) or by document recency depending on use case.
Include source citations to enable the LLM to reference its context.
Handle the case where no relevant chunks are found (no-context fallback prompt).
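
These considerations can be sketched as a small assembler that orders chunks by score, stops when the context budget is spent, and falls back to a no-context prompt. The whitespace token counter, the budget default, and the fallback wording are assumptions; a real service would count tokens with the LLM's tokenizer.

```python
SYSTEM = "You are a helpful assistant. Answer using only the provided context."


def assemble_prompt(query: str, chunks: list[dict], max_context_tokens: int = 1000,
                    count_tokens=lambda s: len(s.split())) -> str:
    """chunks: dicts with 'text', 'url', and 'score' keys (assumed shape)."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    lines, used = [], 0
    for chunk in ranked:
        cost = count_tokens(chunk["text"])
        if used + cost > max_context_tokens:
            break                                  # budget spent; drop the rest
        lines.append(f"[{len(lines) + 1}] {chunk['text']} (source: {chunk['url']})")
        used += cost
    if not lines:
        # No-context fallback: tell the model to admit it has nothing to go on.
        return (f"System: {SYSTEM} If the context is empty, say that you do not know.\n\n"
                f"Context:\n(none found)\n\nQuestion: {query}\nAnswer:")
    context = "\n".join(lines)
    return f"System: {SYSTEM}\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The budget passed in should already exclude the space reserved for the system text, the question, and the answer.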

Ingestion Pipeline Architecture

For bulk ingestion, the pipeline is implemented as an async worker queue:

Upload API receives document, stores in S3, writes a job record to Postgres with status='pending'.
Job is enqueued to a task queue (Celery + Redis, or SQS).
Parse worker picks up job, extracts text, updates status='parsed'.
Chunk worker produces chunks, writes to Postgres, updates status='chunked'.
Embed worker batches chunks, calls embedding service, writes vectors to vector store, updates status='indexed'.

Each stage is independently scalable. Failed jobs are retried with backoff; dead-lettered after max retries.
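
The status progression and the retry policy can be sketched as follows. The status names come from the steps above; the 'dead_letter' status, the jittered backoff formula, and the job dict shape are assumptions, and a real deployment would rely on the task queue's own retry and dead-letter features.

```python
import random
import time

NEXT_STATUS = {"pending": "parsed", "parsed": "chunked", "chunked": "indexed"}


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter=random.random) -> float:
    """Exponential backoff, capped, scaled by a jitter factor in [0.5, 1.0)."""
    return min(cap, base * 2 ** attempt) * (0.5 + jitter() / 2)


def run_stage(job: dict, handler, max_retries: int = 3, sleep=time.sleep) -> dict:
    """Run one pipeline stage; advance the status on success, dead-letter on exhaustion."""
    if job["status"] not in NEXT_STATUS:
        raise ValueError(f"no stage to run from status {job['status']!r}")
    next_status = NEXT_STATUS[job["status"]]
    for attempt in range(max_retries + 1):
        try:
            handler(job)
            job["status"] = next_status
            return job
        except Exception as exc:
            job["last_error"] = str(exc)
            if attempt < max_retries:
                sleep(backoff_delay(attempt))
    job["status"] = "dead_letter"     # hypothetical terminal status
    return job
```

Jitter spreads retries out so that many workers failing at once do not all hit the embedding service again at the same instant.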

Data Freshness and Consistency

Ingestion is eventually consistent: newly uploaded documents are not immediately searchable.
For time-sensitive content, a priority queue lane can expedite specific documents.
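Document version tracking in Postgres lets a re-ingested document replace its old chunks cleanly. Postgres and the vector store are separate systems, so the swap is not atomic by default; one option is to write new chunks under a new version, switch the current version in Postgres once indexing completes, and delete the old chunks afterward.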
Embedding model upgrades require a full re-index; run new index in parallel, then cut over.

Observability

Ingestion metrics: documents queued, parse errors, chunk counts, embedding latency, indexing lag.
Retrieval metrics: query latency, top-k scores, empty-result rate, re-ranker score distribution.
Quality metrics: user feedback signals (thumbs up/down) joined to retrieved chunk_ids to identify low-performing chunks.
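
As an illustration of the quality metric, the sketch below joins feedback events to the chunk_ids that were retrieved for each answer and ranks chunks by thumbs-down rate. The event shape and the min_votes cut-off are assumptions.

```python
from collections import defaultdict


def chunk_downvote_rates(feedback, min_votes: int = 2) -> list[tuple[str, float]]:
    """feedback: iterable of (retrieved_chunk_ids, thumbs_up) pairs (assumed shape).

    Returns (chunk_id, thumbs-down rate) pairs, worst first, for chunks
    that appeared in at least `min_votes` rated answers.
    """
    votes = defaultdict(lambda: [0, 0])   # chunk_id -> [downvotes, total votes]
    for chunk_ids, thumbs_up in feedback:
        for chunk_id in chunk_ids:
            votes[chunk_id][1] += 1
            if not thumbs_up:
                votes[chunk_id][0] += 1
    rates = {cid: down / total for cid, (down, total) in votes.items()
             if total >= min_votes}
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))
```

A high rate does not prove the chunk is at fault, since every chunk in a badly rated answer shares the blame, so the ranking is best treated as a list of candidates to review.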

Common Interview Follow-Ups

How do you handle multi-lingual documents? Use a multilingual embedding model (e.g., mE5, BGE-M3); store language metadata for language-specific retrieval filters.
How do you handle very long documents? Hierarchical chunking: chunk into large parent chunks and small child chunks; retrieve child chunks, return parent context to LLM.
How do you evaluate RAG quality? Use RAGAS metrics: faithfulness, answer relevancy, context precision, context recall.
How do you prevent the LLM from hallucinating outside the context? System prompt instruction plus post-generation grounding check (verify claims against retrieved chunks).
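
The hierarchical chunking answer can be sketched in a few lines: small child chunks are what get embedded and matched, and each match is expanded to its larger parent before prompt assembly. Word counts stand in for token counts, and the function names and sizes are assumptions.

```python
def build_hierarchy(text: str, parent_size: int = 400, child_size: int = 100):
    """Split text into parent chunks, then split each parent into child chunks."""
    words = text.split()
    parents: dict[int, str] = {}
    children: list[dict] = []
    for parent_id, start in enumerate(range(0, len(words), parent_size)):
        parent_words = words[start:start + parent_size]
        parents[parent_id] = " ".join(parent_words)
        for child_start in range(0, len(parent_words), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(parent_words[child_start:child_start + child_size]),
            })
    return parents, children


def expand_to_parents(matched_children: list[dict], parents: dict[int, str]) -> list[str]:
    """Map retrieved child chunks to their parents, de-duplicated, in match order."""
    seen, context = set(), []
    for child in matched_children:
        parent_id = child["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context
```

Small children keep the embeddings focused, while the parent text gives the LLM enough surrounding context to answer.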