What Is a RAG Pipeline?
Retrieval-Augmented Generation (RAG) is an architecture that improves LLM output quality by retrieving relevant documents from a knowledge base at query time and injecting them into the prompt context. A RAG pipeline service manages the full lifecycle: ingesting and indexing documents, retrieving relevant chunks at query time, and assembling the final prompt for the LLM.
Low-level design of a RAG pipeline is a frequently asked interview question for ML platform and AI infrastructure roles. Candidates are expected to walk through each stage with concrete data models and design tradeoffs.
Requirements
Functional
- Document ingestion: accept documents in multiple formats (PDF, HTML, plain text, Markdown).
- Chunking: split documents into semantically coherent chunks.
- Embedding generation: produce dense vector embeddings for each chunk.
- Vector store indexing: store and index embeddings for approximate nearest neighbor (ANN) search.
- Retrieval: given a query, return the top-k most relevant chunks.
- LLM prompt assembly: format retrieved context and user query into a prompt.
- Incremental updates: support adding, updating, and deleting documents without a full re-index.
Non-Functional
- Retrieval latency < 200 ms p99.
- Ingestion pipeline throughput: process thousands of documents per hour.
- Embedding consistency: all chunks in an index, and the queries run against it, use the same embedding model version.
- Scalable to billions of chunks across large knowledge bases.
High-Level Architecture
[Ingestion API]
       |
       v
[Document Parser]
       |
       v
   [Chunker]
       |
       v
[Embedding Service]
       |
       v
 [Vector Store] <--- [Retrieval Service] <--- [Query API]
                             |
                             v
                     [Prompt Assembler]
                             |
                             v
                       [LLM Gateway]
Stage 1: Document Parsing
The parser converts raw document bytes into clean plain text with metadata extraction.
- PDF: use a PDF extraction library (pdfminer, PyMuPDF) to extract text, preserving section headers.
- HTML: strip tags, extract main content area (Readability algorithm), preserve heading structure.
- Markdown / plain text: minimal transformation.
Extracted metadata (title, source URL, author, creation date, document_id) is stored in a relational metadata store (Postgres) alongside a reference to the raw document in object storage (S3).
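As a rough illustration of the HTML path, the sketch below strips tags while keeping heading structure, using only Python's standard-library `html.parser`. It is a much simpler stand-in for a Readability-style main-content extractor, and the class name `HeadingAwareTextExtractor`, the helper `html_to_text`, and the set of skipped tags are assumptions made for this example.

```python
from html.parser import HTMLParser


class HeadingAwareTextExtractor(HTMLParser):
    """Collects visible text; h1-h6 headings get a '#'-style prefix."""

    SKIP = {"script", "style", "nav", "footer"}   # assumed non-content tags
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    BLOCKS = HEADINGS | {"p", "li", "div", "br"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buf = []
        self._skip_depth = 0
        self._heading = None

    def _flush(self):
        # Collapse whitespace and emit the buffered text as one line.
        text = " ".join("".join(self._buf).split())
        if text:
            prefix = "#" * int(self._heading[1]) + " " if self._heading else ""
            self.lines.append(prefix + text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCKS:
            self._flush()
            if tag in self.HEADINGS:
                self._heading = tag

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCKS:
            self._flush()
            if tag in self.HEADINGS:
                self._heading = None

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)


def html_to_text(html: str) -> str:
    parser = HeadingAwareTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block element
    return "\n".join(parser.lines)
```

Keeping headings as marked lines means a later structural chunker can still split on them.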
Stage 2: Chunking
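Chunking splits documents into passages that fit within the embedding model’s context window while preserving semantic coherence. The limit depends on the model: 512 tokens for many BERT-style encoders such as E5-large, and roughly 8K tokens for models such as text-embedding-3-large and BGE-M3. Even with a large window, chunks are usually kept to a few hundred tokens so that each vector represents one focused passage.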
Chunking Strategies
- Fixed-size with overlap: split every N tokens with a K-token overlap between adjacent chunks. Simple and predictable; good default.
- Sentence boundary: split at sentence ends, grouping up to N tokens per chunk. Preserves sentence integrity.
- Semantic chunking: detect topic shifts using embedding similarity between consecutive sentences; split at low-similarity boundaries. Higher quality, higher compute cost.
- Structural chunking: split at document section headers. Best for well-structured documents.
A Chunk record is produced for each segment:
{
chunk_id: UUID,
document_id: UUID,
chunk_index: int,
text: string,
token_count: int,
start_char: int,
end_char: int,
metadata: map // inherits document metadata
}
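The fixed-size-with-overlap strategy, producing records shaped like the Chunk above, might look like the following sketch. It counts whitespace-separated words as tokens, which is an assumption made to keep the example dependency-free; a real pipeline would count tokens with the embedding model's own tokenizer. The function name `chunk_fixed_size` and its defaults are illustrative.

```python
import re
import uuid
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    document_id: str
    chunk_index: int
    text: str
    token_count: int
    start_char: int
    end_char: int
    metadata: dict = field(default_factory=dict)


def chunk_fixed_size(document_id, text, max_tokens=512, overlap=64, metadata=None):
    """Split text every (max_tokens - overlap) tokens, sharing `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    # Assumption: a "token" is a whitespace-delimited word; keep its char span.
    spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    step = max_tokens - overlap
    chunks = []
    for index, start in enumerate(range(0, len(spans), step)):
        window = spans[start:start + max_tokens]
        start_char, end_char = window[0][0], window[-1][1]
        chunks.append(Chunk(
            chunk_id=str(uuid.uuid4()),
            document_id=document_id,
            chunk_index=index,
            text=text[start_char:end_char],
            token_count=len(window),
            start_char=start_char,
            end_char=end_char,
            metadata=dict(metadata or {}),  # inherits document metadata
        ))
        if start + max_tokens >= len(spans):
            break  # this window already reached the end of the document
    return chunks
```

Storing `start_char` and `end_char` makes it possible to highlight the cited passage in the original document later.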
Stage 3: Embedding Generation
Each chunk is encoded into a dense vector by an embedding model (e.g., text-embedding-3-large, BGE-M3, E5-large).
- Chunks are batched (e.g., 64 chunks per API call) to maximize throughput.
- Embedding calls are made to an embedding service (OpenAI Embeddings API, a self-hosted model, or the LLM Gateway).
- The embedding model version and dimension are stored alongside each vector to support model migration.
- Failed chunks are queued for retry; poison-pill chunks (e.g., empty after parsing) are logged and skipped.
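{
chunk_id: UUID,
embedding_model: string,
embedding_version: string,
dimension: int, // model-dependent, e.g. 3072 for text-embedding-3-large, 1024 for BGE-M3 and E5-large
vector: float[dimension],
created_at: timestamp
}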
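A minimal sketch of the batching, retry, and skip behavior described above follows. The `embed_fn` callable is a hypothetical stand-in for whichever embedding service is used (it takes a list of strings and returns one vector per string), and the retry loop omits the backoff and queueing a production worker would need.

```python
from datetime import datetime, timezone


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def embed_chunks(chunks, embed_fn, model, version, batch_size=64, max_attempts=3):
    """chunks: list of (chunk_id, text). Returns (records, skipped_ids, failed_ids)."""
    # Poison-pill chunks (empty after parsing) are skipped, not retried.
    usable = [(cid, text) for cid, text in chunks if text.strip()]
    skipped = [cid for cid, text in chunks if not text.strip()]
    records, failed = [], []
    for batch in batched(usable, batch_size):
        texts = [text for _, text in batch]
        vectors = None
        for _ in range(max_attempts):
            try:
                vectors = embed_fn(texts)  # hypothetical embedding call
                break
            except Exception:
                continue  # a real worker would back off before retrying
        if vectors is None:
            failed.extend(cid for cid, _ in batch)  # left for a retry queue
            continue
        for (cid, _), vector in zip(batch, vectors):
            records.append({
                "chunk_id": cid,
                "embedding_model": model,
                "embedding_version": version,
                "dimension": len(vector),
                "vector": vector,
                "created_at": datetime.now(timezone.utc),
            })
    return records, skipped, failed
```

Recording the model, version, and dimension on every record is what later allows old and new vectors to be told apart during a model migration.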
Stage 4: Vector Store Indexing
Vectors are inserted into a vector store that supports ANN search. Common choices:
- Pinecone: managed, serverless, easy ops.
- Weaviate / Qdrant: self-hosted with hybrid search (dense + sparse).
- pgvector: Postgres extension; good for moderate scale with SQL join capability.
- FAISS: in-process library; requires custom persistence and serving.
Each index entry stores the vector plus the chunk_id as payload, so the retrieval service can join back to full chunk text and metadata from Postgres.
For incremental updates:
- New document: chunk, embed, and upsert all chunks.
- Updated document: delete old chunks by document_id filter, then re-chunk, re-embed, and upsert the new chunks.
- Deleted document: delete all chunks by document_id filter from both vector store and Postgres.
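The three update paths can be sketched against a toy in-memory store. `InMemoryVectorStore` and `replace_document` are hypothetical names for this example; a real vector store would expose its own upsert and filtered-delete calls, and the matching rows in Postgres would be changed in the same job.

```python
class InMemoryVectorStore:
    """Toy stand-in for a vector store with document_id stored as payload."""

    def __init__(self):
        self._entries = {}  # chunk_id -> (document_id, vector)

    def upsert(self, chunk_id, document_id, vector):
        self._entries[chunk_id] = (document_id, vector)

    def delete_by_document(self, document_id):
        stale = [cid for cid, (doc, _) in self._entries.items() if doc == document_id]
        for cid in stale:
            del self._entries[cid]
        return len(stale)

    def chunk_ids(self, document_id):
        return sorted(cid for cid, (doc, _) in self._entries.items() if doc == document_id)


def replace_document(store, document_id, new_chunks):
    """Update path: drop the old chunks, then upsert the re-embedded ones.

    new_chunks: list of (chunk_id, vector). Returns how many old chunks were removed.
    """
    removed = store.delete_by_document(document_id)
    for chunk_id, vector in new_chunks:
        store.upsert(chunk_id, document_id, vector)
    return removed
```

A new document is the same call with nothing to remove, and a deletion is `delete_by_document` on its own.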
Stage 5: Retrieval
At query time, the query string is embedded using the same model as the index, then an ANN search returns the top-k nearest chunks by cosine similarity.
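To make the scoring concrete, here is an exact (brute-force) top-k search by cosine similarity in plain Python. It is a sketch of what an ANN index approximates, not an ANN algorithm itself; production stores use structures such as HNSW or IVF to avoid scanning every vector. The `index` dictionary layout is an assumption for the example.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat zero vectors as dissimilar


def top_k(query_vector, index, k=10):
    """index: dict of chunk_id -> vector. Returns [(chunk_id, score)], best first."""
    scored = ((cosine_similarity(query_vector, vec), cid) for cid, vec in index.items())
    return [(cid, score) for score, cid in heapq.nlargest(k, scored)]
```

The query vector must come from the same embedding model and version as the indexed vectors, otherwise the similarity scores are meaningless.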
Retrieval Techniques
- Dense retrieval: pure vector similarity. Fast; misses exact keyword matches.
- Sparse retrieval (BM25): keyword-based scoring. Good for specific terms, IDs, code.
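- Hybrid retrieval: combine the dense and sparse result lists via Reciprocal Rank Fusion (RRF), which fuses rank positions rather than raw scores, so the two scoring scales never need to be normalized against each other. Best recall for most use cases.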
- Re-ranking: pass top-k candidates through a cross-encoder re-ranker for higher precision before prompt assembly.
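Reciprocal Rank Fusion is small enough to show in full. In this sketch each chunk scores the sum of 1 / (k + rank) over every ranked list it appears in; the constant k = 60 is the commonly used default, and the function name is illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked chunk_id lists (best first). Returns fused order."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by chunk_id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

For example, fusing a dense ranking `["a", "b", "c"]` with a sparse ranking `["c", "a", "d"]` puts `a` and `c` ahead of `b` and `d`, because they appear in both lists.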
Retrieval parameters:
{
query: string,
top_k: int, // e.g. 10 for retrieval, 3-5 after re-ranking
filters: map, // metadata filters (source, date range, doc type)
min_score: float, // discard low-relevance chunks
retrieval_strategy: enum('dense', 'sparse', 'hybrid')
}
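One way to model these parameters and apply them to a candidate list is sketched below. It handles only equality filters and applies them after retrieval, which is an assumption made for brevity; real vector stores usually push filters into the ANN search, and date ranges would need range predicates. The names `RetrievalParams` and `apply_params` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalParams:
    query: str
    top_k: int = 10
    filters: dict = field(default_factory=dict)
    min_score: float = 0.0
    retrieval_strategy: str = "hybrid"

    def __post_init__(self):
        if self.top_k <= 0:
            raise ValueError("top_k must be positive")
        if self.retrieval_strategy not in {"dense", "sparse", "hybrid"}:
            raise ValueError(f"unknown strategy: {self.retrieval_strategy}")


def apply_params(candidates, params):
    """candidates: list of dicts with 'chunk_id', 'score', and 'metadata' keys."""
    kept = [
        c for c in candidates
        if c["score"] >= params.min_score
        and all(c["metadata"].get(key) == value for key, value in params.filters.items())
    ]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept[:params.top_k]
```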
Stage 6: Prompt Assembly
Retrieved chunks are formatted into a structured prompt for the LLM.
System: You are a helpful assistant. Answer using only the provided context.
Context:
[1] {chunk_1_text} (source: {url_1})
[2] {chunk_2_text} (source: {url_2})
...
Question: {user_query}
Answer:
Prompt assembly considerations:
- Fit retrieved context within the LLM context window minus reserved space for the answer.
- Order chunks by relevance score (highest first) or by document recency depending on use case.
- Include numbered source citations so the LLM can attribute each claim to a specific chunk and users can verify the answer.
- Handle the case where no relevant chunks are found (no-context fallback prompt).
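The considerations above can be combined into a small assembler. This is a sketch under stated assumptions: tokens are approximated by whitespace-separated words (a real service would use the LLM's tokenizer), the window and reserve sizes are placeholder values, and the `NO_CONTEXT` fallback wording is invented for the example.

```python
SYSTEM = "You are a helpful assistant. Answer using only the provided context."
# Hypothetical fallback instruction used when nothing fits or nothing was retrieved.
NO_CONTEXT = "No relevant context was found. Say that you cannot answer from the knowledge base."


def count_tokens(text):
    return len(text.split())  # assumption: whitespace words approximate tokens


def assemble_prompt(query, chunks, context_window=8192, reserved_for_answer=1024):
    """chunks: list of dicts with 'text', 'url', and 'score' keys."""
    budget = context_window - reserved_for_answer - count_tokens(SYSTEM) - count_tokens(query)
    lines = []
    # Highest-relevance chunks first; stop at the first one that does not fit.
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        line = f"[{len(lines) + 1}] {chunk['text']} (source: {chunk['url']})"
        cost = count_tokens(line)
        if cost > budget:
            break
        lines.append(line)
        budget -= cost
    if not lines:
        return f"System: {SYSTEM}\n{NO_CONTEXT}\nQuestion: {query}\nAnswer:"
    context = "\n".join(lines)
    return f"System: {SYSTEM}\nContext:\n{context}\nQuestion: {query}\nAnswer:"
```

Stopping at the first chunk that overflows keeps the included context a contiguous prefix of the relevance ranking; skipping it and trying smaller chunks is an equally valid choice.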
Ingestion Pipeline Architecture
For bulk ingestion, the pipeline is implemented as an async worker queue:
- Upload API receives document, stores in S3, writes a job record to Postgres with status='pending'.
- Job is enqueued to a task queue (Celery + Redis, or SQS).
- Parse worker picks up job, extracts text, updates status='parsed'.
- Chunk worker produces chunks, writes to Postgres, updates status='chunked'.
- Embed worker batches chunks, calls embedding service, writes vectors to vector store, updates status='indexed'.
Each stage is independently scalable. Failed jobs are retried with backoff; dead-lettered after max retries.
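The status transitions and retry policy can be simulated in a single process as follows. In the design above each stage is a separate worker pulling from a queue; this sketch runs the stages in a loop only to show the state machine, and `run_job`, `backoff_delay`, the `dead_letter` status name, and the `sleep` hook are assumptions for the example.

```python
STAGES = ["pending", "parsed", "chunked", "indexed"]


def backoff_delay(attempt, base=2.0, cap=300.0):
    """Exponential backoff in seconds: base, 2*base, 4*base, ... capped at `cap`."""
    return min(cap, base * (2 ** (attempt - 1)))


def run_job(job, handlers, max_retries=3, sleep=lambda seconds: None):
    """job: dict with a 'status' key. handlers: current status -> callable(job)."""
    while job["status"] != "indexed":
        handler = handlers[job["status"]]
        next_status = STAGES[STAGES.index(job["status"]) + 1]
        for attempt in range(1, max_retries + 1):
            try:
                handler(job)
                job["status"] = next_status
                break
            except Exception as exc:
                job["last_error"] = str(exc)
                if attempt == max_retries:
                    job["status"] = "dead_letter"  # give up after max retries
                    return job
                sleep(backoff_delay(attempt))
    return job
```

Passing `time.sleep` as the `sleep` argument would make the delays real; the default no-op keeps the sketch fast to run.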
Data Freshness and Consistency
- Ingestion is eventually consistent: newly uploaded documents are not immediately searchable.
- For time-sensitive content, a priority queue lane can expedite specific documents.
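- Document version tracking in Postgres lets a re-ingested document replace its old chunks cleanly. One way to do this is to tag chunks with a document version and have readers see only the version marked active, so the switch from old to new chunks is a single Postgres transaction.
- Embedding model upgrades require a full re-index, because vectors produced by different models are not comparable; build the new index in parallel, then cut over.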
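The parallel-index cutover for a model upgrade can be sketched as an alias that points at one index at a time. `IndexRegistry` is a hypothetical class for this example; managed vector stores offer comparable alias or collection-swap features under their own names.

```python
class IndexRegistry:
    """Tracks several indexes and which one currently serves queries."""

    def __init__(self):
        self._indexes = {}  # index name -> {"model_version": str, "vectors": dict}
        self._active = None

    def create(self, name, model_version):
        self._indexes[name] = {"model_version": model_version, "vectors": {}}

    def upsert(self, name, chunk_id, vector):
        self._indexes[name]["vectors"][chunk_id] = vector

    def cut_over(self, name, expected_chunk_ids):
        # Refuse to switch until every chunk has been re-embedded into the new index.
        missing = set(expected_chunk_ids) - set(self._indexes[name]["vectors"])
        if missing:
            raise RuntimeError(f"{len(missing)} chunks not yet re-embedded")
        self._active = name

    def active(self):
        return self._active, self._indexes[self._active]["model_version"]
```

Because the query path reads the active model version from the same place, queries are always embedded with the model that built the index they search.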
Observability
- Ingestion metrics: documents queued, parse errors, chunk counts, embedding latency, indexing lag.
- Retrieval metrics: query latency, top-k scores, empty-result rate, re-ranker score distribution.
- Quality metrics: user feedback signals (thumbs up/down) joined to retrieved chunk_ids to identify low-performing chunks.
Common Interview Follow-Ups
- How do you handle multi-lingual documents? Use a multilingual embedding model (e.g., mE5, BGE-M3); store language metadata for language-specific retrieval filters.
- How do you handle very long documents? Hierarchical chunking: chunk into large parent chunks and small child chunks; retrieve child chunks, return parent context to LLM.
- How do you evaluate RAG quality? Use RAGAS metrics: faithfulness, answer relevancy, context precision, context recall.
- How do you prevent the LLM from hallucinating outside the context? System prompt instruction plus post-generation grounding check (verify claims against retrieved chunks).
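For the long-document follow-up, the parent-child lookup might be sketched like this. The function name `expand_to_parents` and the two lookup dictionaries are assumptions; in the design above they would be backed by the chunk tables in Postgres.

```python
def expand_to_parents(child_hits, child_to_parent, parents, max_parents=3):
    """Map retrieved child chunks to their parent chunks.

    child_hits: [(child_id, score)] best first.
    Returns parent texts, deduplicated, in order of their best-scoring child.
    """
    seen, out = set(), []
    for child_id, _score in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            out.append(parents[parent_id])
        if len(out) == max_parents:
            break
    return out
```

Small child chunks keep the similarity search precise, while the larger parent text gives the LLM enough surrounding context to answer.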