Low Level Design: RAG Pipeline Service

What Is a RAG Pipeline?

Retrieval-Augmented Generation (RAG) is an architecture that improves LLM output quality by retrieving relevant documents from a knowledge base at query time and injecting them into the prompt context. A RAG pipeline service manages the full lifecycle: ingesting and indexing documents, retrieving relevant chunks at query time, and assembling the final prompt for the LLM.

Low-level design of a RAG pipeline is a common interview question for ML platform and AI infrastructure roles. Candidates are expected to walk through each stage with concrete data models and design tradeoffs.

Requirements

Functional

  • Document ingestion: accept documents in multiple formats (PDF, HTML, plain text, Markdown).
  • Chunking: split documents into semantically coherent chunks.
  • Embedding generation: produce dense vector embeddings for each chunk.
  • Vector store indexing: store and index embeddings for approximate nearest neighbor (ANN) search.
  • Retrieval: given a query, return the top-k most relevant chunks.
  • LLM prompt assembly: format retrieved context and user query into a prompt.
  • Incremental updates: support adding, updating, and deleting documents without full re-index.

Non-Functional

  • Retrieval latency < 200 ms p99.
  • Ingestion pipeline throughput: process thousands of documents per hour.
  • Embedding consistency: all chunks for the same document use the same embedding model version.
  • Scalable to billions of chunks across large knowledge bases.

High-Level Architecture

[Ingestion API]
  |
  v
[Document Parser]
  |
  v
[Chunker]
  |
  v
[Embedding Service]
  |
  v
[Vector Store] <------[Retrieval Service]<---[Query API]
                                |
                                v
                        [Prompt Assembler]
                                |
                                v
                          [LLM Gateway]

Stage 1: Document Parsing

The parser converts raw document bytes into clean plain text with metadata extraction.

  • PDF: use a PDF extraction library (pdfminer, PyMuPDF) to extract text, preserving section headers.
  • HTML: strip tags, extract main content area (Readability algorithm), preserve heading structure.
  • Markdown / plain text: minimal transformation.

Extracted metadata (title, source URL, author, creation date, document_id) is stored in a relational metadata store (Postgres) alongside a reference to the raw document in object storage (S3).

Stage 2: Chunking

Chunking splits documents into passages that fit within the embedding model’s input limit (512 tokens for many BERT-style encoders; some newer models accept several thousand) while preserving semantic coherence.

Chunking Strategies

  • Fixed-size with overlap: split every N tokens with a K-token overlap between adjacent chunks. Simple and predictable; good default.
  • Sentence boundary: split at sentence ends, grouping up to N tokens per chunk. Preserves sentence integrity.
  • Semantic chunking: detect topic shifts using embedding similarity between consecutive sentences; split at low-similarity boundaries. Higher quality, higher compute cost.
  • Structural chunking: split at document section headers. Best for well-structured documents.
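
The fixed-size-with-overlap strategy can be sketched as below. This is a minimal illustration, not a production chunker: the function name chunk_fixed_size is hypothetical, and whitespace splitting stands in for the embedding model's real tokenizer.

```python
from typing import List


def chunk_fixed_size(tokens: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens
    with the previous window."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace tokens stand in for real tokenizer output.
tokens = "a b c d e f g h i j".split()
chunks = chunk_fixed_size(tokens, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With N=4 and K=1, each chunk starts with the last token of the previous one, so a sentence cut at a boundary still appears whole in at least one neighbor when K is large enough.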

A Chunk record is produced for each segment:

{
  chunk_id: UUID,
  document_id: UUID,
  chunk_index: int,
  text: string,
  token_count: int,
  start_char: int,
  end_char: int,
  metadata: map    // inherits document metadata
}

Stage 3: Embedding Generation

Each chunk is encoded into a dense vector by an embedding model (e.g., text-embedding-3-large, BGE-M3, E5-large).

  • Chunks are batched (e.g., 64 chunks per API call) to maximize throughput.
  • Embedding calls are made to an embedding service (OpenAI Embeddings API, a self-hosted model, or the LLM Gateway).
  • The embedding model version and dimension are stored alongside each vector to support model migration.
  • Failed chunks are queued for retry; poison-pill chunks (e.g., empty after parsing) are logged and skipped.
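
An Embedding record is stored for each chunk:

{
  chunk_id: UUID,
  embedding_model: string,
  embedding_version: string,
  dimension: int,            // e.g. 3072 for text-embedding-3-large
  vector: float[dimension],
  created_at: timestamp
}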
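
The batching, retry, and poison-pill handling described above might look like the following sketch. The function embed_in_batches and its parameters are hypothetical, and embed_fn is a placeholder for whichever embedding client is actually used; no real API is called here.

```python
import time
from typing import Callable, Dict, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
    max_retries: int = 3,
    base_delay: float = 0.5,
) -> Dict[int, List[float]]:
    """Return {original index -> vector}, skipping empty chunks."""
    # Poison-pill chunks (empty after parsing) are skipped, not embedded.
    pending = [(i, t) for i, t in enumerate(texts) if t.strip()]
    results: Dict[int, List[float]] = {}
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                vectors = embed_fn([t for _, t in batch])
                break
            except Exception:
                if attempt == max_retries:
                    raise  # caller decides whether to dead-letter the batch
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        for (i, _), vec in zip(batch, vectors):
            results[i] = vec
    return results


# Assumption: a fake embedder that maps each text to a 1-d vector of its length.
fake_embed = lambda batch: [[float(len(t))] for t in batch]
print(embed_in_batches(["ab", "", "cde"], fake_embed, batch_size=2))
```

Keying results by the original index keeps vectors aligned with their chunks even when some chunks are skipped.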

Stage 4: Vector Store Indexing

Vectors are inserted into a vector store that supports ANN search. Common choices:

  • Pinecone: managed, serverless, easy ops.
  • Weaviate / Qdrant: self-hosted with hybrid search (dense + sparse).
  • pgvector: Postgres extension; good for moderate scale with SQL join capability.
  • FAISS: in-process library; requires custom persistence and serving.

Each index entry stores the vector plus the chunk_id and document_id as payload (along with any metadata fields used for filtering), so the retrieval service can join back to full chunk text and metadata from Postgres and delete by document_id.

For incremental updates:

  • New document: chunk, embed, and upsert all chunks.
  • Updated document: delete old chunks by document_id filter, then re-chunk, re-embed, and upsert the new chunks.
  • Deleted document: delete all chunks by document_id filter from both vector store and Postgres.
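
The three update paths can be illustrated with a toy in-memory index. This is a sketch only: InMemoryVectorIndex and its methods are hypothetical and stand in for a real vector store's upsert and filtered-delete operations.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InMemoryVectorIndex:
    """Toy stand-in for a vector store keyed by chunk_id."""
    vectors: Dict[str, List[float]] = field(default_factory=dict)
    doc_of: Dict[str, str] = field(default_factory=dict)  # chunk_id -> document_id

    def upsert(self, chunk_id: str, document_id: str, vector: List[float]) -> None:
        self.vectors[chunk_id] = vector
        self.doc_of[chunk_id] = document_id

    def delete_by_document(self, document_id: str) -> int:
        """Delete every chunk whose payload matches document_id."""
        stale = [cid for cid, did in self.doc_of.items() if did == document_id]
        for cid in stale:
            del self.vectors[cid]
            del self.doc_of[cid]
        return len(stale)

    def replace_document(self, document_id: str, new_chunks: Dict[str, List[float]]) -> None:
        """Updated document: drop old chunks, then upsert the re-embedded ones."""
        self.delete_by_document(document_id)
        for cid, vec in new_chunks.items():
            self.upsert(cid, document_id, vec)


index = InMemoryVectorIndex()
index.upsert("c1", "docA", [0.1, 0.2])
index.upsert("c2", "docA", [0.3, 0.4])
index.upsert("c3", "docB", [0.5, 0.6])
index.replace_document("docA", {"c4": [0.7, 0.8]})
print(sorted(index.vectors))
```

In a real deployment the delete and the upsert hit a remote store and are not atomic, so readers may briefly see neither or both versions unless a version filter is applied.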

Stage 5: Retrieval

At query time, the query string is embedded using the same model as the index, then an ANN search returns the top-k nearest chunks by cosine similarity.

Retrieval Techniques

  • Dense retrieval: pure vector similarity. Fast; misses exact keyword matches.
  • Sparse retrieval (BM25): keyword-based scoring. Good for specific terms, IDs, code.
  • Hybrid retrieval: combine the dense and sparse result rankings via Reciprocal Rank Fusion (RRF), which uses each chunk's rank in every list rather than its raw score. Best recall for most use cases.
  • Re-ranking: pass top-k candidates through a cross-encoder re-ranker for higher precision before prompt assembly.
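
Reciprocal Rank Fusion can be sketched in a few lines. Each chunk receives 1 / (k + rank) from every ranking in which it appears, and the contributions are summed. The function name is hypothetical; k = 60 is a commonly used default, not a value taken from this article.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk_ids into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score descending; break ties by chunk_id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense = ["c1", "c2", "c3"]   # assumed output of the dense retriever
sparse = ["c3", "c1", "c4"]  # assumed output of BM25
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Because only ranks are used, the dense and sparse scores never need to be normalized onto a common scale.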

Retrieval parameters:

{
  query: string,
  top_k: int,           // e.g. 10 for retrieval, 3-5 after re-ranking
  filters: map,         // metadata filters (source, date range, doc type)
  min_score: float,     // discard low-relevance chunks
  retrieval_strategy: enum('dense', 'sparse', 'hybrid')
}
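
As a rough illustration of how top_k and min_score interact, the sketch below runs an exact (brute-force) cosine search over a small in-memory index. A production system would use an ANN index instead; the function names and the dict-based index are assumptions made for the example.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: List[float],
    index: Dict[str, List[float]],
    top_k: int,
    min_score: float = float("-inf"),
) -> List[Tuple[str, float]]:
    """Return up to top_k (chunk_id, score) pairs with score >= min_score."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in index.items()]
    scored = [pair for pair in scored if pair[1] >= min_score]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = retrieve([1.0, 0.0], index, top_k=2)
print(hits)
```

Applying min_score before truncating to top_k means a query with no good matches returns fewer than k chunks, which feeds the no-context fallback at prompt assembly.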

Stage 6: Prompt Assembly

Retrieved chunks are formatted into a structured prompt for the LLM.

System: You are a helpful assistant. Answer using only the provided context.

Context:
[1] {chunk_1_text} (source: {url_1})
[2] {chunk_2_text} (source: {url_2})
...

Question: {user_query}
Answer:

Prompt assembly considerations:

  • Fit retrieved context within the LLM context window minus reserved space for the answer.
  • Order chunks by relevance score (highest first) or by document recency depending on use case.
  • Include a source identifier with each chunk so the LLM can cite the passages it used in its answer.
  • Handle the case where no relevant chunks are found (no-context fallback prompt).
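
A minimal sketch of these considerations follows: chunks are ordered by score, packed greedily into a token budget, and a fallback prompt is returned when nothing fits. The function assemble_prompt, the whitespace token counter, and the fallback wording are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

SYSTEM = "System: You are a helpful assistant. Answer using only the provided context."
# Hypothetical fallback wording; the article does not specify one.
NO_CONTEXT_SYSTEM = (
    "System: You are a helpful assistant. No relevant context was found; "
    "say so instead of guessing."
)


def assemble_prompt(
    query: str,
    chunks: List[Tuple[str, str, float]],  # (text, source, relevance score)
    max_context_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> str:
    ranked = sorted(chunks, key=lambda c: c[2], reverse=True)
    lines, used = [], 0
    for text, source, _score in ranked:
        cost = count_tokens(text)
        if used + cost > max_context_tokens:
            continue  # skip chunks that do not fit; a smaller one may still fit
        lines.append(f"[{len(lines) + 1}] {text} (source: {source})")
        used += cost
    if not lines:
        return f"{NO_CONTEXT_SYSTEM}\n\nQuestion: {query}\nAnswer:"
    context = "\n".join(lines)
    return f"{SYSTEM}\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"


chunks = [
    ("alpha beta", "u1", 0.9),
    ("one two three four five", "u2", 0.8),
    ("gamma", "u3", 0.5),
]
prompt = assemble_prompt("What is alpha?", chunks, max_context_tokens=3)
print(prompt)
```

The caller is expected to set max_context_tokens to the model's context window minus the system text, the question, and the space reserved for the answer.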

Ingestion Pipeline Architecture

For bulk ingestion, the pipeline is implemented as an async worker queue:

  1. Upload API receives document, stores in S3, writes a job record to Postgres with status='pending'.
  2. Job is enqueued to a task queue (Celery + Redis, or SQS).
  3. Parse worker picks up job, extracts text, updates status='parsed'.
  4. Chunk worker produces chunks, writes to Postgres, updates status='chunked'.
  5. Embed worker batches chunks, calls embedding service, writes vectors to vector store, updates status='indexed'.

Each stage is independently scalable. Failed jobs are retried with backoff; dead-lettered after max retries.
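
The status transitions and retry policy can be sketched as a small state machine. This is an illustration under assumptions: the Job class, the advance function, the 'dead_letter' status name, and the backoff constants are hypothetical, and a real deployment would persist status in Postgres and let the task queue schedule retries.

```python
from dataclasses import dataclass
from typing import Callable, Dict

STAGES = ["pending", "parsed", "chunked", "indexed"]


@dataclass
class Job:
    job_id: str
    status: str = "pending"
    attempts: int = 0  # failed attempts at the current stage


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff in seconds for the given 1-based attempt number."""
    return min(cap, base * 2 ** (attempt - 1))


def advance(job: Job, handlers: Dict[str, Callable[[Job], None]], max_retries: int = 3) -> Job:
    """Run the worker for the job's current status and move it one stage on."""
    if job.status in ("indexed", "dead_letter"):
        return job  # terminal states
    try:
        handlers[job.status](job)
    except Exception:
        job.attempts += 1
        if job.attempts > max_retries:
            job.status = "dead_letter"
        return job  # otherwise stay in place; caller re-enqueues after a delay
    job.status = STAGES[STAGES.index(job.status) + 1]
    job.attempts = 0
    return job


noop = lambda job: None
handlers = {"pending": noop, "parsed": noop, "chunked": noop}  # parse, chunk, embed
job = Job("j1")
while job.status != "indexed":
    advance(job, handlers)
print(job)
```

Keying handlers by the job's current status keeps each worker stateless, which is what allows the stages to scale independently.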

Data Freshness and Consistency

  • Ingestion is eventually consistent: newly uploaded documents are not immediately searchable.
  • For time-sensitive content, a priority queue lane can expedite specific documents.
  • Document version tracking in Postgres lets retrieval filter on the active version, so a re-ingested document replaces its old chunks without queries seeing a mix of old and new.
  • Embedding model upgrades require a full re-index; run new index in parallel, then cut over.

Observability

  • Ingestion metrics: documents queued, parse errors, chunk counts, embedding latency, indexing lag.
  • Retrieval metrics: query latency, top-k scores, empty-result rate, re-ranker score distribution.
  • Quality metrics: user feedback signals (thumbs up/down) joined to retrieved chunk_ids to identify low-performing chunks.

Common Interview Follow-Ups

  • How do you handle multi-lingual documents? Use a multilingual embedding model (e.g., mE5, BGE-M3); store language metadata for language-specific retrieval filters.
  • How do you handle very long documents? Hierarchical chunking: chunk into large parent chunks and small child chunks; retrieve child chunks, return parent context to LLM.
  • How do you evaluate RAG quality? Use RAGAS metrics: faithfulness, answer relevancy, context precision, context recall.
  • How do you prevent the LLM from hallucinating outside the context? System prompt instruction plus post-generation grounding check (verify claims against retrieved chunks).
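
The hierarchical chunking answer can be illustrated with a small sketch: retrieval matches small child chunks, and the pipeline then swaps each hit for its larger parent before prompt assembly. The function expand_to_parents and both lookup maps are hypothetical.

```python
from typing import Dict, List


def expand_to_parents(
    child_hits: List[str],
    parent_of: Dict[str, str],
    parent_text: Dict[str, str],
) -> List[str]:
    """Map ranked child chunk_ids to parent texts, deduplicated in rank order."""
    seen = set()
    contexts = []
    for child_id in child_hits:
        parent_id = parent_of[child_id]
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            contexts.append(parent_text[parent_id])
    return contexts


parent_of = {"c1": "p1", "c2": "p1", "c3": "p2"}
parent_text = {"p1": "Section 1 full text", "p2": "Section 2 full text"}
contexts = expand_to_parents(["c1", "c2", "c3"], parent_of, parent_text)
print(contexts)
```

Deduplicating by parent avoids spending the context budget on the same section twice when several of its children rank highly.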
