Introduction
RAG (Retrieval-Augmented Generation) grounds LLM responses in external knowledge retrieved at query time. By injecting relevant retrieved context into the prompt, RAG reduces hallucination and gives the model access to up-to-date information without retraining.
Document Ingestion Pipeline
The pipeline ingests documents (PDFs, HTML, markdown, databases), extracts text, chunks it, generates embeddings, and stores them in a vector database. It is triggered by document upload or a scheduled crawl. A Document record holds: doc_id, source_url, content_type, raw_text, created_at, and chunk_count.
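The Document record described above can be sketched as a Python dataclass; the field names match those listed, but the class itself is illustrative:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Document:
    """Record created for each ingested source."""
    doc_id: str
    source_url: str
    content_type: str      # e.g. "application/pdf", "text/html"
    raw_text: str          # extracted plain text
    created_at: datetime
    chunk_count: int = 0   # filled in after chunking
```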
Chunking Strategy
Documents are split into chunks small enough for the embedding model’s input limit and for assembly into the LLM’s prompt. Fixed-size chunking uses 512 tokens with a 50-token overlap; the overlap preserves context across chunk boundaries. Semantic chunking splits at paragraph or section boundaries rather than at fixed sizes. Recursive character splitting tries double newlines first, then single newlines, then sentence boundaries, until every chunk is small enough. Each chunk carries metadata: doc_id, chunk_index, page_number, and section_header.
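The recursive splitting strategy can be sketched as follows, using a character budget as a stand-in for the 512-token limit (the function name, budget, and separator list are illustrative):

```python
def recursive_split(text, max_chars=2000, separators=("\n\n", "\n", ". ")):
    """Split on the coarsest separator first; recurse into finer
    separators only for pieces that are still too large."""
    if len(text) <= max_chars or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_chars:
            current = candidate          # keep packing this chunk
            continue
        if current:
            chunks.append(current)       # flush the full chunk
        if len(part) > max_chars:
            chunks.extend(recursive_split(part, max_chars, rest))
            current = ""
        else:
            chunks.append(part)
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

A production splitter would also attach the overlap and per-chunk metadata described above.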
Embedding Generation
Each chunk is converted to a dense vector using an embedding model. OpenAI’s text-embedding-3-small produces 1536-dimensional vectors; local models like E5-large produce 1024-dimensional vectors. Batching processes 100 chunks per API call for cost efficiency. Embeddings are stored in the vector database alongside chunk text and metadata. When the embedding model changes, a full reindex is required.
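Batching can be sketched independently of any particular provider; here `embed_fn` is a hypothetical wrapper around whatever embedding API is in use, mapping a list of strings to a list of vectors:

```python
def embed_in_batches(chunks, embed_fn, batch_size=100):
    """Call embed_fn on successive slices of `chunks` so each API
    call carries up to batch_size inputs; returns one vector per chunk."""
    vectors = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_fn(chunks[i:i + batch_size]))
    return vectors
```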
Vector Retrieval
The user query is embedded using the same embedding model. An ANN search in the vector database returns the top-K most similar chunks (typically K=5 to 20). Cosine similarity (dot product of normalized vectors) is the standard similarity metric. Hybrid retrieval combines dense ANN results with sparse BM25 keyword results for better recall. A cross-encoder reranking model rescores the top-K candidates for improved precision.
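Because cosine similarity over normalized vectors reduces to a dot product, scoring can be illustrated with an exact scan standing in for the ANN search a real vector database would perform (all names here are illustrative):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query_vec, index, k=5):
    """index: list of (chunk_id, unit_vector). Cosine similarity of
    unit vectors is their dot product; a vector database replaces this
    exhaustive scan with an approximate nearest-neighbor search."""
    q = normalize(query_vec)
    scored = [(sum(a * b for a, b in zip(q, v)), cid) for cid, v in index]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]
```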
Context Assembly
Retrieved chunks are assembled into a context string. Ordering matters: chunks are placed in relevance order, or re-sorted into document order so that surrounding context reads coherently. Deduplication removes near-identical chunks using SimHash. Context truncation keeps the total within the LLM’s context window (e.g., 128K tokens). Citation metadata records which chunk each piece of context came from, enabling attribution in the response.
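A simplified sketch of assembly, in which exact-match deduplication stands in for SimHash and a character budget stands in for the token-based context limit (the `[doc#index]` citation tag format is an assumption):

```python
def assemble_context(chunks, max_chars=8000):
    """chunks: list of (doc_id, chunk_index, text), already sorted by
    relevance. Skip duplicates, stop at the budget, and tag each piece
    with its source for later citation."""
    seen, parts, used = set(), [], 0
    for doc_id, idx, text in chunks:
        if text in seen:              # exact-match stand-in for SimHash
            continue
        seen.add(text)
        piece = f"[{doc_id}#{idx}] {text}"
        if used + len(piece) > max_chars:
            break                     # context budget exhausted
        parts.append(piece)
        used += len(piece)
    return "\n\n".join(parts)
```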
LLM Prompt Construction
The system prompt defines the assistant persona and instructions. Retrieved context is injected as a user or system turn, and the user query is appended last. Example pattern: system = “You are a helpful assistant. Answer based on the provided context.” | context = [retrieved chunks] | user = [user question]. Temperature is set to 0 to favor deterministic, factual answers. Structured output prompting is used when parseable responses are required.
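The prompt pattern above, written as a chat-style message list; the system wording comes from the example, while the context formatting is an assumption:

```python
def build_messages(context_chunks, question):
    """Assemble chat messages: system instructions first, then the
    retrieved context and the user question in the final user turn."""
    context = "\n\n".join(context_chunks)
    return [
        {"role": "system",
         "content": ("You are a helpful assistant. "
                     "Answer based on the provided context.")},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

The message list would then be sent to the chat API with temperature 0.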
Evaluation
Retrieval metrics include recall@K (fraction of relevant chunks in the top-K results) and MRR (mean reciprocal rank). Generation metrics include faithfulness (answer grounded in retrieved context), answer relevance (answers the actual question), and context precision. The RAGAS framework automates RAG evaluation. A/B testing compares chunking strategies and retrieval K values. Hallucination rate is monitored via a fact-checking pipeline.
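Recall@K and MRR follow directly from their definitions; a minimal sketch:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk ids found in the top-k retrieved."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    """queries: list of (retrieved_ids, relevant_ids) pairs. Reciprocal
    rank is 1/rank of the first relevant hit (0 if none appears); MRR
    averages it over all queries."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, cid in enumerate(retrieved, start=1):
            if cid in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```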