AI/ML Interview: Large Language Models — LLM Inference, Fine-Tuning, RAG, Prompt Engineering, Hallucination, Deployment

Large Language Models (LLMs) like GPT-4, Claude, and LLaMA have transformed software engineering. Understanding how to deploy, fine-tune, and evaluate LLMs is now a critical skill tested in ML engineering interviews. This guide covers the practical aspects of working with LLMs — from inference optimization to RAG pipelines to production deployment.

LLM Inference: How Generation Works

Text generation is autoregressive: the model generates one token at a time, using all previous tokens as context. Generation has two phases:

(1) Prefill — the entire input prompt is processed in parallel. Self-attention runs over all input tokens simultaneously (fully parallelizable on GPU). Output: the KV cache (key-value pairs for each layer and each input token).

(2) Decode — output tokens are generated one at a time. Each new token attends to all previous tokens via the KV cache (no recomputation of previous attention), and its own K/V is appended to the cache. This phase is sequential and memory-bound (limited by GPU memory bandwidth, not compute).

Why inference is slow: the decode phase generates tokens sequentially. A 500-token response requires 500 forward passes, and each pass reads the entire KV cache from GPU memory. For a 7B-parameter model with 32 layers and a 4096-token context, the KV cache is roughly 2 GB at FP16 (about 1 GB at 8-bit precision). Reading gigabytes from GPU memory 500 times is the bottleneck, not the matrix multiplications.

Optimizations: (1) KV cache quantization — store the cache in FP8 or INT8 instead of FP16, halving memory. (2) Speculative decoding — a small “draft” model generates N candidate tokens cheaply; the large model verifies all N in parallel in a single forward pass. If K of the N are correct, you skip K decode steps, yielding a 2-3x speedup. (3) Continuous batching — instead of waiting for one request to finish before starting the next, interleave multiple requests: while request A waits for its next token, process request B. This maximizes GPU utilization.
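The KV cache figures above can be sanity-checked with simple arithmetic. This is a minimal sketch assuming Llama-2-7B-like shapes (32 layers, hidden size 4096, full multi-head attention); models using grouped-query attention store fewer KV heads and need proportionally less.

```python
# KV cache size: 2 tensors (K and V) per layer, one entry per token.
# kv_dim is the total key/value width per layer (n_kv_heads * head_dim);
# 4096 assumes full multi-head attention, as in Llama-2-7B.
def kv_cache_bytes(n_layers: int, seq_len: int, kv_dim: int, bytes_per_value: int) -> int:
    return 2 * n_layers * seq_len * kv_dim * bytes_per_value

fp16 = kv_cache_bytes(n_layers=32, seq_len=4096, kv_dim=4096, bytes_per_value=2)
int8 = kv_cache_bytes(n_layers=32, seq_len=4096, kv_dim=4096, bytes_per_value=1)

print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB")  # 2.0 GiB
print(f"INT8 KV cache: {int8 / 2**30:.1f} GiB")  # 1.0 GiB — cache quantization halves it
```

The decode phase has to stream this entire cache from GPU memory for every generated token, which is why cache quantization translates directly into decode throughput.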

Fine-Tuning vs RAG vs Prompt Engineering

Three approaches to customize LLM behavior:

(1) Prompt engineering — craft the input to guide the model output: system prompts, few-shot examples, chain-of-thought reasoning. Cost: zero training. Latency: slightly higher (longer prompts). Limitation: bounded by the context window. Best for quick customization, well-defined tasks, and cases where the model already has the knowledge.

(2) RAG (Retrieval-Augmented Generation) — retrieve relevant documents from a knowledge base and include them in the prompt. Architecture: user query -> vector search in a knowledge base (Pinecone, Weaviate, pgvector) -> retrieve top-K relevant documents -> construct prompt with query + retrieved context -> LLM generates answer grounded in the context. Cost: vector database infrastructure + embedding model. Best for knowledge that changes frequently (company docs, product catalogs), reducing hallucination (the answer is grounded in retrieved facts), and large knowledge bases that do not fit in a prompt.

(3) Fine-tuning — train the model on task-specific data. LoRA (Low-Rank Adaptation) adds small trainable matrices to the model (0.1-1% of parameters) while the base model weights stay frozen. Cost: GPU hours for training. Best for changing the model's behavior/style (tone, format, domain-specific terminology), tasks where prompt engineering is not sufficient, and cases where you have high-quality training data.

Decision: start with prompt engineering. If insufficient, add RAG. If still insufficient, fine-tune.
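The "0.1-1% of parameters" claim for LoRA follows from the low-rank factorization itself. A sketch of the arithmetic, assuming illustrative shapes (hidden size 4096, rank 8, adapters on the four attention projections of a 32-layer, 7B-parameter model — all assumptions, not a specific recipe):

```python
# LoRA: each adapted weight W (d x k) gains trainable low-rank factors
# B (d x r) and A (r x k); W itself stays frozen.
def lora_params(d: int, k: int, r: int) -> int:
    return d * r + r * k  # B plus A

d = k = 4096                      # one attention projection (assumed shape)
r = 8                             # LoRA rank (assumed)
per_matrix = lora_params(d, k, r) # 65,536 trainable vs ~16.8M frozen per matrix

n_adapted = 32 * 4                # Q/K/V/O projections across 32 layers (assumed)
total_trainable = per_matrix * n_adapted

fraction = total_trainable / 7e9  # against a 7B-parameter base model
print(f"{total_trainable:,} trainable params ({fraction:.3%} of the base model)")
# ~8.4M trainable — about 0.12% of 7B, squarely in the 0.1-1% range
```

Raising the rank or adapting more matrices moves the fraction toward the top of that range; the base model's memory footprint is unchanged either way, since only the adapters receive gradients.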

RAG Architecture in Detail

A production RAG pipeline:

(1) Document ingestion — split documents into chunks (500-1000 tokens per chunk with 100-token overlap between chunks). Generate embeddings for each chunk using an embedding model (OpenAI text-embedding-3-small, Cohere embed, or open-source models like E5). Store chunks + embeddings in a vector database (Pinecone, Weaviate, Qdrant, pgvector, or ChromaDB).

(2) Query processing — embed the user query with the same embedding model. Search the vector database for the K most similar chunks (cosine similarity); K is typically 3-10.

(3) Context construction — assemble the prompt: system instructions + retrieved chunks + user question. “Based on the following context, answer the question. Context: [chunk 1] [chunk 2] [chunk 3]. Question: {user_query}.”

(4) Generation — the LLM generates an answer grounded in the retrieved context.

(5) Post-processing — extract citations (which chunks contributed to the answer), filter low-confidence responses, and format the output.

Challenges: (1) Chunk quality — poor chunking (splitting mid-sentence, losing context) degrades retrieval quality. Use semantic chunking (split at paragraph/section boundaries). (2) Retrieval relevance — embedding similarity does not always capture semantic relevance. Hybrid search (combining vector similarity with BM25 keyword matching) improves recall. (3) Context window limits — with K=10 chunks of 500 tokens each, the retrieved context alone is 5000 tokens; adding the system prompt and question, the total may approach the model context limit. Prioritize the most relevant chunks.
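The query and context-construction steps can be sketched end to end. A real system would use a learned embedding model and a vector database; here a toy bag-of-words vector stands in for the embedding (an assumption made purely for runnability), but the cosine-similarity top-K ranking is the same operation the vector database performs.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words token counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Rank every chunk by similarity to the query; return the top K.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Refunds are processed within 5 business days of approval.",
    "Our office is closed on public holidays.",
    "To request a refund, open a support ticket with your order id.",
]
top = retrieve("how do I get a refund", chunks, k=2)

# Context construction: system instructions + retrieved chunks + question.
context = "\n".join(top)
prompt = (
    "Based on the following context, answer the question.\n"
    f"Context: {context}\n"
    "Question: how do I get a refund"
)
```

The toy embedding makes the known weakness visible: "refund" does not match "refunds", which is exactly the gap that learned embeddings and hybrid BM25 search are meant to close.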

Hallucination and Evaluation

LLMs hallucinate: they generate plausible-sounding but factually incorrect information.

Types: (1) Factual hallucination — the model states incorrect facts (“Python was created in 1989” — the first release was actually 1991). (2) Fabricated citations — the model cites papers, URLs, or quotes that do not exist. (3) Logical inconsistency — the model contradicts itself within the same response.

Mitigation: (1) RAG — ground responses in retrieved facts; the model answers from provided context, not from its parameters. (2) Citation verification — require the model to cite specific passages from the context, then verify that the cited passage actually supports the claim. (3) Temperature control — lower temperature (0.0-0.3) reduces creativity and hallucination; higher temperature (0.7-1.0) increases diversity but also hallucination risk. (4) Self-consistency — generate multiple responses and check for agreement; inconsistencies indicate hallucination.

Evaluation: (1) Automated metrics — BLEU and ROUGE (overlap with a reference), BERTScore (semantic similarity), and task-specific metrics (F1 for QA, accuracy for classification). (2) LLM-as-judge — use a stronger model (GPT-4) to evaluate responses from a weaker model on correctness, helpfulness, safety, and groundedness. (3) Human evaluation — the gold standard: have domain experts rate responses. Expensive but necessary for high-stakes applications.
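The "F1 for QA" metric mentioned above is concrete enough to show in full. This is the standard SQuAD-style token-level F1: precision and recall over the multiset of tokens shared between a predicted answer and the reference (the example strings are illustrative).

```python
from collections import Counter

def qa_f1(prediction: str, reference: str) -> float:
    # Token-level F1 in the style of SQuAD evaluation.
    pred = prediction.lower().split()
    ref = reference.lower().split()
    common = Counter(pred) & Counter(ref)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)         # how much of the prediction is right
    recall = overlap / len(ref)             # how much of the reference is covered
    return 2 * precision * recall / (precision + recall)

print(qa_f1("guido van rossum", "Guido van Rossum"))          # 1.0 — exact match
print(qa_f1("van rossum", "Guido van Rossum"))                # 0.8 — partial credit
print(qa_f1("dennis ritchie", "Guido van Rossum"))            # 0.0 — no overlap
```

Partial credit is the point: unlike exact-match accuracy, token F1 rewards an answer that captures most of the reference, which suits free-form QA outputs.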

LLM Deployment and Serving

Serving LLMs in production requires managing GPU resources efficiently:

(1) Model serving frameworks — vLLM (highest throughput, continuous batching, PagedAttention), TensorRT-LLM (NVIDIA-optimized), and Ollama (simple local deployment). vLLM's PagedAttention manages KV cache memory like OS virtual memory: pages can be non-contiguous, enabling efficient memory utilization across concurrent requests.

(2) Quantization — reduce model precision for faster inference and lower memory. FP16 -> INT8 gives a 2x memory reduction with minimal quality loss; FP16 -> INT4 gives a 4x reduction with some quality loss for small models, acceptable for large ones. GPTQ, AWQ, and GGUF are popular quantization methods. A 70B model at FP16 requires 140 GB of VRAM (2x A100 80GB); at INT4 it needs 35 GB and fits on a single A100.

(3) Horizontal scaling — multiple GPU servers behind a load balancer. Route requests based on model and available capacity.

(4) Streaming — stream tokens to the client as they are generated (Server-Sent Events). The user sees the response forming in real time instead of waiting for the complete generation.

(5) Caching — cache responses for repeated prompts, either by exact match or by embedding similarity (semantic caching). Cache common prefixes (system prompts) in the KV cache across requests (prompt caching / prefix caching).

(6) Cost management — LLM inference is expensive ($0.01-$0.10 per request for large models). Monitor token usage, set budget alerts, and use smaller models (7B-13B) for simple tasks. Route complex queries to large models (70B+) and simple ones to small models (smart routing).
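The quantization VRAM figures above reduce to one formula: parameter count times bits per weight. A minimal sketch (weights only — the KV cache, activations, and framework overhead add more on top, which this deliberately ignores):

```python
# Memory for model weights alone at a given precision.
def weight_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9  # decimal GB, matching quoted VRAM figures

for bits in (16, 8, 4):
    print(f"70B at {bits:>2}-bit: {weight_gb(70e9, bits):.0f} GB")
# 70B at 16-bit: 140 GB -> needs 2x A100 80GB
# 70B at  8-bit:  70 GB -> fits one A100 80GB
# 70B at  4-bit:  35 GB -> fits one A100 80GB with headroom for the KV cache
```

The headroom matters: a model that exactly fills VRAM leaves no room for the per-request KV cache, so serving capacity depends on the gap between weight memory and total GPU memory.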
