AI/ML Interview: Large Language Models — LLM Inference, Fine-Tuning, RAG, Prompt Engineering, Hallucination, Deployment

Large Language Models (LLMs) like GPT-4, Claude, and LLaMA have transformed software engineering. Understanding how to deploy, fine-tune, and evaluate LLMs is now a critical skill tested in ML engineering interviews. This guide covers the practical aspects of working with LLMs — from inference optimization to RAG pipelines to production deployment.

LLM Inference: How Generation Works

Text generation is autoregressive: the model generates one token at a time, using all previous tokens as context. Generation has two phases. (1) Prefill: process the entire input prompt in parallel. Self-attention processes all input tokens simultaneously, so this phase is fully parallelizable on the GPU. Its output is the KV cache (the key and value vectors for each layer and each input token). (2) Decode: generate output tokens one at a time. Each new token attends to all previous tokens via the KV cache, so the keys and values of earlier tokens are never recomputed. The new token's K/V is appended to the cache. This phase is sequential and memory-bound (limited by GPU memory bandwidth, not compute).

Why inference is slow: the decode phase generates tokens sequentially, so a 500-token response requires 500 forward passes. Each pass reads the model weights and the entire KV cache from GPU memory. For a 7B-parameter model with 32 layers and a 4096-token context (assuming a hidden size of 4096 and standard multi-head attention), the weights are about 14 GB and a full KV cache is about 2 GB in FP16. Streaming that much data from GPU memory 500 times is the bottleneck, not the matrix multiplications.

Optimizations: (1) KV cache quantization: store the cache in FP8 or INT8 instead of FP16, halving its memory. (2) Speculative decoding: a small "draft" model generates N candidate tokens cheaply, and the large model verifies all N in parallel in a single forward pass. If the first K of the N candidates are accepted, those K tokens cost one large-model pass instead of K. Speedups of 2-3x are commonly reported. (3) Continuous batching: instead of waiting for a whole batch of requests to finish before starting the next, schedule at the level of a single decode step. Finished requests leave the batch and waiting requests join it immediately, which keeps GPU utilization high.
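The KV cache size quoted for a model like this can be checked with a few lines of arithmetic. The sketch below assumes standard multi-head attention with a hidden size of 4096 (typical for a 7B model, but not stated in the text) and batch size 1; the function names and the bandwidth argument are illustrative, not taken from any library. Under these assumptions the result depends on precision: about 2 GiB at two bytes per value (FP16) and 1 GiB at one byte per value (FP8/INT8).

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, bytes_per_value=2, batch_size=1):
    """Bytes needed for the KV cache.

    Each token stores one key vector and one value vector (hence the factor 2)
    of length hidden_size in every layer.
    """
    return 2 * num_layers * hidden_size * seq_len * bytes_per_value * batch_size


def min_decode_step_seconds(bytes_read, bandwidth_bytes_per_s):
    """Lower bound on one decode step if it only had to stream bytes_read from memory."""
    return bytes_read / bandwidth_bytes_per_s


GIB = 2**30

fp16_cache = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=4096, bytes_per_value=2)
fp8_cache = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=4096, bytes_per_value=1)

print(fp16_cache / GIB)  # 2.0
print(fp8_cache / GIB)   # 1.0
```

Dividing the bytes read per step (weights plus cache) by the GPU's memory bandwidth with `min_decode_step_seconds` gives a rough floor on per-token latency, which is why decode is described as memory-bound.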

Fine-Tuning vs RAG vs Prompt Engineering

There are three approaches to customizing LLM behavior.

(1) Prompt engineering: craft the input to guide the model's output, using system prompts, few-shot examples, and chain-of-thought reasoning. Cost: zero training. Latency: slightly higher (longer prompts). Limitation: bounded by the context window. Best for: quick customization, well-defined tasks, and cases where the model already has the knowledge.

(2) RAG (Retrieval-Augmented Generation): retrieve relevant documents from a knowledge base and include them in the prompt. Architecture: user query -> vector search in a knowledge base (Pinecone, Weaviate, pgvector) -> retrieve top-K relevant documents -> construct prompt with query + retrieved context -> LLM generates an answer grounded in the context. Cost: vector database infrastructure + embedding model. Best for: knowledge that changes frequently (company docs, product catalogs), reducing hallucination (the answer is grounded in retrieved facts), and large knowledge bases that do not fit in a prompt.

(3) Fine-tuning: train the model on task-specific data. With LoRA (Low-Rank Adaptation), small trainable low-rank matrices are added to the model (typically 0.1-1% of its parameters) while the base model weights stay frozen. Cost: GPU hours for training. Best for: changing the model's behavior or style (tone, format, domain-specific terminology), tasks where prompt engineering is not sufficient, and cases where you have high-quality training data.

Decision: start with prompt engineering. If that is insufficient, add RAG. If that is still insufficient, fine-tune.
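To see why LoRA trains such a small share of the parameters, it helps to count them for a single weight matrix. The sketch below assumes the usual LoRA formulation, in which a frozen d_out x d_in matrix is augmented by the product of two trainable matrices of shapes d_out x r and r x d_in. The 4096 x 4096 shape and rank 8 are illustrative choices, and the function name is hypothetical.

```python
def lora_param_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters as a fraction of the frozen matrix they adapt.

    The frozen weight has d_out * d_in entries. LoRA adds B (d_out x rank)
    and A (rank x d_in), so rank * (d_out + d_in) trainable entries.
    """
    frozen = d_out * d_in
    trainable = rank * (d_out + d_in)
    return trainable / frozen


fraction = lora_param_fraction(d_out=4096, d_in=4096, rank=8)
print(f"{fraction:.4%}")  # 0.3906%
```

This is the ratio for one adapted matrix. The share of the whole model is usually lower still, because adapters are commonly attached to only some of the weight matrices.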

RAG Architecture in Detail

A production RAG pipeline has five stages. (1) Document ingestion: split documents into chunks (500-1000 tokens per chunk with a 100-token overlap between adjacent chunks). Generate an embedding for each chunk using an embedding model (OpenAI text-embedding-3-small, Cohere embed, or open-source models like E5). Store chunks + embeddings in a vector database (Pinecone, Weaviate, Qdrant, pgvector, or ChromaDB). (2) Query processing: embed the user query with the same embedding model, then search the vector database for the K most similar chunks (by cosine similarity). K is typically 3-10. (3) Context construction: assemble the prompt from system instructions + retrieved chunks + user question. "Based on the following context, answer the question. Context: [chunk 1] [chunk 2] [chunk 3]. Question: {user_query}." (4) Generation: the LLM generates an answer grounded in the retrieved context. (5) Post-processing: extract citations (which chunks contributed to the answer), filter low-confidence responses, and format the output.

Challenges: (1) Chunk quality: poor chunking (splitting mid-sentence, losing context) degrades retrieval quality. Prefer semantic chunking (split at paragraph or section boundaries). (2) Retrieval relevance: embedding similarity does not always surface the chunks that actually answer the query, for example when the query hinges on an exact identifier or a rare term. Hybrid search (combining vector similarity with BM25 keyword matching) often improves recall. (3) Context window limits: K=10 chunks of 500 tokens is 5000 tokens of context, and with the system prompt and question added, the total may approach the model's context limit. Prioritize the most relevant chunks.
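The ingestion, retrieval, and context-construction steps can be sketched end to end in plain Python. This is a toy under loud assumptions: whitespace-separated words stand in for tokens, a bag-of-words count vector stands in for a real embedding model, and an in-memory list stands in for the vector database. All function names are hypothetical.

```python
import math
from collections import Counter


def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into chunks that share `overlap` tokens with their neighbour."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


def embed(text):
    """Toy embedding: word counts. A real pipeline would call an embedding model here."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=3):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, retrieved):
    """Assemble instructions + numbered context chunks + the user question."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return (
        "Based on the following context, answer the question.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


docs = [
    "the cat sat on the mat",
    "vector databases store embeddings",
    "paris is the capital of france",
]
question = "what is the capital of france"
print(build_prompt(question, retrieve(question, docs, k=1)))
```

Numbering the chunks in the prompt makes the later citation step easier, since the model can refer to `[1]`, `[2]`, and so on.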

Hallucination and Evaluation

LLMs hallucinate: they generate plausible-sounding but factually incorrect information. Types: (1) Factual hallucination: the model states incorrect facts ("Python was first released in 1989", when the first release was actually in 1991). (2) Fabricated citations: the model cites papers, URLs, or quotes that do not exist. (3) Logical inconsistency: the model contradicts itself within the same response.

Mitigation: (1) RAG: ground responses in retrieved facts, so the model answers from the provided context rather than from its parameters. (2) Citation verification: require the model to cite specific passages from the context, then verify that each cited passage actually supports the claim. (3) Temperature control: a lower temperature (0.0-0.3) makes output more deterministic and tends to reduce hallucination. A higher temperature (0.7-1.0) increases diversity but also hallucination risk. (4) Self-consistency: generate multiple responses and check for agreement. Disagreement between samples is a signal of possible hallucination.

Evaluation: (1) Automated metrics: BLEU and ROUGE (overlap with a reference), BERTScore (semantic similarity), and task-specific metrics (F1 for QA, accuracy for classification). (2) LLM-as-judge: use a stronger model (e.g., GPT-4) to evaluate responses from a weaker model on correctness, helpfulness, safety, and groundedness. (3) Human evaluation: the gold standard. Have domain experts rate responses. It is expensive but necessary for high-stakes applications.
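Two of the mitigations, self-consistency and citation verification, reduce to simple checks once the model outputs are in hand. The sketch below is a minimal illustration with hypothetical function names and a hypothetical 0.6 agreement threshold. The sampled answers are hard-coded stand-ins for multiple LLM calls. The citation check only confirms that a quoted passage appears verbatim in the context; it does not establish that the passage supports the claim.

```python
from collections import Counter


def self_consistency(answers, min_agreement=0.6):
    """Majority vote over sampled answers.

    Returns (most common answer, share of samples that agree, passes threshold).
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return top, agreement, agreement >= min_agreement


def citation_in_context(quote, context):
    """True if the quoted passage occurs in the context, ignoring case and whitespace runs."""
    def squash(s):
        return " ".join(s.lower().split())
    return squash(quote) in squash(context)


# Stand-ins for four sampled responses to "When was Python first released?"
samples = ["1991", "1991 ", "1989", "1991"]
print(self_consistency(samples))  # ('1991', 0.75, True)

context = "Python was  first released\nin 1991."
print(citation_in_context("first released in 1991", context))  # True
print(citation_in_context("first released in 1989", context))  # False
```

In practice free-form answers rarely match as exact strings, so a real check would compare extracted final answers or use a similarity measure rather than string equality.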

LLM Deployment and Serving

Serving LLMs in production requires managing GPU resources efficiently.

(1) Model serving frameworks: vLLM (high throughput, continuous batching, PagedAttention), TensorRT-LLM (NVIDIA-optimized), and Ollama (simple local deployment). vLLM's PagedAttention manages KV cache memory like OS virtual memory: the cache is split into pages that need not be contiguous, which enables efficient memory utilization across concurrent requests.

(2) Quantization: reduce model precision for faster inference and lower memory. FP16 -> INT8 gives a 2x memory reduction with minimal quality loss. FP16 -> INT4 gives a 4x reduction, with some quality loss for small models that is usually acceptable for large models. GPTQ and AWQ are popular quantization methods, and GGUF is a file format for quantized models used by llama.cpp. A 70B model at FP16 requires 140 GB of VRAM for the weights alone (2x A100 80GB). At INT4 the weights take 35 GB (fits on a single A100). The KV cache and activations need additional memory on top of this.

(3) Horizontal scaling: run multiple GPU servers behind a load balancer and route requests based on model and available capacity.

(4) Streaming: stream tokens to the client as they are generated (for example with Server-Sent Events). The user sees the response forming in real time instead of waiting for the complete generation.

(5) Caching: cache exact prompt-response pairs for repeated queries, or use a semantic cache that also matches queries with similar embeddings. Reuse the KV cache for common prefixes such as system prompts across requests (prompt caching / prefix caching).

(6) Cost management: LLM inference is expensive (roughly $0.01-$0.10 per request for large models). Monitor token usage, set budget alerts, and use smaller models (7B-13B) for simple tasks. Route complex queries to large models (70B+) and simple ones to small models (smart routing).
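The memory figures for quantization follow directly from parameter count times bits per parameter. The sketch below reproduces that arithmetic for model weights only; it deliberately ignores the KV cache, activations, and framework overhead, so real deployments need headroom beyond these numbers. The function names are hypothetical, and decimal gigabytes (10^9 bytes) are assumed.

```python
import math


def weight_memory_gb(params_billion, bits_per_param):
    """Weight memory in decimal GB: billions of parameters * bytes per parameter."""
    return params_billion * bits_per_param / 8


def gpus_needed(memory_gb, gpu_memory_gb=80):
    """Minimum number of GPUs whose combined memory can hold memory_gb."""
    return math.ceil(memory_gb / gpu_memory_gb)


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = weight_memory_gb(70, bits)
    print(f"70B at {name}: {gb:.0f} GB of weights, at least {gpus_needed(gb)} x 80 GB GPU(s)")
# 70B at FP16: 140 GB of weights, at least 2 x 80 GB GPU(s)
# 70B at INT8: 70 GB of weights, at least 1 x 80 GB GPU(s)
# 70B at INT4: 35 GB of weights, at least 1 x 80 GB GPU(s)
```

The INT8 case shows why the weights-only figure is a floor: 70 GB nominally fits on one 80 GB GPU, but leaves little room for the KV cache of concurrent requests.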
