Introduction
Serving large language models (GPT-4, Llama) requires specialized infrastructure for low latency, high throughput, and GPU memory management. The key challenges are large model size (7B to 70B+ parameters) and the sequential nature of token generation, which limits parallelism at the request level.
Model Loading and GPU Memory
Model weights are loaded from storage into GPU VRAM at startup. A 7B parameter model in float16 requires approximately 14GB of VRAM. A 70B model requires approximately 140GB, necessitating multiple GPUs. The model is loaded once and kept resident in GPU memory; all concurrent requests share the same read-only weights. The VRAM budget must cover weights, KV cache, and activation memory.
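The sizing arithmetic above can be sketched as a small helper. This is an illustrative estimate only (weights alone, ignoring KV cache and activation memory); the function name and defaults are assumptions, not a real library API.

```python
def weight_vram_gb(num_params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM needed for model weights alone.

    bytes_per_param: 2 for float16/bfloat16, 1 for INT8, 0.5 for INT4.
    Ignores KV cache and activation memory, which must also fit in VRAM.
    """
    return num_params_billion * 1e9 * bytes_per_param / 1e9

print(weight_vram_gb(7))    # 14.0  -> ~14 GB for a 7B model in float16
print(weight_vram_gb(70))   # 140.0 -> ~140 GB for a 70B model
```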
KV Cache
During autoregressive generation, each new token attends to all previous tokens. Without caching, the key and value tensors for all prior tokens are recomputed at every step, so generating token N costs O(N^2). The KV cache stores the K and V tensors for each layer at each already-generated position. Subsequent steps compute Q, K, and V only for the new position, append the new K and V to the cache, and reuse the cached tensors for attention. This reduces per-token compute from O(N^2) to O(N). KV cache size grows linearly with sequence length.
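The linear growth of the cache can be made concrete with a sizing formula: two tensors (K and V) per layer per position, each of size num_kv_heads x head_dim. The model dimensions below are illustrative values resembling a 7B-class model, not figures from the text.

```python
def kv_cache_gb(seq_len: int, num_layers: int, num_kv_heads: int,
                head_dim: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer per cached position."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem * batch / 1e9

# Assumed 7B-class dimensions: 32 layers, 32 KV heads of dim 128, float16.
print(kv_cache_gb(seq_len=4096, num_layers=32, num_kv_heads=32, head_dim=128))
# ~2.15 GB per 4096-token sequence -- and it scales linearly with batch size.
```

Note that this is per sequence: a batch of 16 such requests needs roughly 34 GB of cache on top of the weights, which is why cache memory management dominates serving design.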
Continuous Batching
Naive (static) batching runs a fixed batch until every sequence finishes: slots freed by sequences that complete early sit idle while the GPU waits for the longest one. Continuous batching (iteration-level scheduling) checks after each token generation step whether any sequence is complete; completed sequences are evicted from the batch and new requests are inserted immediately, maximizing GPU utilization. PagedAttention (used in vLLM) stores the KV cache in non-contiguous fixed-size pages, similar to OS virtual memory, reducing fragmentation and enabling memory sharing across requests.
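The iteration-level scheduling loop can be sketched in a few lines. This toy model replaces real decoding with a per-request token budget; the function and variable names are invented for illustration.

```python
from collections import deque

def continuous_batching(requests: dict, max_batch: int = 4) -> list:
    """Iteration-level scheduling sketch.

    `requests` maps request id -> number of tokens left to generate
    (a toy stand-in for real decoding). Returns the batch composition
    at each step, to show slots being refilled immediately.
    """
    waiting = deque(requests.items())
    running = {}   # request id -> tokens remaining
    trace = []
    while waiting or running:
        # Admit waiting requests into free slots -- this happens every
        # iteration, not only when the whole batch drains (the key idea).
        while waiting and len(running) < max_batch:
            rid, toks = waiting.popleft()
            running[rid] = toks
        trace.append(sorted(running))
        # One generation step: each running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # evict the finished sequence now
    return trace

trace = continuous_batching({"a": 1, "b": 3, "c": 2, "d": 2, "e": 1}, max_batch=3)
print(trace)  # [['a', 'b', 'c'], ['b', 'c', 'd'], ['b', 'd', 'e']]
```

When "a" finishes after one step, "d" takes its slot on the very next iteration instead of waiting for "b" and "c" to drain, which is exactly the utilization win over static batching.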
Tensor Parallelism
Tensor parallelism splits the model across multiple GPUs: each GPU holds a slice of each weight matrix, computes a partial result, and an all-reduce operation synchronizes the results. Pipeline parallelism places different layers on different GPUs and uses micro-batching to hide the pipeline bubble. Tensor parallelism is preferred for latency-sensitive, small-batch requests. Pipeline parallelism is better suited for throughput-optimized serving with larger batches.
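The all-reduce step can be demonstrated with a pure-Python row-parallel matrix multiply simulating two GPUs. This is a minimal sketch of the math only; real implementations shard tensors on-device and use NCCL-style collectives.

```python
def matmul(x, w):
    """x: length-n vector, w: n x m matrix (plain lists of lists)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)

# Row-parallel split: each simulated "GPU" holds a horizontal slice of w
# and the matching slice of x, producing a partial result.
partial0 = matmul(x[:1], w[:1])   # "GPU 0"
partial1 = matmul(x[1:], w[1:])   # "GPU 1"

# The all-reduce is an element-wise sum of the partial results.
reduced = [a + b for a, b in zip(partial0, partial1)]

print(reduced == full)  # True
```

The communication cost of that sum at every sharded layer is why tensor parallelism wants fast interconnects (e.g., NVLink) and is typically kept within a single node.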
Quantization
Quantization reduces model size and increases throughput by using lower-precision weights. INT8 quantization achieves a 50% size reduction relative to float16 with minimal accuracy loss. INT4 quantization (GPTQ, AWQ) achieves a 75% reduction with a small accuracy loss. Weight-only quantization stores weights in INT4 and dequantizes them to float16 on the fly for computation. This enables serving a 70B model on two 40GB GPUs instead of four.
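The quantize/dequantize round trip can be sketched with symmetric per-tensor INT8 quantization. This illustrates the principle only; production schemes (GPTQ, AWQ) use per-group scales and calibration, and the function names here are invented.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: map the float range onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for computation (weight-only scheme).
    return [v * scale for v in q]

w = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(w)
print(q)                      # [50, -127, 3, 100] -- stored as 1 byte each
w_hat = dequantize(q, scale)  # close to w, up to rounding error
```

Only the INT8 values and one float scale are stored, giving the ~50% saving over float16; INT4 packs two values per byte for the ~75% figure, at the cost of coarser rounding.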
Streaming Token Generation
Rather than waiting for the full response, clients receive tokens as they are generated. The server sends tokens over SSE or a WebSocket stream, emitting each token immediately upon generation. This makes time-to-first-token, rather than total generation time, the latency the user perceives, which greatly improves UX in chatbots and coding assistants. Stream cancellation is handled when the client disconnects (e.g., the user navigates away): generation stops and GPU resources are freed.
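The SSE wire format is simple enough to sketch as a generator that wraps a token stream. The `[DONE]` sentinel is a common convention (popularized by OpenAI-style APIs), not part of the SSE standard itself; a real server would write each frame to the HTTP response as it is yielded.

```python
def sse_stream(token_iter):
    """Wrap generated tokens as Server-Sent Events frames.

    Each SSE event is a `data:` line terminated by a blank line.
    """
    for tok in token_iter:
        yield f"data: {tok}\n\n"   # flush one event per token
    yield "data: [DONE]\n\n"       # end-of-stream sentinel (convention)

frames = list(sse_stream(["Hel", "lo", "!"]))
print(frames[0])   # data: Hel
```

Because this is a generator, client disconnection maps naturally onto cancellation: when the server closes the generator, the decoding loop behind `token_iter` can stop and release its batch slot and KV cache.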
Auto-Scaling
GPU nodes are scaled based on queue depth and request rate. New node startup takes 2 to 5 minutes due to GPU allocation and model loading, so pre-warming keeps a pool of warm nodes sized to predicted traffic patterns. During off-hours the fleet can scale down to a minimum of one retained node; true scale-to-zero saves more but incurs the full cold-start delay on the next request. A router distributes requests across nodes and tracks per-node capacity. Spot or preemptible GPUs reduce cost; checkpointing handles unexpected preemptions.
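A queue-depth-based scaling decision can be sketched as follows. The policy, function name, and capacity numbers are assumptions for illustration; real autoscalers also smooth over time windows to avoid flapping.

```python
import math

def desired_nodes(queue_depth: int, per_node_capacity: int,
                  min_nodes: int = 1, max_nodes: int = 8) -> int:
    """Enough nodes to drain the queue, clamped to [min_nodes, max_nodes].

    min_nodes=1 keeps one warm node (the one-node floor described above),
    avoiding a 2-5 minute cold start on the first request after idle.
    """
    needed = math.ceil(queue_depth / per_node_capacity) if queue_depth else 0
    return max(min_nodes, min(max_nodes, needed))

print(desired_nodes(0, 32))      # 1 -- floor of one warm node
print(desired_nodes(100, 32))    # 4
print(desired_nodes(10000, 32))  # 8 -- capped; excess requests queue
```

The `max_nodes` cap is what makes the router's queue and per-node capacity tracking essential: beyond the cap, the system sheds load by queueing rather than scaling.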