Design an LLM inference API — the service that accepts user prompts and returns model completions, like the OpenAI API, Anthropic’s Claude API, or an internal LLM serving layer. This is a 2026-era system design question that combines classic distributed systems with the specific challenges of serving large language models at scale.
Requirements Clarification
- API: POST /v1/completions with {model, messages, max_tokens, temperature, stream}. Support streaming (SSE) and batch responses.
- Models: Multiple model sizes (7B, 70B, 405B parameters). Different latency and cost profiles. Route requests to the appropriate model.
- Scale: 10M API calls/day, peak 1,000 requests/sec. Average response: 500 tokens at ~30 tokens/sec → ~17 seconds streaming duration.
- Latency targets: Time to first token (TTFT) <500ms. Generation speed ≥ 30 tokens/sec (feels real-time to users).
- Cost: GPU compute is expensive. Maximize GPU utilization, minimize wasted capacity.
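These requirements imply a surprisingly large number of simultaneously open streams. A quick Little's-law sketch (names and constants taken from the numbers above; this is back-of-envelope, not a capacity plan):

```python
# Back-of-envelope capacity math from the stated requirements.
CALLS_PER_DAY = 10_000_000
PEAK_RPS = 1_000
AVG_OUTPUT_TOKENS = 500
TOKENS_PER_SEC = 30

avg_rps = CALLS_PER_DAY / 86_400                 # sustained arrival rate, ~116 req/s
stream_seconds = AVG_OUTPUT_TOKENS / TOKENS_PER_SEC  # ~16.7 s per streamed response

# Little's law: concurrency = arrival rate x residence time.
peak_concurrent = PEAK_RPS * stream_seconds      # ~16,700 in-flight streams at peak
```

Roughly 17,000 concurrent streams at peak is the number the batching and memory-management machinery below has to absorb.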
Why LLM Serving Is Different
Serving a 70B parameter model is nothing like serving a REST API. Key differences:
- Model size: A 70B model in FP16 = 140GB. Doesn’t fit on one A100 GPU (80GB). Requires tensor parallelism across 2–4 GPUs.
- Stateful computation: Each generated token depends on all previous tokens. Unlike stateless REST handlers, each request has growing state (the KV cache).
- Variable output length: A request for “write me a poem” might produce 50 tokens. A request for “summarize this legal contract” might produce 2,000. You can’t predict upfront.
- Memory bottleneck: LLM inference (the decode phase) is memory-bandwidth-bound, not compute-bound: every generated token requires streaming the model weights from HBM. A100 80GB HBM bandwidth: ~2 TB/s. The bottleneck is loading weights, not the FLOPS.
The KV Cache
During autoregressive generation, the attention mechanism must attend to all previous tokens. Recomputing attention over the full prefix at every step would make each generation step O(n²). The KV cache solves this: store the Key and Value tensors from each attention layer for every processed token. Generating the next token then only requires computing Q, K, V for that one new token and attending to the cached K, V — O(n) attention work per step after the prefill phase, instead of O(n²).
Request lifecycle:
1. Prefill phase: process the entire prompt in one parallel forward pass
→ Compute and cache K, V for all prompt tokens
→ Time: proportional to prompt length (fast, uses full GPU parallelism)
2. Decode phase: generate tokens one at a time
→ Each step: compute Q for new token, attend to KV cache, sample next token
→ Time: proportional to output length × per-token latency (~30ms/token on A100)
KV cache memory per request: 2 (K and V) × layers × d_model × seq_len × bytes_per_element
LLaMA-3 70B with full multi-head attention: 2 × 80 layers × 8192 dim × 4096 tokens × 2 bytes ≈ 10.7GB per request. (In practice LLaMA-3 uses grouped-query attention with 8 KV heads × 128 dims, cutting this to ~1.3GB — but the full-attention number shows the scale of the problem.)
KV cache memory is the primary constraint on how many requests can run concurrently. Sharding the 70B model across four 40GB A100s leaves ~35GB of weights per GPU and only ~5GB free for each GPU's slice of KV cache — room for just a few full-context requests. This is why batching and memory management are critical.
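The formula above is easy to turn into a small calculator. A sketch (the full-attention figure assumes d_model is the cached dimension; with grouped-query attention, only n_kv_heads × head_dim is cached per layer):

```python
def kv_cache_bytes(n_layers: int, d_kv: int, seq_len: int,
                   bytes_per_elem: int = 2) -> int:
    """Per-request KV cache: 2 (K and V) x layers x cached-dim x seq_len x bytes.

    d_kv is d_model for full multi-head attention, or
    n_kv_heads * head_dim under grouped-query attention (GQA).
    """
    return 2 * n_layers * d_kv * seq_len * bytes_per_elem

# LLaMA-3-70B-shaped model at 4096-token context, FP16:
full_mha_gb = kv_cache_bytes(80, 8192, 4096) / 1e9       # ~10.7 GB
gqa_gb = kv_cache_bytes(80, 8 * 128, 4096) / 1e9         # ~1.3 GB with 8 KV heads
```

The two orders-of-magnitude-apart answers to the same question are why GQA and PagedAttention both exist: the cache, not the weights, decides the batch size.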
Continuous Batching (PagedAttention / vLLM)
Naive static batching: batch 8 requests, wait for all to finish before starting the next batch. Problem: one long request (2,000 tokens) blocks 7 short requests that finished at token 100.
Continuous batching (vLLM, TGI): after each token generation step, finished requests leave the batch and new requests enter. The GPU is never idle waiting for laggards. Throughput improvement: 10–20× over static batching.
PagedAttention (vLLM’s memory management): instead of allocating a contiguous block of GPU memory for each request’s KV cache (wasteful, causes fragmentation), divide KV cache into fixed-size “pages” and allocate non-contiguously. Physical pages are mapped to logical sequence positions via a page table — identical to OS virtual memory. Enables fine-grained memory management, allows more requests to fit simultaneously.
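A toy sketch of the continuous-batching idea (not vLLM's actual scheduler, which also budgets KV pages and preempts requests): after every decode step, finished sequences free their batch slot and queued requests fill it immediately:

```python
from collections import deque

class ContinuousBatcher:
    """Toy continuous-batching loop: sequences join and leave the batch
    at token boundaries instead of waiting for a whole batch to drain."""

    def __init__(self, max_batch: int):
        self.max_batch = max_batch
        self.queue = deque()    # waiting (request_id, output_len) pairs
        self.running = []       # in-flight (request_id, tokens_left) pairs

    def submit(self, request_id, output_len):
        self.queue.append((request_id, output_len))

    def step(self):
        """One decode iteration; returns the ids that finished this step."""
        # Admit queued requests into any free batch slots.
        while self.queue and len(self.running) < self.max_batch:
            self.running.append(self.queue.popleft())
        # "Generate" one token for every running request.
        self.running = [(rid, left - 1) for rid, left in self.running]
        done = [rid for rid, left in self.running if left == 0]
        self.running = [(rid, left) for rid, left in self.running if left > 0]
        return done
```

With a batch size of 2 and requests of length 5, 1, and 2, the short request finishes after one step and the queued request takes its slot mid-flight — exactly the behavior static batching forbids.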
Architecture
Client → API Gateway (rate limiting, auth, routing)
↓
Request Queue (Redis or internal priority queue)
↓
Load Balancer → Inference Worker Pool
├─ Small model servers (7B, 2×A100)
├─ Medium model servers (70B, 4×A100)
└─ Large model servers (405B, 8×A100, tensor parallel)
Inference Worker (vLLM):
Continuous batching scheduler
KV cache (PagedAttention)
CUDA kernel execution
SSE/streaming output → API Gateway → Client
Model Routing
Not every request needs a 70B model. Route based on complexity signals:
- Explicit routing: Caller specifies model (GPT-4o vs GPT-4o-mini). Simple, caller controls cost.
- Implicit routing: A small classifier (or the small model itself) estimates task complexity. Simple Q&A → 7B model. Code generation, reasoning → 70B. Saves cost without exposing routing logic to callers.
- Cascade routing: Try 7B first. If confidence is low (e.g., high entropy of the next-token distributions, or low sequence log-probability), re-route to 70B. Higher latency on escalated requests, but the lowest average cost.
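A minimal sketch of the cascade decision; the entropy threshold, the mean-over-steps aggregation, and the function names are illustrative choices, not a standard API:

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cascade_route(step_distributions, entropy_threshold=1.0):
    """Hypothetical cascade rule: keep the 7B model's answer only if its
    decoding steps were confident on average; otherwise escalate to 70B."""
    mean_entropy = (sum(token_entropy(d) for d in step_distributions)
                    / len(step_distributions))
    return "7B" if mean_entropy < entropy_threshold else "70B"
```

A peaked distribution (one token near probability 1) routes to the 7B model; a near-uniform one escalates. Tuning the threshold trades answer quality against GPU cost.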
Streaming Responses (SSE)
Users expect streaming — seeing tokens appear as they’re generated, not waiting 17 seconds for the complete response.
# Server-Sent Events: each token pushed as it's generated
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
data: {"token": "The", "index": 0}

data: {"token": " capital", "index": 1}

data: {"token": " of", "index": 2}

data: [DONE]
The inference worker streams tokens as they’re generated. The API gateway passes them through to the client via SSE (Server-Sent Events) or WebSocket. Each token appears in the user’s interface in real time — 30 tokens/sec feels like fast typing.
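Generating those frames is mechanical. A sketch of the gateway-side formatting (SSE events are `data:` lines terminated by a blank line; the `[DONE]` sentinel follows the OpenAI convention):

```python
import json

def sse_frames(token_stream):
    """Format an iterable of generated token strings as SSE frames.

    Each event is a 'data:' line followed by a blank line; a [DONE]
    sentinel tells the client the stream is complete.
    """
    for i, tok in enumerate(token_stream):
        yield f'data: {json.dumps({"token": tok, "index": i})}\n\n'
    yield "data: [DONE]\n\n"
```

Because this is a generator, the gateway can forward each frame the moment the inference worker emits a token, with no buffering of the full response.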
Prompt Caching
Many API calls share a long system prompt (e.g., “You are a helpful customer service agent for Acme Corp…” followed by different user messages). Recomputing the KV cache for the identical system prompt on every request wastes GPU cycles.
Prefix caching: Hash the prompt prefix. If an identical prefix was recently processed, reuse the stored KV cache for that prefix. Only compute attention for the new tokens appended to the prefix. Anthropic, OpenAI, and Google all offer prefix caching for long system prompts. Cache hit rate on production traffic: 40–70%.
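The lookup itself is a hash-table check. A toy sketch (the handle strings stand in for real KV-cache page references, and real systems also evict by recency and match partial prefixes at page granularity):

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: map a hash of the prompt-prefix token ids to a
    (notional) KV-cache handle so identical system prompts skip prefill."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, prefix_tokens):
        key = hashlib.sha256(
            " ".join(map(str, prefix_tokens)).encode()).hexdigest()
        if key in self.store:
            self.hits += 1
            return self.store[key]     # reuse the cached KV pages
        self.misses += 1
        # Miss: caller runs prefill; record a placeholder handle for next time.
        self.store[key] = f"kv-handle-{len(self.store)}"
        return None
```

Two requests sharing the same 2,000-token system prompt pay the prefill cost once; only the differing user message is computed on the second call.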
Quantization and Speculative Decoding
Quantization: Reduce weight precision from FP16 (2 bytes/param) to INT8 (1 byte) or INT4 (0.5 bytes). A 70B model drops from 140GB to 35GB in INT4, fitting on a single A100. Quality loss: 1–3% on benchmarks. Throughput gain: 2–4× (lower memory bandwidth requirement). Production standard: GPTQ or AWQ quantization for serving.
Speculative decoding: Use a small draft model (7B) to propose K tokens speculatively, then verify all K tokens with the large model in one forward pass (parallel verification is cheap). Matching tokens are accepted; at the first mismatch, the large model's own token is emitted and drafting resumes from there. Speedup: 2–3× for tasks where the draft model is usually right (common English text, simple code). No quality loss — with the standard rejection-sampling acceptance rule, the output distribution is identical to sampling from the large model alone.
Rate Limiting and Queuing
LLM inference is expensive — a single request can run for 20+ seconds. Two layers of rate limiting:
- Token-based rate limits: Limit requests/min and tokens/min per API key. Tokens matter more than requests (a 4,096-token request costs 8× a 512-token request).
- Request queue: When all inference workers are at capacity, requests queue rather than fail. Priority queue: paid tier jumps the queue. Queue depth monitoring triggers auto-scaling of new inference instances.
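Token-based limiting is a token-bucket metered in LLM tokens rather than requests. A single-process sketch (production gateways typically keep the bucket state in Redis so all replicas share one budget per API key):

```python
import time

class TokenBucket:
    """Per-API-key tokens-per-minute limiter (sketch).

    Charges by token count, not request count: a 4,096-token request
    drains 8x the budget of a 512-token one.
    """

    def __init__(self, tokens_per_min, start=None):
        self.rate = tokens_per_min / 60.0      # refill rate per second
        self.capacity = float(tokens_per_min)
        self.level = float(tokens_per_min)
        self.last = time.monotonic() if start is None else start

    def allow(self, n_tokens, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity.
        self.level = min(self.capacity,
                         self.level + (now - self.last) * self.rate)
        self.last = now
        if n_tokens <= self.level:
            self.level -= n_tokens
            return True
        return False     # over budget: reject, or enqueue behind paid tier
```

A 6,000 tokens/min key admits one 4,096-token request immediately, rejects a second back-to-back one, and admits it again after the bucket refills.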
Interview Follow-ups
- How would you handle a request that exceeds the model’s context window (128K tokens)?
- How do you auto-scale GPU instances in response to demand spikes? (GPU cold start is slow — 5–10 min to load model weights.)
- How would you implement fine-tuned model serving — each customer has their own LoRA adapter on the same base model?
- How do you enforce content safety filtering without adding significant latency?
- How do you build observability for LLM inference — what metrics matter beyond standard p50/p99 latency?
Related System Design Topics
- Load Balancing — request routing across GPU pods; prefix-aware routing sends requests with identical system prompts to the same pod for KV cache reuse
- Caching Strategies — KV cache (key-value attention tensors) is the LLM-specific cache; PagedAttention manages it like virtual memory paging
- Message Queues — request queue for continuous batching; new requests join in-flight batches at token boundaries without head-of-line blocking
- Design a Monitoring & Alerting System — TTFT and TPS metrics require sub-second scrape intervals; GPU utilization and KV cache hit rate are key SLIs
- API Design (REST vs GraphQL vs gRPC) — SSE for streaming token delivery; OpenAI-compatible REST API as the de-facto standard; gRPC for internal pod communication