Introduction
Serving large language models (GPT-4, Llama) requires specialized infrastructure for low latency, high throughput, and GPU memory management. The key challenges are large model size (7B to 70B+ parameters) and the sequential nature of token generation, which limits parallelism at the request level.
Model Loading and GPU Memory
Model weights are loaded from storage into GPU VRAM at startup. A 7B parameter model in float16 requires approximately 14GB of VRAM. A 70B model requires approximately 140GB, necessitating multiple GPUs. The model is loaded once and kept resident in GPU memory; all concurrent requests share the same read-only weights. The VRAM budget must cover weights, KV cache, and activation memory.
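The sizing arithmetic above can be sketched as a small helper. This is an illustrative estimate only (weights alone, ignoring KV cache and activation memory); the function name and defaults are assumptions, not a real library API.

```python
def weight_vram_gb(num_params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM needed for model weights alone.

    bytes_per_param: 2 for float16/bfloat16, 1 for INT8, 0.5 for INT4.
    Ignores KV cache and activation memory, which must also fit in VRAM.
    """
    return num_params_billion * 1e9 * bytes_per_param / 1e9

print(weight_vram_gb(7))    # 14.0  -> ~14 GB for a 7B model in float16
print(weight_vram_gb(70))   # 140.0 -> ~140 GB for a 70B model
```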
KV Cache
During autoregressive generation, each new token attends to all previous tokens. Without caching, the key and value tensors for all prior tokens are recomputed at every step, so generating token N costs O(N^2). The KV cache stores the K and V tensors for each layer at each already-generated position. Subsequent steps compute Q, K, and V only for the new position, append the new K and V to the cache, and reuse the cached tensors for attention. This reduces per-token compute from O(N^2) to O(N). KV cache size grows linearly with sequence length.
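The linear growth of the cache can be made concrete with a sizing formula: two tensors (K and V) per layer per position, each of size num_kv_heads x head_dim. The model dimensions below are illustrative values resembling a 7B-class model, not figures from the text.

```python
def kv_cache_gb(seq_len: int, num_layers: int, num_kv_heads: int,
                head_dim: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer per cached position."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem * batch / 1e9

# Assumed 7B-class dimensions: 32 layers, 32 KV heads of dim 128, float16.
print(kv_cache_gb(seq_len=4096, num_layers=32, num_kv_heads=32, head_dim=128))
# ~2.15 GB per 4096-token sequence -- and it scales linearly with batch size.
```

Note that this is per sequence: a batch of 16 such requests needs roughly 34 GB of cache on top of the weights, which is why cache memory management dominates serving design.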
Continuous Batching
Naive (static) batching runs a fixed batch until every sequence finishes: slots freed by sequences that complete early sit idle while the GPU waits for the longest one. Continuous batching (iteration-level scheduling) checks after each token generation step whether any sequence is complete; completed sequences are evicted from the batch and new requests are inserted immediately, maximizing GPU utilization. PagedAttention (used in vLLM) stores the KV cache in non-contiguous fixed-size pages, similar to OS virtual memory, reducing fragmentation and enabling memory sharing across requests.
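The iteration-level scheduling loop can be sketched in a few lines. This toy model replaces real decoding with a per-request token budget; the function and variable names are invented for illustration.

```python
from collections import deque

def continuous_batching(requests: dict, max_batch: int = 4) -> list:
    """Iteration-level scheduling sketch.

    `requests` maps request id -> number of tokens left to generate
    (a toy stand-in for real decoding). Returns the batch composition
    at each step, to show slots being refilled immediately.
    """
    waiting = deque(requests.items())
    running = {}   # request id -> tokens remaining
    trace = []
    while waiting or running:
        # Admit waiting requests into free slots -- this happens every
        # iteration, not only when the whole batch drains (the key idea).
        while waiting and len(running) < max_batch:
            rid, toks = waiting.popleft()
            running[rid] = toks
        trace.append(sorted(running))
        # One generation step: each running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # evict the finished sequence now
    return trace

trace = continuous_batching({"a": 1, "b": 3, "c": 2, "d": 2, "e": 1}, max_batch=3)
print(trace)  # [['a', 'b', 'c'], ['b', 'c', 'd'], ['b', 'd', 'e']]
```

When "a" finishes after one step, "d" takes its slot on the very next iteration instead of waiting for "b" and "c" to drain, which is exactly the utilization win over static batching.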
Tensor Parallelism
Tensor parallelism splits the model across multiple GPUs: each GPU holds a slice of each weight matrix, computes a partial result, and an all-reduce operation synchronizes the results. Pipeline parallelism places different layers on different GPUs and uses micro-batching to hide the pipeline bubble. Tensor parallelism is preferred for latency-sensitive, small-batch requests. Pipeline parallelism is better suited for throughput-optimized serving with larger batches.
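The all-reduce step can be demonstrated with a pure-Python row-parallel matrix multiply simulating two GPUs. This is a minimal sketch of the math only; real implementations shard tensors on-device and use NCCL-style collectives.

```python
def matmul(x, w):
    """x: length-n vector, w: n x m matrix (plain lists of lists)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)

# Row-parallel split: each simulated "GPU" holds a horizontal slice of w
# and the matching slice of x, producing a partial result.
partial0 = matmul(x[:1], w[:1])   # "GPU 0"
partial1 = matmul(x[1:], w[1:])   # "GPU 1"

# The all-reduce is an element-wise sum of the partial results.
reduced = [a + b for a, b in zip(partial0, partial1)]

print(reduced == full)  # True
```

The communication cost of that sum at every sharded layer is why tensor parallelism wants fast interconnects (e.g., NVLink) and is typically kept within a single node.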
Quantization
Quantization reduces model size and increases throughput by using lower-precision weights. INT8 quantization achieves a 50% size reduction relative to float16 with minimal accuracy loss. INT4 quantization (GPTQ, AWQ) achieves a 75% reduction with a small accuracy loss. Weight-only quantization stores weights in INT4 and dequantizes them to float16 on the fly for computation. This enables serving a 70B model on two 40GB GPUs instead of four.
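The quantize/dequantize round trip can be sketched with symmetric per-tensor INT8 quantization. This illustrates the principle only; production schemes (GPTQ, AWQ) use per-group scales and calibration, and the function names here are invented.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: map the float range onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for computation (weight-only scheme).
    return [v * scale for v in q]

w = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(w)
print(q)                      # [50, -127, 3, 100] -- stored as 1 byte each
w_hat = dequantize(q, scale)  # close to w, up to rounding error
```

Only the INT8 values and one float scale are stored, giving the ~50% saving over float16; INT4 packs two values per byte for the ~75% figure, at the cost of coarser rounding.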
Streaming Token Generation
Rather than waiting for the full response, clients receive tokens as they are generated. The server sends tokens over SSE or a WebSocket stream, emitting each token immediately upon generation. This makes time-to-first-token, rather than total generation time, the latency the user perceives, which greatly improves UX in chatbots and coding assistants. Stream cancellation is handled when the client disconnects (e.g., the user navigates away): generation stops and GPU resources are freed.
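The SSE wire format is simple enough to sketch as a generator that wraps a token stream. The `[DONE]` sentinel is a common convention (popularized by OpenAI-style APIs), not part of the SSE standard itself; a real server would write each frame to the HTTP response as it is yielded.

```python
def sse_stream(token_iter):
    """Wrap generated tokens as Server-Sent Events frames.

    Each SSE event is a `data:` line terminated by a blank line.
    """
    for tok in token_iter:
        yield f"data: {tok}\n\n"   # flush one event per token
    yield "data: [DONE]\n\n"       # end-of-stream sentinel (convention)

frames = list(sse_stream(["Hel", "lo", "!"]))
print(frames[0])   # data: Hel
```

Because this is a generator, client disconnection maps naturally onto cancellation: when the server closes the generator, the decoding loop behind `token_iter` can stop and release its batch slot and KV cache.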
Auto-Scaling
GPU nodes are scaled based on queue depth and request rate. New node startup takes 2 to 5 minutes due to GPU allocation and model loading, so pre-warming keeps a pool of warm nodes sized to predicted traffic patterns. During off-hours the fleet can scale down to a minimum of one retained node; true scale-to-zero saves more but incurs the full cold-start delay on the next request. A router distributes requests across nodes and tracks per-node capacity. Spot or preemptible GPUs reduce cost; checkpointing handles unexpected preemptions.
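A queue-depth-based scaling decision can be sketched as follows. The policy, function name, and capacity numbers are assumptions for illustration; real autoscalers also smooth over time windows to avoid flapping.

```python
import math

def desired_nodes(queue_depth: int, per_node_capacity: int,
                  min_nodes: int = 1, max_nodes: int = 8) -> int:
    """Enough nodes to drain the queue, clamped to [min_nodes, max_nodes].

    min_nodes=1 keeps one warm node (the one-node floor described above),
    avoiding a 2-5 minute cold start on the first request after idle.
    """
    needed = math.ceil(queue_depth / per_node_capacity) if queue_depth else 0
    return max(min_nodes, min(max_nodes, needed))

print(desired_nodes(0, 32))      # 1 -- floor of one warm node
print(desired_nodes(100, 32))    # 4
print(desired_nodes(10000, 32))  # 8 -- capped; excess requests queue
```

The `max_nodes` cap is what makes the router's queue and per-node capacity tracking essential: beyond the cap, the system sheds load by queueing rather than scaling.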