Low Level Design: Tail Latency Optimization

Tail latency (p99, p999) is the response time experienced by the slowest few percent of requests. While average latency may be 10ms, a p99 latency of 500ms means 1 in 100 requests takes 50x longer. At scale, every user making multiple requests will regularly hit the tail, making tail latency reduction as important as median latency for user experience.

Root Causes of Tail Latency

Common root causes:

- GC pauses: JVM stop-the-world collections can pause for hundreds of milliseconds.
- Resource contention: CPU, disk, or network bandwidth saturation.
- Lock contention: database row locks, mutex contention in connection pools.
- Head-of-line blocking: a slow request blocks queue processing for subsequent requests.
- Background tasks: compaction in LSM databases, checkpoints in PostgreSQL.
- Variable load: a surge of requests fills the queue, increasing wait time for everything behind it.

Hedged Requests

A hedged request sends the same request to a second replica if the first is slow, and uses whichever response arrives first. The “hedging delay” is the threshold: if no response is received within it (e.g., the p95 latency, say 50ms), send the same request to another replica, take the first response, and cancel the other. This reduces tail latency by eliminating the variance at individual replicas — instead of waiting out a slow replica, the client races it against a fresh one. The cost is extra load on the service: hedging at the p95 threshold adds roughly 5% more requests.
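A minimal sketch of this in Python, assuming a synchronous `request_fn` called once per replica (the replica names and `fake_call` simulation below are illustrative, not a real client API):

```python
import concurrent.futures
import time

def hedged_request(replicas, request_fn, hedge_delay):
    """Send the request to the primary replica; if it hasn't answered
    within hedge_delay seconds, hedge to a backup replica and take
    whichever response arrives first."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(request_fn, replicas[0])]
    done, _ = concurrent.futures.wait(futures, timeout=hedge_delay)
    if not done:  # primary is slow past the threshold: hedge
        futures.append(pool.submit(request_fn, replicas[1]))
    done, pending = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    for fut in pending:
        fut.cancel()          # best-effort cancellation of the loser
    pool.shutdown(wait=False) # don't block on the slow replica
    return done.pop().result()

# Simulated replicas: "a" is stuck, "b" answers quickly.
def fake_call(replica):
    time.sleep(0.3 if replica == "a" else 0.01)
    return f"response from {replica}"

print(hedged_request(["a", "b"], fake_call, hedge_delay=0.05))
# → response from b
```

A real implementation would also cancel the losing request on the wire (see the cancellation section below); `Future.cancel` only prevents work that has not yet started.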

Good Enough Results

For requests that fan out across multiple shards (search, recommendations), instead of waiting for all shards to respond, return results as soon as enough shards have responded. Example: a search across 100 shards returns results after 95 shards reply, discarding the slowest 5 shards' results. The user gets a slightly incomplete result set immediately rather than waiting for a slow shard. This technique (used by Google, Bing) trades result completeness for dramatically lower tail latency at scale.
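The fan-out-with-quorum idea can be sketched with `concurrent.futures.as_completed` (shard count, the straggler, and `query_shard` are all simulated here):

```python
import concurrent.futures
import time

def fanout_query(shards, query_fn, quorum):
    """Fan query_fn out to every shard; return as soon as `quorum`
    shards have answered, discarding the stragglers' results."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(shards))
    futures = [pool.submit(query_fn, s) for s in shards]
    results = []
    for fut in concurrent.futures.as_completed(futures):
        results.append(fut.result())
        if len(results) >= quorum:
            break  # good enough: stop waiting for slow shards
    for fut in futures:
        fut.cancel()          # best-effort: skip not-yet-started work
    pool.shutdown(wait=False) # don't block on the stragglers
    return results

# Simulated shards: shard 7 is a straggler.
def query_shard(shard_id):
    time.sleep(0.3 if shard_id == 7 else 0.01)
    return f"hits from shard {shard_id}"

results = fanout_query(list(range(10)), query_shard, quorum=9)
print(len(results))  # → 9: the straggler's result was discarded
```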

Request Cancellation and Timeouts

Propagate cancellation signals through the entire call chain. When the client disconnects or the request exceeds its deadline, cancel all downstream work immediately. In gRPC, the context carries the deadline; each downstream call inherits it and aborts when the deadline is reached. Without cascading cancellation, timed-out requests continue consuming resources downstream — wasting compute on results the client will never see. This is especially important in fan-out scenarios where one slow shard triggers 99 other shards to do work for nothing.
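The deadline-propagation idea can be sketched in plain Python; the `Deadline` class below is an illustrative stand-in for a gRPC context, not a real API:

```python
import time

class DeadlineExceeded(Exception):
    pass

class Deadline:
    """Absolute deadline handed down the call chain: every hop checks
    the time remaining and aborts instead of doing doomed work."""
    def __init__(self, timeout_s):
        self.expires_at = time.monotonic() + timeout_s

    def remaining(self):
        return self.expires_at - time.monotonic()

    def check(self):
        if self.remaining() <= 0:
            raise DeadlineExceeded("deadline passed; cancel downstream work")

def handle_request(deadline):
    deadline.check()                 # abort before doing any work
    rows = query_database(deadline)  # downstream call inherits the deadline
    deadline.check()                 # abort before post-processing too
    return [r.upper() for r in rows]

def query_database(deadline):
    deadline.check()
    # A real client would pass deadline.remaining() as the query timeout.
    return ["row1", "row2"]

print(handle_request(Deadline(timeout_s=1.0)))  # → ['ROW1', 'ROW2']
try:
    handle_request(Deadline(timeout_s=0.0))
except DeadlineExceeded as e:
    print("cancelled:", e)
```

The key property is that one deadline object flows through every hop, so a request that has already timed out is rejected at the next `check()` rather than consuming downstream resources.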

Thread Pool Sizing

Thread pools that are too large cause context switching overhead and memory pressure, inflating p99 latency. Pools that are too small under-utilize CPU and create queuing. For CPU-bound work: thread pool size = number of CPU cores. For I/O-bound work (most web services): size = cores * (1 + wait_time / compute_time). Separate thread pools for different request types: fast requests (health checks, cached reads) and slow requests (complex queries, database writes) should not share a pool — slow requests block fast ones.
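The I/O-bound sizing formula is easy to apply directly; a small worked calculation (the 90ms/10ms split is an assumed example workload):

```python
import os

def io_pool_size(cores, wait_time_ms, compute_time_ms):
    """Size an I/O-bound pool: threads = cores * (1 + wait / compute),
    so threads blocked on I/O don't leave cores idle."""
    return round(cores * (1 + wait_time_ms / compute_time_ms))

cores = os.cpu_count() or 8
cpu_pool = cores  # CPU-bound work: one thread per core

# e.g. a handler spending 90 ms waiting on the DB and 10 ms on CPU:
io_pool = io_pool_size(cores=8, wait_time_ms=90, compute_time_ms=10)
print(io_pool)  # → 80  (8 cores * (1 + 90/10))
```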

Reducing GC Pauses

JVM GC pauses are a major tail latency source. Mitigations: use low-latency GC algorithms (ZGC, Shenandoah) that pause for under 1ms regardless of heap size; reduce allocation rate (object pooling, off-heap storage for large caches); right-size the heap (too large = infrequent but long GCs; too small = frequent GCs). For Go: tune GOGC environment variable to balance memory usage and GC frequency. Profile GC logs to identify allocation hotspots before tuning.
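One of the mitigations above, object pooling to cut the allocation rate, can be sketched in Python; the idea carries over to the JVM, and `BufferPool` here is illustrative, not a library API:

```python
class BufferPool:
    """Minimal object pool: reuse large buffers across requests instead
    of re-allocating them, lowering allocation rate and GC pressure."""
    def __init__(self, buffer_size, max_pooled):
        self._free = []
        self._buffer_size = buffer_size
        self._max_pooled = max_pooled

    def acquire(self):
        if self._free:
            return self._free.pop()           # reuse a recycled buffer
        return bytearray(self._buffer_size)   # allocate only on a miss

    def release(self, buf):
        if len(self._free) < self._max_pooled:
            self._free.append(buf)            # recycle instead of freeing

pool = BufferPool(buffer_size=64 * 1024, max_pooled=32)
a = pool.acquire()
pool.release(a)
b = pool.acquire()
print(a is b)  # → True: the buffer was reused, not re-allocated
```

A real pool would also zero or reset buffers on release and bound concurrent access with a lock.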

Load Shedding

Under extreme load, accepting every request means serving all of them slowly, which degrades p99 latency for everyone. Load shedding rejects excess requests (return 503) before the system becomes saturated. Adaptive load shedding measures current queue depth or latency; when it exceeds a threshold, begin rejecting new requests. This maintains acceptable latency for accepted requests at the cost of rejecting some. Use priority: shed low-priority requests (background analytics) before high-priority ones (user-facing). Rejected requests can be retried by the client.
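A minimal sketch of queue-depth-based shedding with two priority tiers (`AdaptiveShedder` and its thresholds are illustrative, not a real library):

```python
import collections

class AdaptiveShedder:
    """Reject new requests (the caller returns 503) once the observed
    queue depth crosses a threshold; low-priority traffic is shed first."""
    def __init__(self, max_depth_low, max_depth_high):
        self.queue = collections.deque()
        self.max_depth_low = max_depth_low    # shed analytics early
        self.max_depth_high = max_depth_high  # shed user traffic late

    def try_admit(self, request, high_priority):
        limit = self.max_depth_high if high_priority else self.max_depth_low
        if len(self.queue) >= limit:
            return False  # shed: caller returns 503, client may retry
        self.queue.append(request)
        return True

shedder = AdaptiveShedder(max_depth_low=2, max_depth_high=4)
admitted = [shedder.try_admit(f"req{i}", high_priority=False) for i in range(3)]
print(admitted)  # → [True, True, False]: third low-priority request shed
print(shedder.try_admit("vip", high_priority=True))  # → True: still room
```

A production shedder would measure latency or queue wait time rather than raw depth, and adapt the thresholds over time.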
