Low Level Design: Tail Latency Optimization

Tail latency (p99, p999) is the response time experienced by the slowest few percent of requests. While average latency may be 10ms, a p99 latency of 500ms means 1 in 100 requests takes 50x longer. At scale, every user making multiple requests will regularly hit the tail, making tail latency reduction as important as median latency for user experience.

Root Causes of Tail Latency

Common root causes:

- GC pauses: JVM stop-the-world collections can pause for hundreds of milliseconds.
- Resource contention: CPU, disk, or network bandwidth saturation.
- Lock contention: database row locks, mutex contention in connection pools.
- Head-of-line blocking: a slow request blocks queue processing for subsequent requests.
- Background tasks: compaction in LSM databases, checkpoints in PostgreSQL.
- Variable load: a surge of requests fills the queue, increasing wait time for everything behind it.

Hedged Requests

A hedged request sends the same request to a second replica if the first is slow, and uses whichever response arrives first. The “hedging delay” is the threshold: if no response is received within it (e.g., the p95 latency, say 50ms), send the same request to another replica, take the first response, and cancel the other. This reduces tail latency by eliminating the variance at individual replicas — instead of waiting out a slow replica, the client races it against a fresh one. The cost is extra load on the service: hedging at the p95 threshold adds roughly 5% more requests.
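A minimal sketch of this in Python, assuming a synchronous `request_fn` called once per replica (the replica names and `fake_call` simulation below are illustrative, not a real client API):

```python
import concurrent.futures
import time

def hedged_request(replicas, request_fn, hedge_delay):
    """Send the request to the primary replica; if it hasn't answered
    within hedge_delay seconds, hedge to a backup replica and take
    whichever response arrives first."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(request_fn, replicas[0])]
    done, _ = concurrent.futures.wait(futures, timeout=hedge_delay)
    if not done:  # primary is slow past the threshold: hedge
        futures.append(pool.submit(request_fn, replicas[1]))
    done, pending = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    for fut in pending:
        fut.cancel()          # best-effort cancellation of the loser
    pool.shutdown(wait=False) # don't block on the slow replica
    return done.pop().result()

# Simulated replicas: "a" is stuck, "b" answers quickly.
def fake_call(replica):
    time.sleep(0.3 if replica == "a" else 0.01)
    return f"response from {replica}"

print(hedged_request(["a", "b"], fake_call, hedge_delay=0.05))
# → response from b
```

A real implementation would also cancel the losing request on the wire (see the cancellation section below); `Future.cancel` only prevents work that has not yet started.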

Good Enough Results

For requests that fan out across multiple shards (search, recommendations), instead of waiting for all shards to respond, return results as soon as enough shards have responded. Example: a search across 100 shards returns results after 95 shards reply, discarding the slowest 5 shards' results. The user gets a slightly incomplete result set immediately rather than waiting for a slow shard. This technique (used by Google, Bing) trades result completeness for dramatically lower tail latency at scale.
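The fan-out-with-quorum idea can be sketched with `concurrent.futures.as_completed` (shard count, the straggler, and `query_shard` are all simulated here):

```python
import concurrent.futures
import time

def fanout_query(shards, query_fn, quorum):
    """Fan query_fn out to every shard; return as soon as `quorum`
    shards have answered, discarding the stragglers' results."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(shards))
    futures = [pool.submit(query_fn, s) for s in shards]
    results = []
    for fut in concurrent.futures.as_completed(futures):
        results.append(fut.result())
        if len(results) >= quorum:
            break  # good enough: stop waiting for slow shards
    for fut in futures:
        fut.cancel()          # best-effort: skip not-yet-started work
    pool.shutdown(wait=False) # don't block on the stragglers
    return results

# Simulated shards: shard 7 is a straggler.
def query_shard(shard_id):
    time.sleep(0.3 if shard_id == 7 else 0.01)
    return f"hits from shard {shard_id}"

results = fanout_query(list(range(10)), query_shard, quorum=9)
print(len(results))  # → 9: the straggler's result was discarded
```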

Request Cancellation and Timeouts

Propagate cancellation signals through the entire call chain. When the client disconnects or the request exceeds its deadline, cancel all downstream work immediately. In gRPC, the context carries the deadline; each downstream call inherits it and aborts when the deadline is reached. Without cascading cancellation, timed-out requests continue consuming resources downstream — wasting compute on results the client will never see. This is especially important in fan-out scenarios where one slow shard triggers 99 other shards to do work for nothing.
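The deadline-propagation idea can be sketched in plain Python; the `Deadline` class below is an illustrative stand-in for a gRPC context, not a real API:

```python
import time

class DeadlineExceeded(Exception):
    pass

class Deadline:
    """Absolute deadline handed down the call chain: every hop checks
    the time remaining and aborts instead of doing doomed work."""
    def __init__(self, timeout_s):
        self.expires_at = time.monotonic() + timeout_s

    def remaining(self):
        return self.expires_at - time.monotonic()

    def check(self):
        if self.remaining() <= 0:
            raise DeadlineExceeded("deadline passed; cancel downstream work")

def handle_request(deadline):
    deadline.check()                 # abort before doing any work
    rows = query_database(deadline)  # downstream call inherits the deadline
    deadline.check()                 # abort before post-processing too
    return [r.upper() for r in rows]

def query_database(deadline):
    deadline.check()
    # A real client would pass deadline.remaining() as the query timeout.
    return ["row1", "row2"]

print(handle_request(Deadline(timeout_s=1.0)))  # → ['ROW1', 'ROW2']
try:
    handle_request(Deadline(timeout_s=0.0))
except DeadlineExceeded as e:
    print("cancelled:", e)
```

The key property is that one deadline object flows through every hop, so a request that has already timed out is rejected at the next `check()` rather than consuming downstream resources.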

Thread Pool Sizing

Thread pools that are too large cause context switching overhead and memory pressure, inflating p99 latency. Pools that are too small under-utilize CPU and create queuing. For CPU-bound work: thread pool size = number of CPU cores. For I/O-bound work (most web services): size = cores * (1 + wait_time / compute_time). Separate thread pools for different request types: fast requests (health checks, cached reads) and slow requests (complex queries, database writes) should not share a pool — slow requests block fast ones.
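The I/O-bound sizing formula is easy to apply directly; a small worked calculation (the 90ms/10ms split is an assumed example workload):

```python
import os

def io_pool_size(cores, wait_time_ms, compute_time_ms):
    """Size an I/O-bound pool: threads = cores * (1 + wait / compute),
    so threads blocked on I/O don't leave cores idle."""
    return round(cores * (1 + wait_time_ms / compute_time_ms))

cores = os.cpu_count() or 8
cpu_pool = cores  # CPU-bound work: one thread per core

# e.g. a handler spending 90 ms waiting on the DB and 10 ms on CPU:
io_pool = io_pool_size(cores=8, wait_time_ms=90, compute_time_ms=10)
print(io_pool)  # → 80  (8 cores * (1 + 90/10))
```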

Reducing GC Pauses

JVM GC pauses are a major tail latency source. Mitigations: use low-latency GC algorithms (ZGC, Shenandoah) that pause for under 1ms regardless of heap size; reduce allocation rate (object pooling, off-heap storage for large caches); right-size the heap (too large = infrequent but long GCs; too small = frequent GCs). For Go: tune GOGC environment variable to balance memory usage and GC frequency. Profile GC logs to identify allocation hotspots before tuning.
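One of the mitigations above, object pooling to cut the allocation rate, can be sketched in Python; the idea carries over to the JVM, and `BufferPool` here is illustrative, not a library API:

```python
class BufferPool:
    """Minimal object pool: reuse large buffers across requests instead
    of re-allocating them, lowering allocation rate and GC pressure."""
    def __init__(self, buffer_size, max_pooled):
        self._free = []
        self._buffer_size = buffer_size
        self._max_pooled = max_pooled

    def acquire(self):
        if self._free:
            return self._free.pop()           # reuse a recycled buffer
        return bytearray(self._buffer_size)   # allocate only on a miss

    def release(self, buf):
        if len(self._free) < self._max_pooled:
            self._free.append(buf)            # recycle instead of freeing

pool = BufferPool(buffer_size=64 * 1024, max_pooled=32)
a = pool.acquire()
pool.release(a)
b = pool.acquire()
print(a is b)  # → True: the buffer was reused, not re-allocated
```

A real pool would also zero or reset buffers on release and bound concurrent access with a lock.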

Load Shedding

Under extreme load, accepting every request means serving all of them slowly, which degrades p99 latency for everyone. Load shedding rejects excess requests (return 503) before the system becomes saturated. Adaptive load shedding measures current queue depth or latency; when it exceeds a threshold, begin rejecting new requests. This maintains acceptable latency for accepted requests at the cost of rejecting some. Use priority: shed low-priority requests (background analytics) before high-priority ones (user-facing). Rejected requests can be retried by the client.
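A minimal sketch of queue-depth-based shedding with two priority tiers (`AdaptiveShedder` and its thresholds are illustrative, not a real library):

```python
import collections

class AdaptiveShedder:
    """Reject new requests (the caller returns 503) once the observed
    queue depth crosses a threshold; low-priority traffic is shed first."""
    def __init__(self, max_depth_low, max_depth_high):
        self.queue = collections.deque()
        self.max_depth_low = max_depth_low    # shed analytics early
        self.max_depth_high = max_depth_high  # shed user traffic late

    def try_admit(self, request, high_priority):
        limit = self.max_depth_high if high_priority else self.max_depth_low
        if len(self.queue) >= limit:
            return False  # shed: caller returns 503, client may retry
        self.queue.append(request)
        return True

shedder = AdaptiveShedder(max_depth_low=2, max_depth_high=4)
admitted = [shedder.try_admit(f"req{i}", high_priority=False) for i in range(3)]
print(admitted)  # → [True, True, False]: third low-priority request shed
print(shedder.try_admit("vip", high_priority=True))  # → True: still room
```

A production shedder would measure latency or queue wait time rather than raw depth, and adapt the thresholds over time.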
