What is a cache stampede and how does stale-while-revalidate prevent it?

Cache stampede (dogpile) occurs when a popular cache key expires and hundreds of servers simultaneously query the database to regenerate it — overwhelming the database. Stale-while-revalidate returns the stale cached value immediately (eliminating the stampede) while a background process regenerates the cache. The client receives a fast response (stale data), the database is hit only once (by the background regeneration), and the cache is updated before the next request. This works when data can tolerate seconds of staleness. For data that must be fresh (financial balances), use mutex locking instead.

What is request coalescing (singleflight) and how does it prevent thundering herds?

Request coalescing deduplicates simultaneous identical requests: if 100 requests arrive for the same cache miss, only one database query is issued. The other 99 wait and share the result when the first completes. Implementation: maintain a map of in-progress requests keyed by the request identifier. New requests check the map — if a request is in-progress for the same key, subscribe to its result instead of issuing a new database query. When the query completes, fan the result to all subscribers. Go's golang.org/x/sync/singleflight implements this. This eliminates the N-database-queries-per-cache-miss problem entirely, regardless of concurrency.

Why should cache TTLs have jitter added to them?

When many cache entries are populated simultaneously (batch warm-up, service restart, midnight cache refresh), they expire at the same time — causing a correlated stampede. TTL jitter randomizes expiration: instead of TTL=3600, use TTL=3600+random(-300,300). This spreads expirations over a 10-minute window, distributing cache regeneration load evenly over time. Without jitter, a service that warms 10,000 cache entries at startup will see all 10,000 expire at once one hour later. Apply jitter to all TTLs set in bulk operations — the overhead is negligible and the protection against correlated expirations is significant.

How does probabilistic early expiration (XFetch) prevent cache stampedes?

XFetch avoids the cache stampede by triggering early cache regeneration before the TTL expires. On each cache read, compute whether to recompute: recompute if -beta * delta * ln(random()) > remaining_TTL, where delta is the time to regenerate the value and beta is a tuning parameter (default 1.0). As the TTL decreases, the probability of early recomputation increases — spreading regeneration load across requests over time rather than concentrating it at the expiry moment. This means one request regenerates the cache before it expires, and subsequent requests after the TTL continue to find a fresh cached value. No mutex, no stampede.

Low Level Design: Thundering Herd Prevention

⏱ 6 min read

The thundering herd problem occurs when many processes or threads simultaneously wake up to compete for the same scarce resource — overwhelming the system. Common causes: a cache expiry that triggers simultaneous database queries from hundreds of servers, a circuit breaker opening that sends all traffic to a backup system simultaneously, or a spike of connections on server restart. Solving thundering herd requires preventing simultaneous contention before it occurs.

Cache Stampede

Cache stampede (cache dogpile) is the most common thundering herd variant. A popular cache key expires; simultaneously, hundreds of servers query the database to regenerate it — the database is overwhelmed by identical queries. Solutions: mutex locking (only one server may regenerate the cache at a time; others wait or return stale data), early cache expiration (probabilistic early recomputation — before the TTL actually expires, each request has a small probability of recomputing the cache; the probability increases as expiration approaches, spreading the regeneration load), and stale-while-revalidate (return the stale cached value immediately while a background process regenerates the cache — eliminates the stampede entirely; requires accepting briefly stale data).

Probabilistic Early Expiration

XFetch (Optimal Probabilistic Cache Stampede Prevention): when a cache entry is fetched, compute whether to regenerate it before expiry. Recompute if: -beta * delta * ln(random()) > time_to_expiry. Where delta = time to recompute the value (seconds), beta = tuning parameter (1.0 is a good default). This formula ensures that as time_to_expiry decreases, more requests trigger early recomputation. On average, recomputation happens once before expiry, spread randomly across requests — no stampede. The key insight: don’t wait for expiry; start recomputing probabilistically while the current value is still valid.

Mutex Lock Approach

When a cache miss occurs, acquire a distributed lock (Redis SET NX with TTL) before querying the database. If the lock is acquired, query the database, populate the cache, and release the lock. If the lock is not acquired (another process is regenerating), either: wait for the lock and then read the freshly-populated cache (reduces stampede but adds latency for waiters), or return the stale cached value (if available) while another process regenerates (no additional latency). The lock must have a TTL slightly longer than the maximum expected regeneration time — a crashed process holding the lock must not block indefinitely. Use a shorter lock TTL than the regeneration time with lock renewal (extend TTL while regeneration is in progress).

TTL Jitter

When many cache entries are set simultaneously (e.g., all users’ sessions at startup, all product prices at midnight batch update), they expire simultaneously — causing a mass stampede. TTL jitter randomizes expiry to spread the load: instead of TTL = 3600s, use TTL = 3600 + random(-300, 300). This ensures expirations are spread over a 10-minute window rather than all occurring at the same second. Apply jitter to: session TTLs (spread logouts across a window), cache warming TTLs (spread cache coldness), and cron-triggered cache invalidation (add jitter to the cron trigger). Jitter is a simple, effective defense against correlated cache expirations.

Thundering Herd on Server Restart

When a backend server restarts, clients that were connected re-establish connections simultaneously — overloading the server before it is ready. Solutions: exponential backoff with jitter in client reconnection logic (all clients wait different intervals before reconnecting); connection draining on shutdown (the server stops accepting new connections before terminating, giving existing clients time to reconnect to other servers); and graceful warm-up (the load balancer routes a small fraction of traffic to a new instance initially, increasing as the instance’s CPU and memory warm up — avoids the cold-start spike). Kubernetes readiness probes prevent traffic to pods that are not fully initialized.

Request Coalescing

Request coalescing deduplicates in-flight identical requests: if 100 requests for the same cache miss arrive simultaneously, only one query is sent to the database; the other 99 wait for the first to complete and return the same result. Implementation: a singleflight map keyed by the request’s cache key. When a request arrives: if a singleflight is already in-progress for this key, subscribe to the in-progress result (don’t issue a new query); if no singleflight exists, start a new one and issue the query. When the query completes, fan the result out to all waiting subscribers. Go’s golang.org/x/sync/singleflight package implements this pattern. Request coalescing eliminates duplicate database queries for the same data regardless of how many concurrent requests arrive.

Circuit Breaker Thundering Herd

When a circuit breaker opens and then transitions to half-open, it typically allows one request through. If the request succeeds, the breaker closes and all queued requests flood through simultaneously — a thundering herd at the dependency. Mitigations: half-open gradual recovery (allow a small percentage of requests through, not a single binary allow-all), exponential backoff before allowing requests through the half-open breaker (delay the first recovery attempt, then gradually allow more), and sliding window recovery (gradually increase the allowed fraction over a recovery period rather than snapping to 100% immediately). Configure the maximum number of concurrent requests allowed through in half-open state and increase gradually as the dependency proves healthy.