How does Count-Min Sketch detect hot keys in a distributed cache?

A Count-Min Sketch (CMS) is a probabilistic frequency table with d hash functions and w counters per row. Each cache request increments d counter cells for the key; frequency is estimated as the minimum across those cells. Because the structure is fixed-size and mergeable, each cache node maintains a local CMS that is periodically aggregated across the fleet. Keys whose estimated frequency exceeds a configurable threshold are promoted to the hot-key mitigation path.

How does a local in-process cache mitigate hot key pressure?

A small in-process LRU or TinyLFU cache (typically a few hundred megabytes) in each service instance absorbs repetitive reads for detected hot keys before they reach the remote cache tier. Hot-key detection flags a key for local caching with a short TTL (seconds to tens of seconds) to limit staleness. This can reduce remote cache request rate by 99 %+ for celebrity-profile or trending-content scenarios without any client-visible behaviour change.

What is key splitting and how does it help with hot key mitigation?

Key splitting replicates a hot key's value across N virtual shards named key#0 through key#(N-1). Reads are distributed across shards by hashing a random or per-request salt, spreading load across N cache nodes instead of one. Writes update all N shards (fan-out) or use an invalidation strategy with lazy repopulation. An aggregation layer (or the client library) is aware of the shard count and reassembles results when the value is a counter or collection.

How does request coalescing with singleflight prevent cache stampedes on hot keys?

Singleflight (also called request coalescing or thundering-herd suppression) groups concurrent in-flight requests for the same key so that only one upstream fetch is issued. All waiters block on a shared future/promise and receive the same result when it resolves. This is implemented with a map of in-flight keys protected by a mutex. On a cache miss for a hot key, hundreds of concurrent requests collapse into a single backend call, eliminating the stampede that would otherwise overwhelm the origin.

Hot Key Mitigation Low-Level Design: Local Cache, Key Splitting, and Request Coalescing

⏱ 6 min read

Hot Key Mitigation Service: Overview and Requirements

A hot key occurs when a disproportionate fraction of cache or database traffic targets a single key. This creates a bottleneck on a single node even when the rest of the cluster is idle. A hot key mitigation service detects these keys automatically and applies layered defenses without requiring application code changes.

Functional Requirements

Detect hot keys in real time using request frequency monitoring.
Apply local in-process caching to absorb reads before they reach the shared cache tier.
Split a single logical key into N physical keys with transparent aggregation on write and read.
Coalesce concurrent requests for the same key into a single upstream fetch.
Provide a control plane API for manual key promotion and demotion.

Non-Functional Requirements

Detection latency must be under 5 seconds from hot-key onset to mitigation activation.
Local cache hit must add no more than 50 microseconds of overhead.
Key splitting must be transparent: callers use the original key and receive the correct aggregated value.
Coalescing must not introduce correctness issues under concurrent invalidation.

Data Model

HotKeyRegistry: key_hash, raw_key, detected_at, mitigation_type (local_cache | split | coalesce), split_factor, ttl_seconds, status.
LocalCacheEntry: key_hash, value_bytes, populated_at, expires_at, hit_count.
CoalesceSlot: key_hash, inflight_request_id, waiters (list of goroutine/thread handles), result, err, resolved_at.
FrequencyCounter: key_hash, window_start, count — stored in a sliding-window structure per proxy process.

Hot Key Detection Algorithm

Each cache proxy process maintains a Count-Min Sketch per 1-second time window. The sketch uses four hash functions over 2048 buckets, giving accurate frequency estimates with O(1) memory per key regardless of cardinality.

On each cache request the proxy increments the sketch counter for the request key. At the end of each window, keys whose estimated count exceeds a configurable threshold — for example, 1000 requests per second per proxy instance — are reported to a central aggregator. The aggregator sums counts across all proxy instances. If the cluster-wide frequency exceeds the hot-key threshold, the key is written to the HotKeyRegistry and mitigation activates within the next proxy heartbeat cycle, typically under 2 seconds.

Local In-Process Cache

Each proxy or application server maintains a small fixed-size LRU cache, typically 512 to 2048 entries, populated only for keys in the HotKeyRegistry. This cache is backed by a concurrent hash map with striped locking.

On a registry hit, the local cache is checked before any network call to the shared cache.
TTL for local cache entries is set to a fraction of the shared cache TTL, typically one-fifth, to bound staleness.
On write or invalidation of the original key, a pub-sub channel broadcasts invalidation to all proxy instances, which immediately evict the local entry.

Key Splitting with Aggregation

For mutable keys — counters, sets, or lists — the local cache alone is insufficient. Key splitting distributes a single logical key across N physical shards in the cache cluster:

On write, the value delta is applied to shard key:rand(0,N-1), chosen at random to distribute write load.
On read, the proxy fetches all N shards in parallel and aggregates: summing for counters, unioning for sets, or merging sorted lists.
N is chosen based on the hot-key frequency divided by the per-node write capacity, rounded up to the next power of two.
A background reconciliation job periodically re-balances shard values to prevent drift under non-uniform write patterns.

Request Coalescing

When a hot key entry is absent from the local cache and a fetch is required, multiple concurrent requests for the same key are collapsed into a single upstream call using a singleflight pattern:

The first request for a given key acquires ownership of a CoalesceSlot and issues the upstream fetch.
Subsequent requests for the same key during the fetch period register as waiters on the slot.
When the fetch completes, the result is broadcast to all waiters atomically.
If the fetch fails, all waiters receive the error immediately rather than each retrying independently, which would amplify load on a degraded backend.

The coalesce window is bounded by a maximum wait timeout. If the upstream fetch exceeds this timeout, waiters can optionally be served a stale value from the local cache if one exists.

API Design

GET /hotkeys — list currently detected hot keys with frequency, mitigation type, and split factor.
POST /hotkeys/{key}/promote — manually force a key into the hot-key registry with a specified mitigation type.
DELETE /hotkeys/{key} — remove a key from the registry and deactivate mitigation.
GET /hotkeys/{key}/stats — return hit rate, local cache hit ratio, coalesce collapse ratio, and split shard counts.

Scalability and Observability

Detection is fully distributed with no single point of coordination. The aggregator is lightweight and can be replicated. Local caches are process-local and require no cross-instance synchronization beyond invalidation pub-sub.

Key metrics: local cache hit ratio per hot key, coalesce collapse ratio (waiters served per upstream fetch), split read fan-out latency, detection latency from onset to registry write, and frequency sketch estimation error rate validated against exact counts sampled at low frequency.