Hot Key Mitigation Service: Overview and Requirements
A hot key occurs when a disproportionate fraction of cache or database traffic targets a single key. This creates a bottleneck on a single node even when the rest of the cluster is idle. A hot key mitigation service detects these keys automatically and applies layered defenses without requiring application code changes.
Functional Requirements
- Detect hot keys in real time using request frequency monitoring.
- Apply local in-process caching to absorb reads before they reach the shared cache tier.
- Split a single logical key into N physical keys with transparent aggregation on write and read.
- Coalesce concurrent requests for the same key into a single upstream fetch.
- Provide a control plane API for manual key promotion and demotion.
Non-Functional Requirements
- Detection latency must be under 5 seconds from hot-key onset to mitigation activation.
- Local cache hit must add no more than 50 microseconds of overhead.
- Key splitting must be transparent: callers use the original key and receive the correct aggregated value.
- Coalescing must not introduce correctness issues under concurrent invalidation.
Data Model
- HotKeyRegistry: key_hash, raw_key, detected_at, mitigation_type (local_cache | split | coalesce), split_factor, ttl_seconds, status.
- LocalCacheEntry: key_hash, value_bytes, populated_at, expires_at, hit_count.
- CoalesceSlot: key_hash, inflight_request_id, waiters (list of goroutine/thread handles), result, err, resolved_at.
- FrequencyCounter: key_hash, window_start, count — stored in a sliding-window structure per proxy process.
Hot Key Detection Algorithm
Each cache proxy process maintains a Count-Min Sketch per 1-second time window. The sketch uses four hash functions over 2048 buckets, giving accurate frequency estimates with O(1) memory per key regardless of cardinality.
On each cache request the proxy increments the sketch counter for the request key. At the end of each window, keys whose estimated count exceeds a configurable threshold — for example, 1000 requests per second per proxy instance — are reported to a central aggregator. The aggregator sums counts across all proxy instances. If the cluster-wide frequency exceeds the hot-key threshold, the key is written to the HotKeyRegistry and mitigation activates within the next proxy heartbeat cycle, typically under 2 seconds.
Local In-Process Cache
Each proxy or application server maintains a small fixed-size LRU cache, typically 512 to 2048 entries, populated only for keys in the HotKeyRegistry. This cache is backed by a concurrent hash map with striped locking.
- On a registry hit, the local cache is checked before any network call to the shared cache.
- TTL for local cache entries is set to a fraction of the shared cache TTL, typically one-fifth, to bound staleness.
- On write or invalidation of the original key, a pub-sub channel broadcasts invalidation to all proxy instances, which immediately evict the local entry.
Key Splitting with Aggregation
For mutable keys — counters, sets, or lists — the local cache alone is insufficient. Key splitting distributes a single logical key across N physical shards in the cache cluster:
- On write, the value delta is applied to shard
key:rand(0,N-1), chosen at random to distribute write load. - On read, the proxy fetches all N shards in parallel and aggregates: summing for counters, unioning for sets, or merging sorted lists.
- N is chosen based on the hot-key frequency divided by the per-node write capacity, rounded up to the next power of two.
- A background reconciliation job periodically re-balances shard values to prevent drift under non-uniform write patterns.
Request Coalescing
When a hot key entry is absent from the local cache and a fetch is required, multiple concurrent requests for the same key are collapsed into a single upstream call using a singleflight pattern:
- The first request for a given key acquires ownership of a CoalesceSlot and issues the upstream fetch.
- Subsequent requests for the same key during the fetch period register as waiters on the slot.
- When the fetch completes, the result is broadcast to all waiters atomically.
- If the fetch fails, all waiters receive the error immediately rather than each retrying independently, which would amplify load on a degraded backend.
The coalesce window is bounded by a maximum wait timeout. If the upstream fetch exceeds this timeout, waiters can optionally be served a stale value from the local cache if one exists.
API Design
GET /hotkeys— list currently detected hot keys with frequency, mitigation type, and split factor.POST /hotkeys/{key}/promote— manually force a key into the hot-key registry with a specified mitigation type.DELETE /hotkeys/{key}— remove a key from the registry and deactivate mitigation.GET /hotkeys/{key}/stats— return hit rate, local cache hit ratio, coalesce collapse ratio, and split shard counts.
Scalability and Observability
Detection is fully distributed with no single point of coordination. The aggregator is lightweight and can be replicated. Local caches are process-local and require no cross-instance synchronization beyond invalidation pub-sub.
Key metrics: local cache hit ratio per hot key, coalesce collapse ratio (waiters served per upstream fetch), split read fan-out latency, detection latency from onset to registry write, and frequency sketch estimation error rate validated against exact counts sampled at low frequency.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Anthropic Interview Guide 2026: Process, Questions, and AI Safety