Low Level Design: Distributed Caching Layer

What Is a Distributed Caching Layer?

A distributed caching layer sits between your application and its primary data store, serving frequently accessed data from fast in-memory storage. Designing it well requires choosing the right topology, cache fill strategy, eviction policy, and dealing with operational hazards like cache stampedes and hot keys.

Cache Topology Options

Three topologies cover most use cases:

  • Single node — simplest setup, zero replication overhead, but a single point of failure and limited capacity.
  • Master-replica — one writable primary, one or more read replicas. Reads scale horizontally; writes remain a bottleneck. Replica lag means reads can be slightly stale.
  • Cluster with consistent hashing — data is partitioned across many nodes. Redis Cluster uses 16,384 hash slots, with contiguous slot ranges assigned to nodes. A key’s slot is computed as CRC16(key) mod 16384, so the owning node is deterministic. Adding or removing nodes triggers a slot migration, not a full rehash. Client-side consistent hashing (e.g., Ketama) achieves the same goal: only ~1/N keys remap when a node is added.
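The slot computation can be sketched in a few lines of Python. Redis uses the CRC16-CCITT (XMODEM) variant, whose check value for the reference input "123456789" is 0x31C3; the hash-tag handling below follows the cluster spec's rule that if a key contains a non-empty {tag}, only the tag is hashed, so related keys land on the same slot.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM), the variant Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    # If the key contains a non-empty {hash tag}, only the tag is hashed,
    # so related keys can be forced onto the same slot.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1 : end]
    return crc16_xmodem(key.encode()) % 16384
```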

Cache Fill Strategies

Cache-Aside (Lazy Loading)

The application checks the cache first. On a miss it reads from the DB, then writes the result back to the cache. Reads are slightly slower on the first access, but the cache only holds data that’s actually been requested. This is the most common pattern.
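A minimal sketch of the cache-aside read path, with an in-process dict standing in for the remote cache (a real deployment would call a Redis client instead) and a hypothetical `db_fetch` loader for the primary store:

```python
import time

class CacheAsideReader:
    def __init__(self, db_fetch, ttl_seconds=300):
        self._db_fetch = db_fetch   # hypothetical loader hitting the primary DB
        self._ttl = ttl_seconds
        self._store = {}            # dict stands in for a Redis client here

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]                       # hit: serve from cache
        value = self._db_fetch(key)               # miss: read from the DB...
        self._store[key] = (value, time.monotonic() + self._ttl)  # ...then populate
        return value
```

Note that the application owns the fill logic: the cache itself never talks to the database, which is what distinguishes cache-aside from read-through.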

Write-Through

Every write goes to the cache and the DB synchronously before the write is acknowledged. The cache is always consistent with the DB, but write latency increases and the cache fills with data that may never be read.

Write-Behind (Write-Back)

Writes go to the cache immediately and are flushed to the DB asynchronously. Extremely low write latency, but if the cache node crashes before flush, those writes are lost. Only appropriate for workloads that can tolerate some data loss (metrics, counters).
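A write-behind buffer can be sketched as below, assuming a hypothetical `db_write_batch` bulk writer that a background thread invokes periodically; note how repeated writes to one key coalesce into a single DB write, and how anything still in `_dirty` is lost if the process dies before a flush:

```python
import threading

class WriteBehindCache:
    def __init__(self, db_write_batch):
        self._db_write_batch = db_write_batch  # hypothetical bulk writer to the DB
        self._cache = {}
        self._dirty = {}                       # pending writes not yet flushed
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:              # write is acknowledged after this block
            self._cache[key] = value
            self._dirty[key] = value  # coalesces repeated writes to one key

    def get(self, key):
        with self._lock:
            return self._cache.get(key)

    def flush(self):
        """Called periodically by a background thread; unflushed writes are lost on crash."""
        with self._lock:
            pending, self._dirty = self._dirty, {}
        if pending:
            self._db_write_batch(pending)
```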

TTL Management and Eviction Policies

Every cached item should carry a TTL. Without it, stale data lingers indefinitely. TTL values are a trade-off: short TTLs increase DB load; long TTLs increase staleness.

When the cache is full, an eviction policy decides what to remove:

  • LRU (Least Recently Used) — evicts the item that hasn’t been accessed for the longest time. Good general-purpose default.
  • LFU (Least Frequently Used) — evicts the item with the lowest access count. Better for skewed access patterns where a recently added item isn’t necessarily hot.
  • TTL-based passive expiry — items expire on access if their TTL has passed; Redis also does background sampling to proactively delete expired keys.

Redis exposes several eviction modes (allkeys-lru, volatile-lru, allkeys-lfu, noeviction — the last rejects writes once memory is full). Choose allkeys-lru for general caches where all keys should be eviction candidates.
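The LRU policy described above can be illustrated with an OrderedDict, whose insertion order doubles as recency order; this is a teaching sketch, not how Redis implements it (Redis uses approximate LRU via random sampling to save memory):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self._capacity = capacity
        self._items = OrderedDict()   # insertion order doubles as recency order

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)   # evict the least recently used
```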

Cache Stampede Prevention

A cache stampede (thundering herd) occurs when a popular key expires and many concurrent requests simultaneously miss the cache, all hitting the DB at once.

  • Mutex / single-flight — the first goroutine (or thread) to detect a miss acquires a lock and fetches from DB. Other requesters wait and then read from the freshly populated cache. Simple but adds latency for waiters.
  • Probabilistic early expiry (XFetch) — before a key actually expires, each request has a growing probability of refreshing it early. Recompute if current_time − delta × beta × log(rand()) ≥ expiry_time, where rand() is uniform in (0, 1], delta is the time the recomputation takes, and beta > 1 biases toward earlier refreshes. No locks required; works well for high-traffic keys.
  • Staggered TTLs — add random jitter to TTLs (e.g., base TTL ± 10%) so keys in a batch don’t all expire simultaneously.
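The XFetch check and TTL jitter above can be sketched as follows; the parameter names mirror the formula, and `beta=1.0` is a neutral default:

```python
import math
import random

def should_refresh_early(now, expiry_time, delta, beta=1.0):
    # XFetch: 1 - random.random() is uniform in (0, 1], avoiding log(0).
    # -log(u) >= 0, so the effective "time" is pushed forward and
    # occasionally crosses the expiry before the key actually expires.
    return now - delta * beta * math.log(1.0 - random.random()) >= expiry_time

def jittered_ttl(base_ttl, spread=0.10):
    # Staggered TTLs: base TTL +/- 10% so batch-loaded keys don't expire together.
    return base_ttl * (1.0 + random.uniform(-spread, spread))
```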

Hot Key Problem and L1 Local Cache

A single hot key (e.g., a celebrity’s profile) can overwhelm one cache shard even in a cluster, since consistent hashing always routes that key to the same node.

Solutions:

  • Key replication — store the hot key under multiple names (key#0, key#1, … key#N) and randomly route reads across them. Writes must update all replicas.
  • In-process L1 cache — each application instance keeps a small in-memory cache (Caffeine in Java, cachetools’ TTLCache in Python) with a short TTL (1–5 seconds). The distributed cache becomes L2. L1 absorbs the burst; consistency is slightly relaxed but usually acceptable.
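The key replication scheme above can be sketched as follows; the fan-out of 8 is an assumption to tune against the hot key's read rate, and a plain dict stands in for the cache client:

```python
import random

REPLICA_FAN_OUT = 8   # an assumption; tune to the hot key's read rate

def write_hot_key(cache_set, key, value, n=REPLICA_FAN_OUT):
    # Writes must update every replica name, or readers may see stale copies.
    for i in range(n):
        cache_set(f"{key}#{i}", value)

def read_hot_key(cache_get, key, n=REPLICA_FAN_OUT):
    # Random routing spreads reads over n slots, so consistent hashing
    # distributes them across shards instead of hammering one node.
    return cache_get(f"{key}#{random.randrange(n)}")
```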

Multi-Region Caching

In a multi-region deployment, each region has its own cache cluster. Cross-region cache replication is possible (Redis active-active with CRDTs) but introduces replication lag — writes in region A may not be visible in region B for tens to hundreds of milliseconds.

Design choices:

  • Accept eventual consistency for reads — most user-facing data tolerates brief staleness.
  • Write to origin only, let replication propagate. On cache miss in a remote region, read from the local DB replica rather than crossing regions.
  • Use short region-local TTLs so stale entries age out quickly; worst-case staleness is then bounded by roughly the TTL plus the replication lag.

Redis Cluster Internals

Redis Cluster partitions 16,384 hash slots across master nodes. Each master can have one or more replicas. Clients are given a slot map on connect and can route directly to the right node. On a MOVED redirect, the client updates its slot map. Cluster topology changes (node failure, resharding) are gossiped among nodes. For operational simplicity, managed offerings (ElastiCache, Redis Cloud) handle slot management automatically.
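On a MOVED redirect the error payload names the new owning node, and the client updates its slot map accordingly. The error format (`MOVED <slot> <host>:<port>`) is from the Redis Cluster specification; the slot-map structure below is a simplified assumption:

```python
def parse_moved(error_message: str):
    """Parse a Redis Cluster MOVED error, e.g. 'MOVED 3999 127.0.0.1:6381'."""
    verb, slot, addr = error_message.split()
    assert verb == "MOVED"
    host, port = addr.rsplit(":", 1)   # rsplit keeps colon-bearing hosts intact
    return int(slot), host, int(port)

def handle_moved(slot_map, error_message):
    # Simplified slot map: {slot: (host, port)}. Point the slot at its new
    # owner so the next request for this slot routes directly.
    slot, host, port = parse_moved(error_message)
    slot_map[slot] = (host, port)
    return slot_map
```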

Key Interview Talking Points

  • Always specify the cache fill strategy and justify it against the read/write ratio.
  • Mention TTL and eviction together — one without the other is incomplete.
  • Stampede prevention shows you’ve thought about high-concurrency edge cases.
  • Bring up the hot key problem for any high-traffic system and propose L1 local caching.
  • For multi-region, acknowledge replication lag and state your consistency model explicitly.

Frequently Asked Questions

What is a distributed caching layer in system design?

A distributed caching layer is a horizontally scalable tier of in-memory stores — such as Redis or Memcached — that sits between your application servers and your primary database. It stores frequently accessed data so that repeated reads can be served in microseconds rather than hitting slower persistent storage. The cache is distributed across multiple nodes to avoid a single point of failure and to scale read throughput beyond what a single machine can handle.

What is the difference between cache-aside, write-through, and write-behind caching?

Cache-aside (lazy loading) has the application check the cache first; on a miss it fetches from the database and populates the cache itself. Write-through writes data to the cache and the database synchronously on every write, keeping them consistent at the cost of higher write latency. Write-behind (write-back) writes to the cache immediately and asynchronously flushes to the database later, improving write throughput but risking data loss if the cache node fails before the flush completes.

How do you prevent cache stampede in a high-traffic system?

Cache stampede happens when many requests simultaneously miss a cold or expired key and all attempt to rebuild it at once. Common mitigations include: probabilistic early expiration (recompute before TTL expires based on a random threshold), mutex/locking so only one request rebuilds the key while others wait or serve stale data, and request coalescing at the application layer. Setting a short jitter on TTLs across keys also prevents synchronized mass expirations.

How do you handle hot keys in a Redis cluster?

Hot keys are individual keys that receive disproportionately high read or write traffic, creating a bottleneck on a single shard. Strategies include: local in-process caching at the application tier (L1 cache) so most requests never reach Redis; key sharding by appending a suffix (e.g., key:0 through key:N) and reading from a random shard; and read replicas dedicated to specific hot keys. For write-heavy hot keys, consider a counter aggregation pattern where increments are batched locally and periodically flushed.
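The counter aggregation pattern mentioned above can be sketched as a local accumulator; `flush_increment` is a hypothetical sink (e.g., a pipelined INCRBY against Redis), and a background timer would call `flush` every few seconds:

```python
import threading
from collections import Counter

class BatchedCounter:
    """Accumulates increments locally; each flush sends one increment per key."""
    def __init__(self, flush_increment):
        self._flush_increment = flush_increment  # hypothetical (key, n) sink
        self._pending = Counter()
        self._lock = threading.Lock()

    def incr(self, key, n=1):
        with self._lock:
            self._pending[key] += n   # local only: no network round-trip

    def flush(self):
        # Swap the buffer under the lock, then push without holding it,
        # so increments keep landing while the flush is in flight.
        with self._lock:
            pending, self._pending = self._pending, Counter()
        for key, n in pending.items():
            self._flush_increment(key, n)
```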

