Caching is the single highest-leverage optimization in most systems. A well-designed cache can cut database load by 90% and reduce p99 latency from seconds to milliseconds. This post covers the full design space: write strategies, eviction policies, sharding, stampede prevention, and multi-tier architectures.
When to Cache
Cache when:
- Read-to-write ratio is high (10:1 or more) – same data read repeatedly
- Computations are expensive (aggregations, ML inference, rendering)
- DB is the bottleneck and reads dominate
- You need p99 latency <10ms – disk I/O cannot deliver this consistently
Do not cache when data changes on every read, consistency is critical (financial transactions), or data is user-specific and rarely re-read.
Cache-Aside (Lazy Loading)
The most common caching pattern. The application manages cache population.
import json

def get_user(user_id: int) -> dict | None:
    # 1. Check cache
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # 2. Cache miss - read from DB
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    if not user:
        return None
    # 3. Populate cache with TTL
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    return user
def update_user(user_id: int, data: dict):
    db.execute("UPDATE users SET ... WHERE id = %s", user_id)
    # Invalidate cache - do NOT update cache here (race condition risk)
    redis.delete(f"user:{user_id}")
Risk – cache stampede on cold start: If cache is empty and 10K requests hit simultaneously, all 10K go to DB. Solutions: pre-warm cache on startup, use mutex locking on miss (see thundering herd section).
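Pre-warming can be as simple as bulk-loading the likely-hot rows before the service takes traffic. A minimal sketch, assuming the same Redis/DB handles used above and a hypothetical `last_active` column as the hotness heuristic:

```python
import json

def prewarm_cache(redis, db, limit=1000):
    """Load the likely-hottest rows into the cache before taking traffic."""
    # Hypothetical heuristic: recently active users are likely to be read first
    rows = db.query(
        "SELECT * FROM users ORDER BY last_active DESC LIMIT %s", limit
    )
    for user in rows:
        redis.setex(f"user:{user['id']}", 3600, json.dumps(user))
    return len(rows)
```

Run this once at startup (or after a cache flush) so the first wave of requests hits warm keys instead of the database.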
Risk – stale reads: Between the delete and the next cache-aside read, a racing reader that fetched from the DB before the update can write the old value back after the delete. Use short TTLs as a safety net so such stale entries age out.
Write-Through
Write to cache and DB simultaneously. Cache is always consistent with DB.
def update_user_write_through(user_id: int, data: dict):
    # Write to DB
    db.execute("UPDATE users SET ... WHERE id = %s", user_id)
    # Write to cache immediately (not delete - write the new value)
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
Pros: Cache always has fresh data, no cache misses after write.
Cons: Higher write latency (two writes per operation), cache polluted with data that may never be read.
Best for: Read-heavy workloads where the same data is written then frequently read (user profiles, product details).
Write-Behind (Write-Back)
Write to cache first, flush to DB asynchronously. Lowest write latency.
import asyncio

class WriteBehindCache:
    def __init__(self, flush_interval_sec=5):
        self.dirty = {}  # key -> value, not yet flushed
        self.flush_interval = flush_interval_sec

    def write(self, key: str, value: dict):
        redis.set(key, json.dumps(value))
        self.dirty[key] = value  # mark as dirty

    async def flush_loop(self):
        while True:
            await asyncio.sleep(self.flush_interval)
            if self.dirty:
                batch = dict(self.dirty)
                self.dirty.clear()
                for key, value in batch.items():
                    db.upsert(key, value)  # batch write to DB
Pros: Write latency = cache RTT only (~1ms vs 10-50ms for DB).
Cons: Data loss if cache node fails before flush, complex failure recovery, not suitable where DB is the system of record.
Best for: High-write workloads where some data loss is acceptable (analytics counters, view counts, session data).
LRU Implementation
LRU evicts the least recently used entry. O(1) get and put using doubly linked list + hash map.
class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = {}  # key -> node
        # Sentinel head and tail for the doubly linked list
        self.head = {'key': None, 'val': None, 'prev': None, 'next': None}
        self.tail = {'key': None, 'val': None, 'prev': None, 'next': None}
        self.head['next'] = self.tail
        self.tail['prev'] = self.head

    def _remove(self, node):
        node['prev']['next'] = node['next']
        node['next']['prev'] = node['prev']

    def _insert_front(self, node):
        node['next'] = self.head['next']
        node['prev'] = self.head
        self.head['next']['prev'] = node
        self.head['next'] = node

    def get(self, key: int) -> int:
        if key not in self.cache:
            return -1
        node = self.cache[key]
        self._remove(node)
        self._insert_front(node)
        return node['val']

    def put(self, key: int, value: int):
        if key in self.cache:
            self._remove(self.cache[key])
        node = {'key': key, 'val': value, 'prev': None, 'next': None}
        self.cache[key] = node
        self._insert_front(node)
        if len(self.cache) > self.capacity:
            lru = self.tail['prev']
            self._remove(lru)
            del self.cache[lru['key']]
Python shortcut: from collections import OrderedDict – move_to_end() and popitem(last=False) give you LRU in ~10 lines. Use the full implementation above if the interviewer wants to see the data structure.
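For reference, the OrderedDict version looks like this (same get/put semantics as the full implementation above):

```python
from collections import OrderedDict

class LRUCacheOD:
    """LRU cache via OrderedDict: insertion order doubles as recency order."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key: int) -> int:
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def put(self, key: int, value: int):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
```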
LFU vs LRU
LFU (Least Frequently Used) evicts the entry with the lowest access count. Resistant to “scan pollution” where a one-time sequential scan evicts hot data.
from collections import defaultdict

class LFUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.key_val = {}   # key -> value
        self.key_freq = {}  # key -> frequency
        # freq -> dict of keys, used as an ordered set for LRU within the same
        # frequency (Python 3.7+ dicts preserve insertion order)
        self.freq_keys = defaultdict(dict)
        self.min_freq = 0

    def _update_freq(self, key):
        freq = self.key_freq[key]
        del self.freq_keys[freq][key]
        if not self.freq_keys[freq] and freq == self.min_freq:
            self.min_freq += 1
        self.key_freq[key] = freq + 1
        self.freq_keys[freq + 1][key] = None

    def get(self, key: int) -> int:
        if key not in self.key_val:
            return -1
        self._update_freq(key)
        return self.key_val[key]

    def put(self, key: int, value: int):
        if self.capacity <= 0:
            return
        if key in self.key_val:
            # Existing key: update the value and bump its frequency
            self.key_val[key] = value
            self._update_freq(key)
            return
        if len(self.key_val) >= self.capacity:
            # Evict least frequent (LRU among ties)
            evict_key = next(iter(self.freq_keys[self.min_freq]))
            del self.freq_keys[self.min_freq][evict_key]
            del self.key_val[evict_key]
            del self.key_freq[evict_key]
        self.key_val[key] = value
        self.key_freq[key] = 1
        self.freq_keys[1][key] = None
        self.min_freq = 1
LRU vs LFU choice: LRU for general caches where recency predicts future access. LFU for caches where some items are fundamentally “hotter” (e.g., a product page for a viral item vs a niche item). Real systems like Redis use approximated LRU for performance.
Consistent Hashing
Distributes keys across cache nodes so that adding/removing a node only remaps k/n keys (k = total keys, n = number of nodes).
import hashlib
from bisect import bisect_right, insort

class ConsistentHashRing:
    def __init__(self, virtual_nodes=150):
        self.ring = []   # sorted list of virtual node hashes
        self.nodes = {}  # hash -> node_id
        self.virtual_nodes = virtual_nodes

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node_id: str):
        for i in range(self.virtual_nodes):
            h = self._hash(f"{node_id}#{i}")
            insort(self.ring, h)
            self.nodes[h] = node_id

    def remove_node(self, node_id: str):
        for i in range(self.virtual_nodes):
            h = self._hash(f"{node_id}#{i}")
            self.ring.remove(h)
            del self.nodes[h]

    def get_node(self, key: str) -> str:
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect_right(self.ring, h) % len(self.ring)
        return self.nodes[self.ring[idx]]
Virtual nodes: Without virtual nodes, if you have 3 physical nodes they each own ~33% of the ring but the boundaries may be uneven due to hash distribution. With 150 virtual nodes per physical node, load distribution is much more uniform. Adding a node only moves ~1/n of keys on average.
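A quick way to verify the remap claim is to measure it. This self-contained sketch repeats the ring logic above (compacted) and counts how many of 10,000 keys move when a fourth node joins; with 4 nodes, roughly 1/4 should move, all of them to the new node:

```python
import hashlib
from bisect import bisect_right, insort

class Ring:
    """Compact copy of the ConsistentHashRing above, for a runnable demo."""
    def __init__(self, vnodes=150):
        self.ring, self.nodes, self.vnodes = [], {}, vnodes
    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
    def add_node(self, node_id):
        for i in range(self.vnodes):
            h = self._hash(f"{node_id}#{i}")
            insort(self.ring, h)
            self.nodes[h] = node_id
    def get_node(self, key):
        h = self._hash(key)
        idx = bisect_right(self.ring, h) % len(self.ring)
        return self.nodes[self.ring[idx]]

ring = Ring()
for n in ("cache-a", "cache-b", "cache-c"):
    ring.add_node(n)
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("cache-d")  # fourth node joins
moved = sum(1 for k in keys if ring.get_node(k) != before[k])
# Expect moved/len(keys) to be close to 1/4
print(f"moved {moved} of {len(keys)} keys ({moved / len(keys):.0%})")
```

With simple modulo hashing (`hash(key) % n`), the same experiment would move roughly 3/4 of the keys.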
Cache Stampede and Thundering Herd Prevention
When a popular key expires, hundreds of concurrent requests miss the cache and hammer the DB simultaneously.
Solution 1 – Mutex/Lock:
import time

def get_with_mutex(key: str) -> str:
    value = redis.get(key)
    if value:
        return value
    lock_key = f"lock:{key}"
    acquired = redis.set(lock_key, "1", nx=True, ex=10)  # nx = only set if not exists
    if acquired:
        # This process recomputes; release the lock even if the query fails
        try:
            value = db.query(...)
            redis.setex(key, 3600, value)
        finally:
            redis.delete(lock_key)
        return value
    else:
        # Another process is recomputing - wait briefly and retry
        time.sleep(0.1)
        return redis.get(key) or db.query(...)  # fallback if still not ready
Solution 2 – Probabilistic Early Expiry (XFetch):
import math, random, time

def get_with_early_expiry(key: str, beta: float = 1.0):
    """
    Probabilistically refresh before TTL expires.
    Higher beta = more aggressive early refresh.
    """
    # get_with_metadata is assumed to return (value, absolute expiry time,
    # delta = how long the last recompute took)
    value, expiry, delta = redis.get_with_metadata(key)
    if value is None:
        return recompute_and_cache(key)
    # XFetch: refresh early with probability that rises as expiry approaches
    remaining = expiry - time.time()
    if -delta * beta * math.log(random.random()) >= remaining:
        # Recompute early (before expiry, under low load)
        return recompute_and_cache(key)
    return value
Solution 3 – Background refresh: A background job refreshes keys before they expire. No request ever sees a miss for hot keys.
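A sketch of the background-refresh approach, assuming a known set of hot keys, a caller-supplied `recompute` function (e.g. a DB query), and a Redis-style `setex` interface:

```python
import asyncio

def refresh_hot_keys(hot_keys, recompute, cache, ttl=300):
    """One refresh pass: recompute each hot key and reset its TTL."""
    for key in hot_keys:
        cache.setex(key, ttl, recompute(key))

async def refresh_loop(hot_keys, recompute, cache, interval=60):
    # Runs forever as a background task; as long as interval < ttl,
    # readers never see a cold hot key
    while True:
        refresh_hot_keys(hot_keys, recompute, cache)
        await asyncio.sleep(interval)
```

The catch is knowing which keys are hot: the list must come from access metrics or be declared up front, and every refreshed key costs a recompute whether or not it is read again.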
TTL Strategy
- Short TTL (seconds to minutes): Data is always fresh, but more DB load from frequent misses. Good for frequently-updated data (prices, inventory).
- Long TTL (hours to days): Low DB load, but stale data risk. Good for rarely-changing data (product descriptions, user profiles).
- Sliding TTL: Reset TTL on every access. Hot keys stay cached indefinitely. Risk: stale data for very hot keys that are never evicted.
- No TTL (explicit invalidation): Cache is authoritative until explicitly deleted. Low miss rate but requires careful invalidation logic on every write path.
Rule of thumb: Start with TTL = acceptable staleness. If users can tolerate 5-minute-old prices, TTL = 300 seconds. Then tune based on cache hit rate and DB load metrics.
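One refinement worth adding to any of these strategies: random jitter on the TTL, so keys cached in the same burst don't all expire in the same instant and trigger a mass-expiry stampede. A sketch, assuming a Redis-style client:

```python
import json
import random

def setex_with_jitter(redis, key, value, base_ttl=300, jitter_frac=0.1):
    """Spread expirations over +/- jitter_frac of the base TTL."""
    jitter = int(base_ttl * jitter_frac)
    ttl = base_ttl + random.randint(-jitter, jitter)
    redis.setex(key, ttl, json.dumps(value))
    return ttl
```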
Multi-Tier Caching
import time

class MultiTierCache:
    def __init__(self):
        self.l1 = {}  # in-process dict, ~1000 hottest keys, TTL=30s
        self.l1_expiry = {}
        self.l2 = redis_client  # Redis cluster, millions of keys, TTL=3600s

    def get(self, key: str):
        # L1 check (microseconds)
        if key in self.l1:
            if time.time() < self.l1_expiry[key]:
                return self.l1[key]
            else:
                del self.l1[key]
                del self.l1_expiry[key]
        # L2 check (sub-millisecond over LAN)
        value = self.l2.get(key)
        if value:
            # Populate L1 for next hit
            self.l1[key] = value
            self.l1_expiry[key] = time.time() + 30
            return value
        # L3: DB (milliseconds)
        value = db.query(key)
        if value:
            self.l2.setex(key, 3600, value)
            self.l1[key] = value
            self.l1_expiry[key] = time.time() + 30
        return value
L1 serves the hottest 0.1% of keys (Pareto principle – 80% of traffic). L1 TTL is short (30s) to prevent stale data buildup. This eliminates Redis RTT for the hottest keys.
Scale Numbers
- Redis single node: 100K-200K ops/sec, ~100GB RAM on a large instance
- Redis Cluster: Linear horizontal scaling, 1M+ ops/sec across a cluster
- Read replicas: For read-heavy workloads, replicas serve reads (with eventual consistency lag)
- Typical cache hit rate target: 95%+ for read-heavy services; below 90% the cache is not pulling its weight
- Memory overhead: Redis uses ~70 bytes overhead per key. For 10M keys that is ~700MB overhead before values.
- Network bottleneck before CPU: At 100K QPS with 1KB values, that is 100MB/s network throughput – plan accordingly
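The last two bullets are quick back-of-envelope arithmetic:

```python
# Memory overhead: ~70 bytes of Redis bookkeeping per key
keys = 10_000_000
overhead_mb = keys * 70 / 1e6
print(f"key overhead: ~{overhead_mb:.0f} MB before values")  # ~700 MB

# Network: 100K QPS with 1KB values
qps, value_bytes = 100_000, 1_000
throughput_mb_s = qps * value_bytes / 1e6
print(f"network: ~{throughput_mb_s:.0f} MB/s")  # ~100 MB/s
```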
The fundamental tradeoff in cache design: consistency vs performance. Every pattern above makes a different point on that curve. Cache-aside with short TTL is a moderate compromise; write-through maximizes consistency; write-behind maximizes performance. Choose based on your application’s tolerance for staleness.