Distributed Cache System Low-Level Design

Caching is the single highest-leverage optimization in most systems. A well-designed cache can cut database load by 90% and reduce p99 latency from seconds to milliseconds. This post covers the full design space: write strategies, eviction policies, sharding, stampede prevention, and multi-tier architectures.

When to Cache

Cache when:

  • Read-to-write ratio is high (10:1 or more) – same data read repeatedly
  • Computations are expensive (aggregations, ML inference, rendering)
  • DB is the bottleneck and reads dominate
  • You need p99 latency <10ms – disk I/O cannot deliver this consistently

Do not cache when data changes on every read, consistency is critical (financial transactions), or data is user-specific and rarely re-read.

Cache-Aside (Lazy Loading)

The most common caching pattern. The application manages cache population.

def get_user(user_id: int) -> dict:
    # 1. Check cache
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # 2. Cache miss - read from DB
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    if not user:
        return None

    # 3. Populate cache with TTL
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    return user

def update_user(user_id: int, data: dict):
    db.execute("UPDATE users SET ... WHERE id = %s", user_id)
    # Invalidate cache - do NOT update cache here (race condition risk)
    redis.delete(f"user:{user_id}")

Risk – cache stampede on cold start: If cache is empty and 10K requests hit simultaneously, all 10K go to DB. Solutions: pre-warm cache on startup, use mutex locking on miss (see thundering herd section).

Risk – stale reads: Between delete and next read, a concurrent reader may cache the old value if reads and deletes race. Use short TTLs as a safety net.

Write-Through

Write to cache and DB simultaneously. Cache is always consistent with DB.

def update_user_write_through(user_id: int, data: dict):
    # Write to DB
    db.execute("UPDATE users SET ... WHERE id = %s", user_id)
    # Write to cache immediately (not delete - write the new value)
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))

Pros: Cache always has fresh data, no cache misses after write.
Cons: Higher write latency (two writes per operation), cache polluted with data that may never be read.
Best for: Read-heavy workloads where the same data is written then frequently read (user profiles, product details).

Write-Behind (Write-Back)

Write to cache first, flush to DB asynchronously. Lowest write latency.

import asyncio
from collections import defaultdict

class WriteBehindCache:
    def __init__(self, flush_interval_sec=5):
        self.dirty = {}  # key -> value, not yet flushed
        self.flush_interval = flush_interval_sec

    def write(self, key: str, value: dict):
        redis.set(key, json.dumps(value))
        self.dirty[key] = value  # mark as dirty

    async def flush_loop(self):
        while True:
            await asyncio.sleep(self.flush_interval)
            if self.dirty:
                batch = dict(self.dirty)
                self.dirty.clear()
                for key, value in batch.items():
                    db.upsert(key, value)  # batch write to DB

Pros: Write latency = cache RTT only (~1ms vs 10-50ms for DB).
Cons: Data loss if cache node fails before flush, complex failure recovery, not suitable where DB is the system of record.
Best for: High-write workloads where some data loss is acceptable (analytics counters, view counts, session data).

LRU Implementation

LRU evicts the least recently used entry. O(1) get and put using doubly linked list + hash map.

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = {}  # key -> node
        # Sentinel head and tail for the doubly linked list
        self.head = {'key': None, 'val': None, 'prev': None, 'next': None}
        self.tail = {'key': None, 'val': None, 'prev': None, 'next': None}
        self.head['next'] = self.tail
        self.tail['prev'] = self.head

    def _remove(self, node):
        node['prev']['next'] = node['next']
        node['next']['prev'] = node['prev']

    def _insert_front(self, node):
        node['next'] = self.head['next']
        node['prev'] = self.head
        self.head['next']['prev'] = node
        self.head['next'] = node

    def get(self, key: int) -> int:
        if key not in self.cache:
            return -1
        node = self.cache[key]
        self._remove(node)
        self._insert_front(node)
        return node['val']

    def put(self, key: int, value: int):
        if key in self.cache:
            self._remove(self.cache[key])
        node = {'key': key, 'val': value, 'prev': None, 'next': None}
        self.cache[key] = node
        self._insert_front(node)
        if len(self.cache) > self.capacity:
            lru = self.tail['prev']
            self._remove(lru)
            del self.cache[lru['key']]

Python shortcut: from collections import OrderedDict – move_to_end() and popitem(last=False) give you LRU in ~10 lines. Use the full implementation above if the interviewer wants to see the data structure.

LFU vs LRU

LFU (Least Frequently Used) evicts the entry with the lowest access count. Resistant to “scan pollution” where a one-time sequential scan evicts hot data.

from collections import defaultdict

class LFUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.key_val = {}       # key -> value
        self.key_freq = {}      # key -> frequency
        self.freq_keys = defaultdict(dict)  # freq -> OrderedDict of keys (for LRU within same freq)
        self.min_freq = 0
        # Using dict as ordered set (Python 3.7+ preserves insertion order)

    def _update_freq(self, key):
        freq = self.key_freq[key]
        del self.freq_keys[freq][key]
        if not self.freq_keys[freq] and freq == self.min_freq:
            self.min_freq += 1
        self.key_freq[key] = freq + 1
        self.freq_keys[freq + 1][key] = None

    def get(self, key: int) -> int:
        if key not in self.key_val:
            return -1
        self._update_freq(key)
        return self.key_val[key]

    def put(self, key: int, value: int):
        if self.capacity = self.capacity:
            # Evict least frequent (LRU among ties)
            evict_key, _ = next(iter(self.freq_keys[self.min_freq].items()))
            del self.freq_keys[self.min_freq][evict_key]
            del self.key_val[evict_key]
            del self.key_freq[evict_key]
        self.key_val[key] = value
        self.key_freq[key] = 1
        self.freq_keys[1][key] = None
        self.min_freq = 1

LRU vs LFU choice: LRU for general caches where recency predicts future access. LFU for caches where some items are fundamentally “hotter” (e.g., a product page for a viral item vs a niche item). Real systems like Redis use approximated LRU for performance.

Consistent Hashing

Distributes keys across cache nodes so that adding/removing a node only remaps k/n keys (k = total keys, n = number of nodes).

import hashlib
from bisect import bisect_right, insort

class ConsistentHashRing:
    def __init__(self, virtual_nodes=150):
        self.ring = []         # sorted list of virtual node hashes
        self.nodes = {}        # hash -> node_id
        self.virtual_nodes = virtual_nodes

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node_id: str):
        for i in range(self.virtual_nodes):
            h = self._hash(f"{node_id}#{i}")
            insort(self.ring, h)
            self.nodes[h] = node_id

    def remove_node(self, node_id: str):
        for i in range(self.virtual_nodes):
            h = self._hash(f"{node_id}#{i}")
            self.ring.remove(h)
            del self.nodes[h]

    def get_node(self, key: str) -> str:
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect_right(self.ring, h) % len(self.ring)
        return self.nodes[self.ring[idx]]

Virtual nodes: Without virtual nodes, if you have 3 physical nodes they each own ~33% of the ring but the boundaries may be uneven due to hash distribution. With 150 virtual nodes per physical node, load distribution is much more uniform. Adding a node only moves ~1/n of keys on average.

Cache Stampede and Thundering Herd Prevention

When a popular key expires, hundreds of concurrent requests miss the cache and hammer the DB simultaneously.

Solution 1 – Mutex/Lock:

def get_with_mutex(key: str) -> str:
    value = redis.get(key)
    if value:
        return value

    lock_key = f"lock:{key}"
    acquired = redis.set(lock_key, "1", nx=True, ex=10)  # nx=only set if not exists

    if acquired:
        # This process recomputes
        value = db.query(...)
        redis.setex(key, 3600, value)
        redis.delete(lock_key)
        return value
    else:
        # Another process is recomputing - wait briefly and retry
        time.sleep(0.1)
        return redis.get(key) or db.query(...)  # fallback if still not ready

Solution 2 – Probabilistic Early Expiry (XFetch):

import math, random, time

def get_with_early_expiry(key: str, beta: float = 1.0):
    """
    Probabilistically refresh before TTL expires.
    Higher beta = more aggressive early refresh.
    """
    value, expiry, delta = redis.get_with_metadata(key)
    if value is None:
        return recompute_and_cache(key)

    # XFetch: refresh early with probability proportional to remaining TTL
    remaining = expiry - time.time()
    if -delta * beta * math.log(random.random()) >= remaining:
        # Recompute early (before expiry, under low load)
        return recompute_and_cache(key)

    return value

Solution 3 – Background refresh: A background job refreshes keys before they expire. No request ever sees a miss for hot keys.

TTL Strategy

  • Short TTL (seconds to minutes): Data is always fresh, but more DB load from frequent misses. Good for frequently-updated data (prices, inventory).
  • Long TTL (hours to days): Low DB load, but stale data risk. Good for rarely-changing data (product descriptions, user profiles).
  • Sliding TTL: Reset TTL on every access. Hot keys stay cached indefinitely. Risk: stale data for very hot keys that are never evicted.
  • No TTL (explicit invalidation): Cache is authoritative until explicitly deleted. Low miss rate but requires careful invalidation logic on every write path.

Rule of thumb: Start with TTL = acceptable staleness. If users can tolerate 5-minute-old prices, TTL = 300 seconds. Then tune based on cache hit rate and DB load metrics.

Multi-Tier Caching

class MultiTierCache:
    def __init__(self):
        self.l1 = {}           # in-process dict, ~1000 hottest keys, TTL=30s
        self.l1_expiry = {}
        self.l2 = redis_client  # Redis cluster, millions of keys, TTL=3600s

    def get(self, key: str):
        # L1 check (microseconds)
        if key in self.l1:
            if time.time() < self.l1_expiry[key]:
                return self.l1[key]
            else:
                del self.l1[key]
                del self.l1_expiry[key]

        # L2 check (sub-millisecond over LAN)
        value = self.l2.get(key)
        if value:
            # Populate L1 for next hit
            self.l1[key] = value
            self.l1_expiry[key] = time.time() + 30
            return value

        # L3: DB (milliseconds)
        value = db.query(key)
        if value:
            self.l2.setex(key, 3600, value)
            self.l1[key] = value
            self.l1_expiry[key] = time.time() + 30
        return value

L1 serves the hottest 0.1% of keys (Pareto principle – 80% of traffic). L1 TTL is short (30s) to prevent stale data buildup. This eliminates Redis RTT for the hottest keys.

Scale Numbers

  • Redis single node: 100K-200K ops/sec, ~100GB RAM on a large instance
  • Redis Cluster: Linear horizontal scaling, 1M+ ops/sec across a cluster
  • Read replicas: For read-heavy workloads, replicas serve reads (with eventual consistency lag)
  • Typical cache hit rate target: 95%+ for read-heavy services; below 90% the cache is not pulling its weight
  • Memory overhead: Redis uses ~70 bytes overhead per key. For 10M keys that is ~700MB overhead before values.
  • Network bottleneck before CPU: At 100K QPS with 1KB values, that is 100MB/s network throughput – plan accordingly

The fundamental tradeoff in cache design: consistency vs performance. Every pattern above makes a different point on that curve. Cache-aside with short TTL is a moderate compromise; write-through maximizes consistency; write-behind maximizes performance. Choose based on your application’s tolerance for staleness.

Uber system design covers distributed caching for ride dispatch and pricing. See design patterns for Uber interview: distributed cache system design.

Netflix uses distributed caching at massive scale. See system design patterns for Netflix interview: distributed cache and content delivery design.

Shopify system design covers caching for high-traffic commerce. See design patterns for Shopify interview: caching and e-commerce system design.

Scroll to Top