Caching is the single most impactful performance optimization in distributed systems. A well-designed caching layer can reduce database load by 10-100x and cut response latency from 50ms to sub-millisecond. This guide provides a deep technical dive into distributed caching — architecture, patterns, consistency challenges, and production pitfalls — essential knowledge for system design interviews and real-world architecture.
Redis vs Memcached: When to Use Each
Redis: an in-memory data structure server supporting strings, hashes, lists, sets, sorted sets, streams, and more. Supports persistence (RDB snapshots, AOF log), replication (primary-replica), clustering (Redis Cluster with automatic sharding), pub/sub messaging, Lua scripting, and transactions. Commands execute on a single-threaded event loop (Redis 6+ adds optional I/O threading for network operations). Typical latency: 0.1-0.5ms. Memcached: a high-performance, distributed memory caching system. Supports only key-value strings with a maximum value size of 1MB by default. No persistence, no replication (by design), no data structures beyond strings. Its multi-threaded architecture scales better on multi-core machines for simple get/set workloads. When to use Redis: you need data structures (sorted sets for leaderboards, lists for queues, hashes for objects), persistence, pub/sub, or atomic operations beyond simple get/set. When to use Memcached: you need a simple, high-throughput cache for HTML fragments, serialized objects, or database query results, and your workload is purely get/set with no need for persistence. In practice, Redis dominates — its versatility and feature set make it the default choice for most applications.
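The leaderboard use case can be mocked in a few lines of pure Python (no Redis server required); with a real client such as redis-py, the same operations map to the ZADD and ZREVRANGE commands:

```python
# Minimal in-memory mock of a Redis sorted-set leaderboard.
# With a real Redis client this corresponds to ZADD / ZREVRANGE ... WITHSCORES.

def zadd(board: dict, member: str, score: float) -> None:
    """ZADD: set (or update) a member's score."""
    board[member] = score

def zrevrange_withscores(board: dict, start: int, stop: int):
    """ZREVRANGE ... WITHSCORES: members ordered by score, highest first."""
    ranked = sorted(board.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[start:stop + 1]   # Redis ranges are inclusive

leaderboard = {}
zadd(leaderboard, "alice", 310)
zadd(leaderboard, "bob", 120)
zadd(leaderboard, "carol", 450)
top2 = zrevrange_withscores(leaderboard, 0, 1)
```

The point of the sketch is the API shape: Redis maintains the score ordering server-side, so a top-N query is O(log N + N) rather than a full sort per request.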
Caching Patterns: Cache-Aside, Write-Through, Write-Behind
Cache-Aside (Lazy Loading): the application checks the cache first. On cache hit, return the cached value. On cache miss, query the database, populate the cache, and return the value. The application manages both the cache and the database. Pros: only requested data is cached (no wasted memory), cache failures do not prevent reads (fall back to database). Cons: cache miss penalty (extra round-trip), potential for stale data (cache may hold an outdated value after the database is updated). Write-Through: every write goes to both the cache and the database. The cache is always consistent with the database. Pros: cache is never stale. Cons: write latency increases (must write to both), data that is written but never read wastes cache memory. Write-Behind (Write-Back): writes go to the cache immediately and are asynchronously written to the database later (batched). Pros: low write latency (only cache write on the critical path), write batching reduces database load. Cons: risk of data loss if the cache fails before the async write completes. Read-Through: similar to cache-aside but the cache itself fetches from the database on a miss (the application only talks to the cache). Simplifies application code but requires cache library support.
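The cache-aside and write-through flows above can be sketched with plain dicts standing in for the cache and the database; this is an illustrative mock, not a production client:

```python
# Sketch of cache-aside vs write-through. Plain dicts stand in for the
# cache and the database so the flows are runnable without external services.

db = {}     # stands in for the database
cache = {}  # stands in for Redis/Memcached

def cache_aside_read(key):
    """Cache-aside: check the cache first; on miss, load from DB and populate."""
    if key in cache:
        return cache[key]           # cache hit
    value = db.get(key)             # cache miss: extra round-trip to the DB
    if value is not None:
        cache[key] = value          # populate for subsequent reads
    return value

def cache_aside_write(key, value):
    """Cache-aside write: update the DB, then invalidate (delete, not update)."""
    db[key] = value
    cache.pop(key, None)

def write_through(key, value):
    """Write-through: every write goes to both the cache and the database."""
    db[key] = value
    cache[key] = value
```

Note that the cache-aside write deletes rather than updates the cached entry, which is the delete-on-write pattern discussed under consistency below.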
Cache Eviction Policies
When the cache is full, an eviction policy decides which entries to remove. Common policies: (1) LRU (Least Recently Used) — evict the entry that has not been accessed for the longest time. Redis implements an approximated LRU using a sampling algorithm rather than exact ordering; note that the out-of-the-box maxmemory-policy is noeviction, so an LRU policy must be enabled explicitly. Good for most workloads where recently accessed items are likely to be accessed again. (2) LFU (Least Frequently Used) — evict the entry with the fewest accesses. Better than LRU for workloads with popular items that are accessed repeatedly (cache hot items even if they were not accessed in the last few seconds). Redis supports LFU since Redis 4.0. (3) TTL-based — entries expire after a time-to-live. Set TTL based on how quickly the underlying data changes: user profiles (TTL 1 hour), product prices (TTL 5 minutes), session data (TTL 30 minutes). (4) Random eviction — evict a random entry. Surprisingly effective for uniform access patterns and has no overhead for tracking access history. Redis maxmemory-policy options: allkeys-lru, volatile-lru (only evict keys with TTL set), allkeys-lfu, volatile-lfu, allkeys-random, noeviction (return error when memory is full). Choose allkeys-lru as the default for most applications.
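For intuition, here is a minimal exact-LRU cache built on an ordered dict. Redis approximates this by sampling a few candidate keys per eviction instead of maintaining the exact recency ordering shown here:

```python
# Tiny exact-LRU cache illustrating the eviction policy.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()       # insertion order = recency order

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)      # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry
```

Exact LRU costs a pointer update on every access; Redis's sampled approximation trades a little eviction accuracy for avoiding that per-access bookkeeping.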
Cache Stampede and Thundering Herd
Cache stampede occurs when a popular cache entry expires and many concurrent requests simultaneously miss the cache and hit the database. If 1000 requests per second access a cached product page, and the cache entry expires, all 1000 requests flood the database in the same second — potentially overwhelming it. Solutions: (1) Lock-based approach — when a cache miss occurs, acquire a distributed lock (Redis SETNX). Only one request queries the database and repopulates the cache. Other requests wait for the lock to be released and then read from the cache. (2) Early recomputation — refresh the cache entry before it expires. Track the TTL remaining and trigger a background refresh when TTL drops below a threshold (e.g., 20% remaining). The cache never fully expires. (3) Stale-while-revalidate — serve the stale cached value while asynchronously refreshing it in the background. The first request after TTL triggers the refresh; all requests continue receiving the stale value until the refresh completes. This is the HTTP Cache-Control: stale-while-revalidate pattern applied to application caching. (4) Probabilistic early expiration — each request has a small probability of refreshing the cache before the TTL expires. One well-known formulation refreshes when now - delta * beta * ln(random()) >= expiry, where delta is the time the recomputation takes and beta >= 1 tunes eagerness; because ln(random()) is negative, the condition becomes easier to satisfy as expiry approaches. This spreads refreshes randomly to avoid synchronization.
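The probabilistic early-expiration check can be sketched as a small helper. The delta and beta parameters follow the description above; the injectable rng parameter is an addition here to make the behavior testable:

```python
# Probabilistic early expiration: refresh when
#   now - delta * beta * ln(random()) >= expiry.
# ln(random()) is negative, so the subtracted term pushes "now" forward;
# the closer the entry is to expiry, the more likely the check fires.
import math
import random

def should_refresh(now, expiry, delta, beta=1.0, rng=random.random):
    """delta: observed recompute time (seconds); beta >= 1 refreshes more eagerly."""
    return now - delta * beta * math.log(rng()) >= expiry
```

Each request calls should_refresh before serving a cached value; exactly because the trigger is randomized, refreshes are spread across many requests instead of synchronizing on the expiry instant.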
Cache Consistency Challenges
The fundamental challenge: keeping the cache in sync with the database. Race condition scenario: Thread A reads from DB (gets value V1). Thread B updates DB to V2. Thread B invalidates cache. Thread A writes V1 to cache (stale!). Now the cache holds V1 while the database has V2. Solutions: (1) Cache invalidation (delete) instead of cache update — on database write, delete the cache entry rather than updating it. The next read will miss the cache and fetch the fresh value from the database. This narrows the race window substantially (a stale write now requires a read-miss interleaved exactly with a concurrent write-plus-delete) but does not eliminate it entirely, which is why the TTL safety net in (4) still matters. (2) Versioned cache entries — include a version number in the cache key or value. When updating, compare versions before writing to the cache. Reject stale writes. (3) Write-through with the database as source of truth — always read from the database for the most recent write, cache for subsequent reads. (4) Short TTLs as a safety net — even with invalidation, set a TTL (e.g., 5 minutes) so stale data self-corrects. The TTL bounds the staleness window. In practice, cache-aside with delete-on-write and a TTL safety net is the most common and reliable pattern.
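Versioned cache entries can be sketched as a compare-before-write helper; a dict stands in for the cache, and the version is assumed to come from the database (e.g., a row version or update timestamp):

```python
# Versioned cache entries: every write carries a monotonically increasing
# version, and the cache rejects writes older than what it already holds.

cache = {}  # key -> (version, value)

def versioned_set(key, version, value):
    """Store only if this version is newer than the cached one."""
    current = cache.get(key)
    if current is not None and current[0] >= version:
        return False                # stale write rejected
    cache[key] = (version, value)
    return True
```

In the race above, Thread A's late write of V1 carries an older version than Thread B's V2, so the compare rejects it and the cache stays fresh.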
Redis Cluster Architecture
Redis Cluster provides automatic sharding across multiple Redis nodes. The key space is divided into 16,384 hash slots. Each key is assigned to a slot: slot = CRC16(key) % 16384. Each node in the cluster is responsible for a subset of slots. A 3-node cluster: node A owns slots 0-5460, node B owns 5461-10922, node C owns 10923-16383. Each node has one or more replicas for fault tolerance. When a primary node fails, its replica is promoted automatically. Client routing: Redis Cluster clients (Jedis, redis-py, ioredis) maintain a slot-to-node mapping. The client hashes the key, determines the slot, and sends the command directly to the correct node. If the mapping is stale (after a resharding), the node responds with a MOVED redirect, and the client updates its mapping. Multi-key operations (MGET, transactions) require all keys to be on the same node. Use hash tags to force co-location: {user:123}:profile and {user:123}:settings hash to the same slot because Redis only hashes the content within curly braces.
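The slot computation is easy to reproduce: CRC16 (the XMODEM variant, polynomial 0x1021) of the key, mod 16384, with hash-tag extraction first. This sketch follows the published Redis Cluster key distribution rules:

```python
# Redis Cluster hash-slot computation: CRC16 (XMODEM) mod 16384.
# If the key contains a hash tag {...}, only the content inside the first
# non-empty pair of braces is hashed, forcing co-location of related keys.

def crc16(data: bytes) -> int:
    """CRC16-XMODEM: polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:             # non-empty tag: hash only its content
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```

This is what cluster-aware clients run on every command to pick the target node, so {user:123}:profile and {user:123}:settings land on the same slot.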
Caching Strategy for System Design Interviews
When to add caching in your design: (1) Hot data accessed frequently — user profiles, product listings, configuration. Cache-aside with LRU eviction and TTL. (2) Expensive computations — aggregated analytics, recommendation scores, search results. Cache the result with a TTL matching the acceptable staleness. (3) Session storage — user sessions in a stateless application. Redis with TTL matching the session timeout. (4) Rate limiting — token bucket or sliding window counters stored in Redis for atomic increment operations. What NOT to cache: data that changes on every request (unique request IDs, real-time sensor data), highly sensitive data (plaintext passwords, unencrypted PII), and data with strict consistency requirements where any staleness is unacceptable. Sizing the cache: use the Pareto principle — 20% of data serves 80% of requests. Cache the hot 20%. Monitor cache hit rate (target: 90%+). If hit rate is below 80%, the cache is either too small or the TTL is too short. If hit rate is above 99%, you may be over-caching (wasting memory on cold data).
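The rate-limiting use case can be sketched as a fixed-window counter. With Redis this is typically an atomic INCR on a per-window key plus an EXPIRE; a dict stands in here so the logic runs without a server:

```python
# Fixed-window rate limiter sketch. In Redis, the counter per (client, window)
# key would be maintained with INCR and given a TTL via EXPIRE; a dict mocks
# that storage here.
import time

counters = {}  # (client_id, window_index) -> request count

def allow(client_id, limit, window_secs=60, now=None):
    """Return True if this request is within the limit for the current window."""
    now = time.time() if now is None else now
    window = int(now // window_secs)
    key = (client_id, window)
    counters[key] = counters.get(key, 0) + 1
    return counters[key] <= limit
```

Fixed windows are the simplest variant; a burst straddling a window boundary can briefly exceed the limit, which is why sliding-window or token-bucket counters (also easy to express as Redis atomic operations) are often preferred.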