Caching Strategy — Low-Level Design
A caching strategy determines what to cache, for how long, how to invalidate stale data, and how to handle cache failures. Caching is one of the most impactful performance optimizations in backend systems and is a core topic in every system design interview.
Cache-Aside (Lazy Loading)
-- Most common pattern. Application manages cache explicitly.
-- Read path:
# Assumes a redis-py client (decode_responses=True) bound to `redis`
# and a SQLAlchemy-style DB handle bound to `db`.
import json

def get_user(user_id):
    key = f'user:{user_id}'
    cached = redis.get(key)
    if cached:
        return json.loads(cached)  # Cache hit
    user = db.execute("SELECT * FROM User WHERE id=%(id)s", {'id': user_id}).first()
    if user:
        # (convert the row to a dict before dumping in real code)
        redis.setex(key, 300, json.dumps(user))  # Cache for 5 min
    return user
-- Write path: update DB, then invalidate cache
def update_user(user_id, data):
    db.execute("UPDATE User SET ... WHERE id=%(id)s", data)
    redis.delete(f'user:{user_id}')  # Invalidate on write
-- Pros: only caches what is actually read; cache failure is non-fatal
-- Cons: cold start (first request after expiry hits DB); potential stale reads
-- between write and invalidation (very short window)
Write-Through Cache
-- Write to cache and DB simultaneously on every write.
-- Cache is always up to date.
def update_user(user_id, data):
    db.execute("UPDATE User SET ... WHERE id=%(id)s", data)
    user = db.execute("SELECT * FROM User WHERE id=%(id)s", {'id': user_id}).first()
    redis.setex(f'user:{user_id}', 3600, json.dumps(user))  # Refresh cache
-- Pros: no stale reads; no cache miss after writes
-- Cons: write penalty (every write hits both DB and cache);
-- caches data that may never be read again (wasteful for write-heavy data)
Write-Behind (Write-Back) Cache
-- Write to cache first; flush to DB asynchronously.
-- Used when write throughput exceeds DB capacity.
def update_user_score(user_id, delta):
    # Write to Redis first (fast)
    new_score = redis.hincrby(f'user:{user_id}', 'score', delta)
    # Queue DB flush
    redis.sadd('dirty_users', user_id)

def flush_dirty_users():
    """Runs every few seconds from a background worker."""
    dirty_ids = redis.smembers('dirty_users')
    redis.delete('dirty_users')
    # Note: a user marked dirty between SMEMBERS and DELETE loses its
    # flag until the next write; draining with SPOP avoids this window.
    for uid in dirty_ids:
        score = redis.hget(f'user:{uid}', 'score')
        db.execute("UPDATE User SET score=%(s)s WHERE id=%(id)s",
                   {'s': score, 'id': uid})
-- Pros: very fast writes; absorbs write spikes
-- Cons: data loss if Redis crashes before flush;
-- complexity in consistency guarantees
Cache Invalidation Strategies
1. TTL-based expiry (simplest):
redis.setex(key, 300, value)
Data goes stale for up to TTL seconds. Acceptable for non-critical reads.
2. Event-driven invalidation (most correct):
On every DB write that changes cached data:
redis.delete(affected_cache_key)
Works well for known, targeted invalidations.
3. Cache tags / dependency tracking:
Tag cache keys with entity IDs:
cache.set('post:123', data, tags=['user:42', 'category:5'])
On write: invalidate all keys tagged with the changed entity.
Requires a cache server that supports tagging (Redis via sets, Varnish, etc.)
4. Version keys:
key = f'user:{user_id}:v{version}'
Increment version on write. Old version keys become unreachable
(no delete needed; they expire via TTL).
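Strategies 3 and 4 can be sketched end to end. To stay self-contained, the sketch below models the Redis structures with plain dicts (the string cache, SADD/SMEMBERS as a dict of sets, INCR as an integer counter); helper names like `set_tagged` and `vkey` are illustrative, not a library API.

```python
import json

# --- Strategy 3: cache tags via a reverse index (tag -> dependent keys) ---
cache = {}       # stand-in for Redis string keys (values would carry a TTL)
tag_index = {}   # stand-in for Redis sets (SADD / SMEMBERS)

def set_tagged(key, value, tags):
    cache[key] = json.dumps(value)
    for tag in tags:
        tag_index.setdefault(tag, set()).add(key)

def invalidate_tag(tag):
    for key in tag_index.pop(tag, set()):
        cache.pop(key, None)  # deleting an already-gone key is a no-op

set_tagged('post:123', {'title': 'hi'}, tags=['user:42', 'category:5'])
invalidate_tag('user:42')        # user 42 changed: post:123 is dropped

# --- Strategy 4: versioned keys; bump the version instead of deleting ---
versions = {}    # stand-in for an INCR-backed counter per user

def vkey(user_id):
    return f'user:{user_id}:v{versions.get(user_id, 1)}'

def get_user_cached(user_id):
    raw = cache.get(vkey(user_id))
    return json.loads(raw) if raw else None

cache[vkey(7)] = json.dumps({'name': 'Ada'})
hit = get_user_cached(7)               # reads user:7:v1
versions[7] = versions.get(7, 1) + 1   # on write: INCR the version
miss = get_user_cached(7)              # now reads user:7:v2 -> miss
```

Note that other tag sets (here `category:5`) may still reference a deleted key; that is harmless, since deleting it again is a no-op.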
Cache Stampede (Thundering Herd)
-- Problem: popular cache key expires. Thousands of requests hit
-- the DB simultaneously before the first request repopulates the cache.
-- Solution 1: Lock-based repopulation
def get_with_lock(key, fetch_fn, ttl=300):
    cached = redis.get(key)
    if cached:
        return json.loads(cached)
    lock_key = f'{key}:lock'
    acquired = redis.set(lock_key, '1', nx=True, ex=10)  # 10-second lock
    if acquired:
        try:
            value = fetch_fn()
            redis.setex(key, ttl, json.dumps(value))
            return value
        finally:
            redis.delete(lock_key)
    else:
        # Another worker is fetching; wait briefly and retry
        time.sleep(0.05)
        return get_with_lock(key, fetch_fn, ttl)
-- Solution 2: Probabilistic early expiry (XFetch)
-- Randomly refresh cache slightly before it actually expires
-- so it never goes cold under load.
def get_xfetch(key, fetch_fn, ttl):
    # redis-py has no combined get-with-ttl command, so pipeline GET + TTL
    pipe = redis.pipeline()
    pipe.get(key)
    pipe.ttl(key)  # seconds until expiry; -2 if the key does not exist
    data, remaining_ttl = pipe.execute()
    if data is not None:
        beta = 1.0  # tune: higher = more aggressive prefetch
        # log(random()) is negative, so this sum dips below zero with
        # rising probability as remaining_ttl approaches 0: a few
        # requests refresh early and the key never goes cold.
        if remaining_ttl + beta * math.log(random.random()) > 0:
            return json.loads(data)  # Serve from cache
    # Cache miss or probabilistic early refresh: fetch from DB
    value = fetch_fn()
    redis.setex(key, ttl, json.dumps(value))
    return value
Key Interview Points
- Cache-aside for reads, event invalidation for writes: This combination covers 90% of use cases correctly. Avoid write-through unless stale reads are truly unacceptable for your workload; the write penalty adds latency to every write.
- TTL is a safety net, not the primary invalidation strategy: Rely on explicit deletes on write for low-latency consistency. Use TTL as a fallback to ensure no data is ever permanently stale if an invalidation is missed.
- Cache stampede is a real production outage cause: High-traffic caches expiring simultaneously bring down DBs. Use locks or probabilistic refresh. Add a random jitter to TTLs: ttl + random(0, 60) to prevent synchronized expiry.
- Cache is not a source of truth: Redis can lose data (eviction under memory pressure, crash). Design so a cache miss always falls back to the DB correctly. The system must work with an empty cache.
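The TTL-jitter advice above amounts to a one-line helper; the name `jittered_ttl` is illustrative:

```python
import random

def jittered_ttl(base_ttl, max_jitter=60):
    # Spread expiries over [base_ttl, base_ttl + max_jitter] so keys
    # populated together do not all expire at the same instant.
    return base_ttl + random.randint(0, max_jitter)

# e.g. redis.setex(key, jittered_ttl(300), payload)
ttls = [jittered_ttl(300) for _ in range(1000)]
```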