Caching Strategy — Low-Level Design
A caching strategy determines what to cache, for how long, how to invalidate stale data, and how to handle cache failures. Caching is one of the most impactful performance optimizations in backend systems and a recurring topic in system design interviews.
Cache-Aside (Lazy Loading)
-- Most common pattern. Application manages cache explicitly.
-- Read path:
import json

def get_user(user_id):
    key = f'user:{user_id}'
    cached = redis.get(key)
    if cached:
        return json.loads(cached)  # Cache hit
    user = db.execute("SELECT * FROM User WHERE id=%(id)s", {'id': user_id}).first()
    if user:
        redis.setex(key, 300, json.dumps(user))  # Cache for 5 min
    return user
-- Write path: update DB, then invalidate cache
def update_user(user_id, data):
    # Bind the id explicitly; `data` may hold only the columns being updated
    db.execute("UPDATE User SET ... WHERE id=%(id)s", {**data, 'id': user_id})
    redis.delete(f'user:{user_id}')  # Invalidate on write
-- Pros: only caches what is actually read; cache failure is non-fatal
-- Cons: cold start (first request after expiry hits DB); potential stale reads
-- between write and invalidation (very short window)
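The claim that a cache failure is non-fatal only holds if the read path treats cache errors as misses. The following is a minimal sketch of that idea, not a definitive implementation. `BrokenCache`, `get_user_safe`, and the `load_from_db` callable are hypothetical names introduced for illustration, and the cache is assumed to raise `ConnectionError` when it is unreachable.

```python
import json
import logging

logger = logging.getLogger(__name__)


class BrokenCache:
    """Hypothetical stand-in for an unreachable cache server."""

    def get(self, key):
        raise ConnectionError("cache unavailable")

    def set(self, key, value, ttl):
        raise ConnectionError("cache unavailable")


def get_user_safe(cache, load_from_db, user_id, ttl=300):
    """Cache-aside read that degrades to a plain DB read if the cache fails."""
    key = f"user:{user_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except ConnectionError:
        logger.warning("cache read failed for %s; falling back to DB", key)

    user = load_from_db(user_id)
    if user is not None:
        try:
            cache.set(key, json.dumps(user), ttl)
        except ConnectionError:
            # Failing to populate the cache must not fail the request
            logger.warning("cache write failed for %s", key)
    return user


# Even with the cache down, the read still succeeds from the "DB"
user = get_user_safe(BrokenCache(), lambda uid: {"id": uid, "name": "Ada"}, 42)
```

With a real redis-py client, the exception to catch would be `redis.exceptions.RedisError` rather than the built-in `ConnectionError` assumed here.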
Write-Through Cache
-- Write to cache and DB simultaneously on every write.
-- Cache is always up to date.
def update_user(user_id, data):
    # Bind the id explicitly; `data` may hold only the columns being updated
    db.execute("UPDATE User SET ... WHERE id=%(id)s", {**data, 'id': user_id})
    user = db.execute("SELECT * FROM User WHERE id=%(id)s", {'id': user_id}).first()
    redis.setex(f'user:{user_id}', 3600, json.dumps(user))  # Refresh cache
-- Pros: no stale reads; no cache miss after writes
-- Cons: write penalty (every write hits both DB and cache);
-- caches data that may never be read again (wasteful for write-heavy data)
Write-Behind (Write-Back) Cache
-- Write to cache first; flush to DB asynchronously.
-- Used when write throughput exceeds DB capacity.
def update_user_score(user_id, delta):
    # Write to Redis first (fast)
    new_score = redis.hincrby(f'user:{user_id}', 'score', delta)
    # Queue DB flush
    redis.sadd('dirty_users', user_id)
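def flush_dirty_users(batch_size=1000):
    """Runs every 5 seconds via cron."""
    # SPOP removes and returns members atomically, so IDs added while the
    # flush runs stay in the set for the next run. SMEMBERS followed by
    # DELETE could silently drop IDs added between the two calls.
    dirty_ids = redis.spop('dirty_users', batch_size)
    for uid in dirty_ids:
        # Assumes a client created with decode_responses=True, so uid is a str
        score = redis.hget(f'user:{uid}', 'score')
        db.execute("UPDATE User SET score=%(s)s WHERE id=%(id)s",
                   {'s': score, 'id': uid})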
-- Pros: very fast writes; absorbs write spikes
-- Cons: data loss if Redis crashes before flush;
-- complexity in consistency guarantees
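The trade-off above can be seen in a small in-memory model. This is only a sketch under stated assumptions: the class `WriteBehindScores` and its attributes are hypothetical, with plain dicts standing in for Redis and the database. It shows that many increments coalesce into one DB write per user, and that nothing is durable until the flush runs.

```python
class WriteBehindScores:
    """In-memory model of write-behind: dicts stand in for Redis and the DB."""

    def __init__(self):
        self.cache = {}      # user_id -> score (fast store)
        self.db = {}         # user_id -> score (durable store)
        self.dirty = set()   # user IDs changed since the last flush
        self.db_writes = 0

    def increment(self, user_id, delta):
        self.cache[user_id] = self.cache.get(user_id, 0) + delta
        self.dirty.add(user_id)
        return self.cache[user_id]

    def flush(self):
        # Swap the dirty set out first so writes arriving during the flush
        # land in a fresh set instead of being lost.
        batch, self.dirty = self.dirty, set()
        for user_id in batch:
            self.db[user_id] = self.cache[user_id]
            self.db_writes += 1
        return len(batch)


store = WriteBehindScores()
for _ in range(1000):
    store.increment(7, 1)
store.increment(8, 5)
unflushed = dict(store.db)   # snapshot before the flush: nothing durable yet
flushed = store.flush()      # 1001 cache writes become 2 DB writes
```

If the fast store were lost before `flush()` ran, every increment in `unflushed`'s window would be gone, which is the data-loss risk listed in the cons.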
Cache Invalidation Strategies
1. TTL-based expiry (simplest):
   redis.setex(key, 300, value)
   Data goes stale for up to TTL seconds. Acceptable for non-critical reads.
2. Event-driven invalidation (most correct):
   On every DB write that changes cached data:
   redis.delete(affected_cache_key)
   Works well for known, targeted invalidations.
3. Cache tags / dependency tracking:
   Tag cache keys with entity IDs:
   cache.set('post:123', data, tags=['user:42', 'category:5'])
   On write: invalidate all keys tagged with the changed entity.
   Requires a cache server that supports tagging (Redis via sets, Varnish, etc.)
4. Version keys:
   key = f'user:{user_id}:v{version}'
   Increment version on write. Old version keys become unreachable
   (no delete needed; they expire via TTL).
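Tag-based invalidation (strategy 3) is the least obvious of the four, so here is one way it might look. This is a sketch rather than a real library API: `TaggedCache` and its methods are hypothetical, and an in-memory dict of sets stands in for what would be one Redis set per tag (populated with SADD, read with SMEMBERS, removed with DEL).

```python
from collections import defaultdict


class TaggedCache:
    """In-memory sketch of tag-based invalidation (hypothetical API)."""

    def __init__(self):
        self._data = {}
        # tag -> set of cache keys; one Redis set per tag in a real deployment
        self._keys_by_tag = defaultdict(set)

    def set(self, key, value, tags=()):
        self._data[key] = value
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key):
        return self._data.get(key)

    def invalidate_tag(self, tag):
        """Delete every key registered under the tag, then the tag index."""
        keys = self._keys_by_tag.pop(tag, set())
        for key in keys:
            self._data.pop(key, None)
        return len(keys)


cache = TaggedCache()
cache.set("post:123", {"title": "Hello"}, tags=["user:42", "category:5"])
cache.set("post:124", {"title": "World"}, tags=["user:42"])
cache.set("post:200", {"title": "Other"}, tags=["user:7"])

# User 42 changed: drop every cached entry that depends on that user
removed = cache.invalidate_tag("user:42")
```

After the call, `category:5` still lists `post:123` even though the entry is gone. That leftover reference is harmless here because deleting a missing key is a no-op, but a long-lived tag index would need occasional cleanup or a TTL of its own.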
Cache Stampede (Thundering Herd)
-- Problem: popular cache key expires. Thousands of requests hit
-- the DB simultaneously before the first request repopulates the cache.
-- Solution 1: Lock-based repopulation
import time

def get_with_lock(key, fetch_fn, ttl=300, max_retries=100):
    lock_key = f'{key}:lock'
    for _ in range(max_retries):
        cached = redis.get(key)
        if cached:
            return json.loads(cached)
        acquired = redis.set(lock_key, '1', nx=True, ex=10)  # 10-second lock
        if acquired:
            try:
                value = fetch_fn()
                redis.setex(key, ttl, json.dumps(value))
                return value
            finally:
                redis.delete(lock_key)
        # Another worker is fetching: wait briefly and retry
        time.sleep(0.05)
    # Lock holder is slow or stuck: fetch directly rather than wait forever
    return fetch_fn()
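One subtlety in the lock-based approach: the unconditional `redis.delete(lock_key)` in the `finally` block can remove a lock that now belongs to someone else, if the fetch outlived the 10-second expiry and another worker acquired the lock in the meantime. A common remedy is to store a unique token as the lock value and release only when the token still matches. The sketch below illustrates that idea with a hypothetical `FakeRedis` stand-in; `acquire` and `release` are illustrative helper names, not part of any library.

```python
import uuid


class FakeRedis:
    """Tiny in-memory stand-in supporting only what this sketch needs."""

    def __init__(self):
        self.store = {}

    def set(self, key, value, nx=False):
        if nx and key in self.store:
            return None
        self.store[key] = value
        return True

    def get(self, key):
        return self.store.get(key)

    def delete(self, key):
        return 1 if self.store.pop(key, None) is not None else 0


def acquire(r, lock_key):
    """Return a unique token if the lock was acquired, else None."""
    token = uuid.uuid4().hex
    return token if r.set(lock_key, token, nx=True) else None


def release(r, lock_key, token):
    """Delete the lock only if this caller still owns it."""
    # On a real Redis server the check and the delete must run as one atomic
    # step (for example a short Lua script). They are separate here only
    # because FakeRedis is single-threaded.
    if r.get(lock_key) == token:
        r.delete(lock_key)
        return True
    return False


r = FakeRedis()
token_a = acquire(r, "user:1:lock")
r.delete("user:1:lock")                          # simulate A's lock expiring mid-fetch
token_b = acquire(r, "user:1:lock")              # worker B takes over
a_released = release(r, "user:1:lock", token_a)  # A finishes late
b_still_holds = r.get("user:1:lock") == token_b
```

Worker A's late release is rejected, so worker B's lock survives and no third worker can slip in behind it.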
-- Solution 2: Probabilistic early expiry (XFetch)
-- Randomly refresh cache slightly before it actually expires
-- so it never goes cold under load.
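def get_xfetch(key, fetch_fn, ttl, beta=1.0, delta=1.0):
    # beta: tune, higher = more aggressive prefetch
    # delta: estimated seconds to recompute the value (an assumed parameter;
    # XFetch usually stores the measured recompute time alongside the data)
    pipe = redis.pipeline()
    pipe.get(key)
    pipe.ttl(key)
    data, remaining_ttl = pipe.execute()
    if data:
        # -log(u) for u in (0, 1] is a non-negative random gap; refresh early
        # only when that gap, scaled by delta * beta, reaches the remaining TTL
        gap = -delta * beta * math.log(1.0 - random.random())
        if gap < remaining_ttl:
            return json.loads(data)  # Serve from cache
    # Cache miss or probabilistic refresh: fetch from DB
    value = fetch_fn()
    redis.setex(key, ttl, json.dumps(value))
    return value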
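The early-refresh decision is easier to reason about as a pure function. In the usual formulation of XFetch, a request recomputes early when `-delta * beta * log(rand)` is at least the remaining TTL, where `delta` is how long a recomputation takes. The sketch below follows that formulation; the names `should_refresh_early` and `refresh_rate` are illustrative, and the small simulation only estimates how often a refresh would fire at two points in a key's lifetime.

```python
import math
import random


def should_refresh_early(remaining_ttl, delta, beta=1.0, rand=None):
    """XFetch test: True means recompute now, before the key expires.

    remaining_ttl: seconds until the key expires
    delta: seconds the last recomputation took
    beta: values above 1 refresh earlier, below 1 later
    rand: uniform sample in (0, 1]; drawn here if not supplied
    """
    if rand is None:
        rand = 1.0 - random.random()   # (0, 1], so log() never sees 0
    early_by = -delta * beta * math.log(rand)   # exponentially distributed, >= 0
    return early_by >= remaining_ttl


rng = random.Random(0)


def refresh_rate(remaining_ttl, trials=10_000):
    """Fraction of simulated requests that would trigger an early refresh."""
    hits = sum(
        should_refresh_early(remaining_ttl, delta=1.0, rand=1.0 - rng.random())
        for _ in range(trials)
    )
    return hits / trials


far = refresh_rate(60.0)   # a minute left: early refresh is vanishingly rare
near = refresh_rate(0.5)   # half a second left: a large share of requests refresh
```

The probability of refreshing is `exp(-remaining_ttl / (delta * beta))`, so it stays negligible for most of the key's life and rises sharply in the last few multiples of `delta` before expiry.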
Key Interview Points
- Cache-aside for reads, event invalidation for writes: This combination handles most use cases correctly. Avoid write-through unless you have specific stale-read requirements, because the write penalty adds latency to every write.
- TTL is a safety net, not the primary invalidation strategy: Rely on explicit deletes on write for low-latency consistency. Use TTL as a fallback to ensure no data is ever permanently stale if an invalidation is missed.
- Cache stampede is a real production outage cause: High-traffic caches expiring simultaneously bring down DBs. Use locks or probabilistic refresh. Add a random jitter to TTLs: ttl + random(0, 60) to prevent synchronized expiry.
- Cache is not a source of truth: Redis can lose data (eviction under memory pressure, crash). Design so a cache miss always falls back to the DB correctly. The system must work with an empty cache.
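The TTL jitter mentioned in the stampede point can be sketched in a few lines. `jittered_ttl` is a hypothetical helper name, and the 60-second jitter window simply mirrors the `ttl + random(0, 60)` suggestion above.

```python
import random


def jittered_ttl(base_ttl, max_jitter=60, rng=random):
    """Return base_ttl plus a uniform random offset in [0, max_jitter] seconds."""
    return base_ttl + rng.randint(0, max_jitter)


# Keys written at the same moment now expire spread across a 60-second window
rng = random.Random(1)
ttls = [jittered_ttl(300, 60, rng) for _ in range(1000)]
```

The result would be passed wherever a fixed TTL is used today, for example as the second argument to `redis.setex`.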