CDN Cache System: Low-Level Design
A CDN cache sits between origin servers and end users, serving cached copies of static and dynamic content from edge nodes geographically close to the requester. The critical design challenges are cache key construction (what makes two requests cacheable as one), TTL management, purge propagation across hundreds of edge nodes, and handling the stampede when a popular cached object expires. This design covers the cache control layer, purge API, and cache warming patterns.
Core Data Model (Origin-Side Cache Manifest)
CREATE TABLE CacheRule (
    rule_id        SERIAL PRIMARY KEY,
    url_pattern    VARCHAR(500) NOT NULL,   -- "/api/v1/products/*", "/static/**"
    cache_ttl      INT NOT NULL,            -- seconds; 0 = do not cache
    stale_ttl      INT NOT NULL DEFAULT 0,  -- serve stale while revalidating (stale-while-revalidate)
    vary_headers   TEXT[],                  -- ["Accept-Language", "Accept-Encoding"]
    cache_control  VARCHAR(200),            -- raw Cache-Control header to send downstream
    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE CachePurgeRequest (
    purge_id      BIGSERIAL PRIMARY KEY,
    purge_type    VARCHAR(20) NOT NULL,    -- url, tag, prefix, all
    purge_target  VARCHAR(1000) NOT NULL,  -- exact URL, tag name, or prefix
    submitted_by  BIGINT,
    status        VARCHAR(20) NOT NULL DEFAULT 'pending',  -- pending, propagating, complete, failed
    edge_acks     INT NOT NULL DEFAULT 0,
    total_edges   INT NOT NULL DEFAULT 0,
    submitted_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at  TIMESTAMPTZ
);
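CREATE TABLE CacheTag (
    tag_id      BIGSERIAL PRIMARY KEY,
    cache_key   VARCHAR(1000) NOT NULL,
    tag         VARCHAR(200) NOT NULL,  -- "product:SKU-123", "category:electronics"
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- PostgreSQL does not accept an inline INDEX clause inside CREATE TABLE,
-- so the tag index is created as a separate statement.
CREATE INDEX ON CacheTag(tag);
CREATE INDEX ON CachePurgeRequest(status, submitted_at);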
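As a rough illustration of how a CacheRule row might be applied at request time, the sketch below matches a request path against `url_pattern` and derives a Cache-Control value from `cache_ttl` and `stale_ttl`. Several things here are assumptions rather than part of the design above: the `Rule` dataclass, the helper names, the glob semantics (`*` stays within one path segment, `**` spans any depth), the first-match-wins ordering, and the use of `no-store` when `cache_ttl` is 0.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """In-memory stand-in for one CacheRule row (hypothetical helper type)."""
    url_pattern: str
    cache_ttl: int
    stale_ttl: int = 0


def pattern_to_regex(pattern: str):
    # Assumed semantics: "**" spans any number of path segments,
    # "*" stays within a single segment.
    out = []
    for part in re.split(r"(\*\*|\*)", pattern):
        if part == "**":
            out.append(".*")
        elif part == "*":
            out.append("[^/]*")
        else:
            out.append(re.escape(part))
    return re.compile("^" + "".join(out) + "$")


def match_rule(path: str, rules: list) -> Optional[Rule]:
    # First matching rule wins; this ordering policy is an assumption.
    for rule in rules:
        if pattern_to_regex(rule.url_pattern).match(path):
            return rule
    return None


def cache_control_for(rule: Optional[Rule]) -> str:
    # cache_ttl = 0 means "do not cache"; mapping that to no-store is an assumption.
    if rule is None or rule.cache_ttl <= 0:
        return "no-store"
    header = f"max-age={rule.cache_ttl}"
    if rule.stale_ttl > 0:
        header += f", stale-while-revalidate={rule.stale_ttl}"
    return header


rules = [
    Rule("/api/v1/products/*", cache_ttl=60, stale_ttl=30),
    Rule("/static/**", cache_ttl=86400),
]
```

Under these assumptions, `/api/v1/products/42` picks up the first rule while `/api/v1/products/42/reviews` matches nothing, because a single `*` does not cross a `/`.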
Cache Key Construction
import hashlib
import urllib.parse

def build_cache_key(url: str, vary_headers: dict, rule: dict) -> str:
    """
    Cache key = normalized URL + sorted vary headers.
    Must be deterministic: same request always maps to the same key.
    `vary_headers` holds the request headers keyed by lowercase name.
    """
    # Normalize URL: sort query params to treat ?b=1&a=2 same as ?a=2&b=1.
    # keep_blank_values=True so ?a=&b=1 and ?b=1 do not collapse into one key.
    parsed = urllib.parse.urlparse(url)
    params = sorted(urllib.parse.parse_qsl(parsed.query, keep_blank_values=True))
    normalized_url = parsed._replace(query=urllib.parse.urlencode(params)).geturl()

    # Include only the Vary headers specified in the cache rule
    vary_values = []
    for header in sorted(rule.get('vary_headers') or []):
        value = vary_headers.get(header.lower(), '')
        vary_values.append(f"{header}={value}")

    raw_key = normalized_url + '|' + '|'.join(vary_values)
    # Hash the raw key so every stored key has a fixed length
    return hashlib.sha256(raw_key.encode()).hexdigest()

# Example:
# key = build_cache_key(
#     "https://api.example.com/products?sort=price&category=shoes",
#     {"accept-language": "en-US", "accept-encoding": "gzip"},
#     {"vary_headers": ["Accept-Language"]}
# )
# → sha256("https://api.example.com/products?category=shoes&sort=price|Accept-Language=en-US")
Cache Storage Layer (Edge Node)
import json
import time
from dataclasses import dataclass
from typing import Optional

import redis

r = redis.Redis(decode_responses=False)  # binary for body storage

@dataclass
class CachedResponse:
    status_code: int
    headers: dict
    body: bytes
    cached_at: float
    ttl: int
    stale_ttl: int
    tags: list

def get_cached(cache_key: str) -> Optional[CachedResponse]:
    data = r.get(f"cdn:{cache_key}")
    if not data:
        return None
    obj = json.loads(data)
    # set_cached stores the body as a latin-1 string; turn it back into bytes
    obj['body'] = obj['body'].encode('latin-1')
    return CachedResponse(**obj)
def set_cached(cache_key: str, resp: CachedResponse):
    # TTL = cache_ttl + stale_ttl so we can serve stale during revalidation
    total_ttl = resp.ttl + resp.stale_ttl
    if total_ttl <= 0:
        return  # cache_ttl = 0 means "do not cache"; SETEX also rejects a zero expiry
    payload = json.dumps({
        'status_code': resp.status_code,
        'headers': resp.headers,
        'body': resp.body.decode('latin-1'),  # latin-1 maps every byte 1:1, so binary bodies survive JSON
        'cached_at': resp.cached_at,
        'ttl': resp.ttl,
        'stale_ttl': resp.stale_ttl,
        'tags': resp.tags,
    })
    r.setex(f"cdn:{cache_key}", total_ttl, payload)
    # Index by tags for tag-based purge
    for tag in resp.tags:
        r.sadd(f"cdntag:{tag}", cache_key)
        r.expire(f"cdntag:{tag}", total_ttl + 3600)
def is_fresh(resp: CachedResponse) -> bool:
    age = time.time() - resp.cached_at
    return age < resp.ttl

def is_stale_usable(resp: CachedResponse) -> bool:
    """True if within stale-while-revalidate window."""
    age = time.time() - resp.cached_at
    return age < resp.ttl + resp.stale_ttl

# The original signature line was lost; get_or_fetch is a placeholder name.
def get_or_fetch(cache_key: str, fetch_origin) -> CachedResponse:
    """
    Cache hit flow:
    1. Fresh → serve cached
    2. Stale but within stale_ttl → serve cached + background revalidation
    3. Expired completely → fetch from origin (stampede protection applies)
    """
    cached = get_cached(cache_key)
    if cached and is_fresh(cached):
        return cached
    if cached and is_stale_usable(cached):
        # Serve stale immediately; trigger async background revalidation
        # (_trigger_background_revalidation is a helper not shown here)
        _trigger_background_revalidation(cache_key, fetch_origin)
        return cached
    # Cache miss or fully expired → fetch from origin with stampede protection
    return _fetch_with_lock(cache_key, fetch_origin)
def _fetch_with_lock(cache_key: str, fetch_origin) -> CachedResponse:
    """
    Only one request fetches from origin when cache is empty.
    Others wait on the lock and then read the just-populated cache.
    """
    lock_key = f"cdn_lock:{cache_key}"
    lock = r.set(lock_key, '1', nx=True, ex=5)  # 5-second lock
    if lock:
        try:
            resp = fetch_origin()
            set_cached(cache_key, resp)
            return resp
        finally:
            # Caveat: if the fetch outlives the 5-second TTL, this delete can
            # release a lock that another request has since acquired.
            r.delete(lock_key)
    else:
        # Wait for the lock holder to populate cache
        for _ in range(50):  # 50 x 0.1 s = max 5 seconds
            time.sleep(0.1)
            cached = get_cached(cache_key)
            if cached:
                return cached
        # Fallback: fetch directly if lock holder failed
        return fetch_origin()
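The background revalidation step referenced in the hit flow is not defined above. One way it might look is sketched below, using a thread per refresh and an in-process set to avoid starting two refreshes for the same key. The function name, the `store` callback parameter, and the thread-based approach are all assumptions for illustration; on an edge node running several worker processes, the in-process set would need to be replaced by a shared guard such as the Redis lock used for stampede protection.

```python
import threading

_inflight = set()                  # cache keys with a refresh already running
_inflight_lock = threading.Lock()  # protects _inflight


def trigger_background_revalidation(cache_key, fetch_origin, store):
    """Start at most one background refresh per cache key.

    `store(cache_key, response)` is a hypothetical callback that writes the
    fresh response into the cache (the role set_cached plays above).
    Returns the started thread, or None if a refresh is already in flight.
    """
    with _inflight_lock:
        if cache_key in _inflight:
            return None
        _inflight.add(cache_key)

    def _refresh():
        try:
            store(cache_key, fetch_origin())
        except Exception:
            # Origin failed: keep serving the stale copy; a later request retries.
            pass
        finally:
            with _inflight_lock:
                _inflight.discard(cache_key)

    worker = threading.Thread(target=_refresh, daemon=True)
    worker.start()
    return worker
```

The caller returns the stale response immediately and ignores the returned thread; it is exposed here only so the behaviour can be observed in tests.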
Purge API
# `db` (SQL client) and `redis_pubsub` (Redis connection used for broadcast)
# are assumed to be configured elsewhere.
def purge_by_tag(tag: str, submitted_by: int) -> int:
    """
    Purge all cached objects associated with a tag.
    Used when a product is updated: purge tag "product:SKU-123" invalidates
    all cached pages that include that product (PDP, search results, category pages).
    """
    purge_id = db.fetchone("""
        INSERT INTO CachePurgeRequest (purge_type, purge_target, submitted_by, status)
        VALUES ('tag', %s, %s, 'pending') RETURNING purge_id
    """, (tag, submitted_by))['purge_id']
    # Broadcast to all edge nodes via pub/sub
    purge_event = json.dumps({'purge_id': purge_id, 'type': 'tag', 'target': tag})
    redis_pubsub.publish('cdn_purge', purge_event)
    return purge_id
# Edge node purge handler (each edge node subscribes to cdn_purge channel)
def handle_purge_event(event: dict):
    if event['type'] == 'tag':
        tag = event['target']
        cache_keys = r.smembers(f"cdntag:{tag}")
        for key in cache_keys:
            r.delete(f"cdn:{key.decode()}")
        r.delete(f"cdntag:{tag}")
    elif event['type'] == 'url':
        # Built with no Vary headers, so this only removes the variant cached
        # under a rule without vary_headers; other variants need a tag purge.
        r.delete(f"cdn:{build_cache_key(event['target'], {}, {})}")
    elif event['type'] == 'prefix':
        # Scan and delete matching keys: expensive, use sparingly
        for key in r.scan_iter(match="cdn:*", count=200):
            # Keys are SHA-256 hashes, so the original URL would have to be
            # stored alongside each entry to match it against the prefix
            pass
    elif event['type'] == 'all':
        r.flushdb()  # nuclear option: full cache wipe
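The prefix branch above is left as a stub because hashed cache keys cannot be matched against a URL prefix. A minimal sketch of one possible fix is to keep a second index from cache key to the normalized URL. The `EdgeCache` class and its method names are hypothetical, and plain dictionaries stand in for Redis to keep the example self-contained; in the Redis-backed version the same index could live in a hash written alongside each entry.

```python
class EdgeCache:
    """Toy in-memory edge cache that remembers the URL behind each key."""

    def __init__(self):
        self.objects = {}  # cache_key -> cached payload
        self.url_of = {}   # cache_key -> normalized URL (the extra index)

    def put(self, cache_key, url, payload):
        self.objects[cache_key] = payload
        self.url_of[cache_key] = url

    def purge_prefix(self, prefix):
        """Delete every object whose URL starts with `prefix`; return the count."""
        doomed = [k for k, url in self.url_of.items() if url.startswith(prefix)]
        for key in doomed:
            self.objects.pop(key, None)
            self.url_of.pop(key, None)
        return len(doomed)


edge = EdgeCache()
edge.put("k1", "https://example.com/static/a.js", b"a")
edge.put("k2", "https://example.com/static/b.js", b"b")
edge.put("k3", "https://example.com/api/items", b"c")
removed = edge.purge_prefix("https://example.com/static/")
```

This still scans every indexed URL, so the cost grows with cache size; the "use sparingly" advice continues to apply.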
Key Design Decisions
- Tag-based purge over URL purge: when a product’s price changes, you can’t enumerate all URLs that show that product (search results page, category page, PDP, homepage featured section). Tag “product:SKU-123” is attached at cache time to every response that reads from that product, so a single tag purge invalidates all of them with one request. Each edge applies the purge independently, so invalidation is not atomic across the fleet.
- stale-while-revalidate: serving stale content while asynchronously fetching fresh content eliminates the “cold cache” latency spike on TTL expiry. The user gets a fast (slightly stale) response; the background refresh populates fresh content for the next request. Cache-Control: max-age=60, stale-while-revalidate=30 means: fresh for 60s, serve stale for 30s more while revalidating.
- Stampede protection via Redis lock: when a popular cached object expires, thousands of requests simultaneously miss and all attempt to fetch from origin — a thundering herd. The nx=True Redis SET (set if not exists) ensures only one request fetches; others wait and read the just-populated cache. 5-second lock TTL prevents deadlock if the fetcher crashes.
- Purge propagation via pub/sub: broadcast purge events to all edge nodes via Redis pub/sub or a message queue. Track acks in CachePurgeRequest.edge_acks — a purge is “complete” when all edges acknowledge. Alert if propagation exceeds 30 seconds.
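The ack-tracking rule in the last bullet could be sketched as follows. `PurgeTracker` is a hypothetical in-memory stand-in for a CachePurgeRequest row, not part of the schema above. It also swaps the `edge_acks` counter for a set of edge IDs, an assumption made here so that a duplicate ack from the same edge is not counted twice.

```python
import time
from dataclasses import dataclass, field


@dataclass
class PurgeTracker:
    """In-memory stand-in for one CachePurgeRequest row (hypothetical)."""
    total_edges: int
    submitted_at: float = field(default_factory=time.time)
    acked: set = field(default_factory=set)
    status: str = "propagating"

    def ack(self, edge_id: str) -> str:
        # A set makes repeated acks from the same edge harmless.
        self.acked.add(edge_id)
        if len(self.acked) >= self.total_edges:
            self.status = "complete"
        return self.status

    def is_overdue(self, now: float, limit_s: float = 30.0) -> bool:
        # Mirrors the 30-second alert threshold mentioned above.
        return self.status != "complete" and now - self.submitted_at > limit_s


tracker = PurgeTracker(total_edges=2, submitted_at=100.0)
first = tracker.ack("edge-a")
duplicate = tracker.ack("edge-a")
overdue_at_131 = tracker.is_overdue(now=131.0)
final = tracker.ack("edge-b")
```

A monitoring job would call `is_overdue` periodically and raise an alert for any purge that is still propagating past the limit.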
(2) Tag purge: if your hotfix affects a category of objects tagged at cache time ("component:header", "product:SKU-123"), purge by tag. Most enterprise CDNs (Fastly, Akamai, Cloudflare Enterprise) support Surrogate-Key or Cache-Tag headers. (3) Cache-Control: no-cache with deploy hash: change the asset URL (version in filename) so new deploys are automatically served without purge. For emergency situations where none of these work: set a short TTL (30s) temporarily, wait for natural expiry, then restore the original TTL.