CDN Cache System: Low-Level Design
A CDN cache sits between origin servers and end users, serving cached copies of static and dynamic content from edge nodes geographically close to the requester. The critical design challenges are cache key construction (what makes two requests cacheable as one), TTL management, purge propagation across hundreds of edge nodes, and handling the stampede when a popular cached object expires. This design covers the cache control layer, the purge API, and stampede-protection patterns.
Core Data Model (Origin-Side Cache Manifest)
CREATE TABLE CacheRule (
rule_id SERIAL PRIMARY KEY,
url_pattern VARCHAR(500) NOT NULL, -- "/api/v1/products/*", "/static/**"
cache_ttl INT NOT NULL, -- seconds; 0 = do not cache
stale_ttl INT NOT NULL DEFAULT 0, -- serve stale while revalidating (stale-while-revalidate)
vary_headers TEXT[], -- ["Accept-Language", "Accept-Encoding"]
cache_control VARCHAR(200), -- raw Cache-Control header to send downstream
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE CachePurgeRequest (
purge_id BIGSERIAL PRIMARY KEY,
purge_type VARCHAR(20) NOT NULL, -- url, tag, prefix, all
purge_target VARCHAR(1000) NOT NULL, -- exact URL, tag name, or prefix
submitted_by BIGINT,
status VARCHAR(20) NOT NULL DEFAULT 'pending', -- pending, propagating, complete, failed
edge_acks INT NOT NULL DEFAULT 0,
total_edges INT NOT NULL DEFAULT 0,
submitted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ
);
CREATE TABLE CacheTag (
tag_id BIGSERIAL PRIMARY KEY,
cache_key VARCHAR(1000) NOT NULL,
tag VARCHAR(200) NOT NULL, -- "product:SKU-123", "category:electronics"
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON CacheTag(tag); -- inline INDEX is MySQL syntax; Postgres needs CREATE INDEX
CREATE INDEX ON CachePurgeRequest(status, submitted_at);
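One way to apply CacheRule.url_pattern at request time is a glob-to-regex translation, with "*" matching one path segment and "**" matching any depth, as in the column comment above. This is a sketch; the helper names (pattern_to_regex, match_rule) and the longest-pattern tiebreak are illustrative assumptions, not part of the schema:

```python
import re
from typing import Optional

def pattern_to_regex(url_pattern: str) -> re.Pattern:
    # Escape regex metacharacters, then translate the two wildcard forms.
    # Replace "**" before "*" so the deeper wildcard is not mangled.
    escaped = re.escape(url_pattern)
    escaped = escaped.replace(r'\*\*', '.*')    # "**" -> any depth
    escaped = escaped.replace(r'\*', '[^/]*')   # "*"  -> one path segment
    return re.compile('^' + escaped + '$')

def match_rule(path: str, rules: list) -> Optional[dict]:
    """Return the most specific (longest-pattern) rule matching the path,
    or None, which means: do not cache."""
    candidates = [r for r in rules
                  if pattern_to_regex(r['url_pattern']).match(path)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: len(r['url_pattern']))
```

In production the compiled regexes would be cached rather than rebuilt per request, and the rule table loaded into memory at startup.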
Cache Key Construction
import hashlib, urllib.parse

def build_cache_key(url: str, vary_headers: dict, rule: dict) -> str:
    """
    Cache key = normalized URL + sorted Vary headers.
    Must be deterministic: the same request always maps to the same key.
    """
    # Normalize the URL: sort query params so ?b=1&a=2 equals ?a=2&b=1
    parsed = urllib.parse.urlparse(url)
    params = sorted(urllib.parse.parse_qsl(parsed.query))
    normalized_url = parsed._replace(query=urllib.parse.urlencode(params)).geturl()
    # Include only the Vary headers specified in the cache rule
    vary_values = []
    for header in sorted(rule.get('vary_headers') or []):
        value = vary_headers.get(header.lower(), '')
        vary_values.append(f"{header}={value}")
    raw_key = normalized_url + '|' + '|'.join(vary_values)
    # Hash long keys to a fixed-length key for storage
    return hashlib.sha256(raw_key.encode()).hexdigest()

# Example:
# key = build_cache_key(
#     "https://api.example.com/products?sort=price&category=shoes",
#     {"accept-language": "en-US", "accept-encoding": "gzip"},
#     {"vary_headers": ["Accept-Language"]}
# )
# → sha256("https://api.example.com/products?category=shoes&sort=price|Accept-Language=en-US")
Cache Storage Layer (Edge Node)
import redis, time, json
from dataclasses import dataclass
from typing import Optional

r = redis.Redis(decode_responses=False)  # binary-safe for body storage

@dataclass
class CachedResponse:
    status_code: int
    headers: dict
    body: bytes
    cached_at: float
    ttl: int
    stale_ttl: int
    tags: list

def get_cached(cache_key: str) -> Optional[CachedResponse]:
    data = r.get(f"cdn:{cache_key}")
    if not data:
        return None
    obj = json.loads(data)
    obj['body'] = obj['body'].encode('latin-1')  # restore the binary body stored below
    return CachedResponse(**obj)

def set_cached(cache_key: str, resp: CachedResponse):
    # Keep the entry for cache_ttl + stale_ttl so stale content
    # can still be served during revalidation
    total_ttl = resp.ttl + resp.stale_ttl
    payload = json.dumps({
        'status_code': resp.status_code,
        'headers': resp.headers,
        'body': resp.body.decode('latin-1'),  # latin-1 round-trips arbitrary bytes
        'cached_at': resp.cached_at,
        'ttl': resp.ttl,
        'stale_ttl': resp.stale_ttl,
        'tags': resp.tags,
    })
    r.setex(f"cdn:{cache_key}", total_ttl, payload)
    # Index by tags for tag-based purge
    for tag in resp.tags:
        r.sadd(f"cdntag:{tag}", cache_key)
        r.expire(f"cdntag:{tag}", total_ttl + 3600)
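The latin-1 trick above works because every byte value 0–255 maps to exactly one code point and back, so arbitrary binary bodies survive the trip through JSON. A self-contained check of the round trip (no Redis needed; roundtrip is a throwaway helper for this sketch):

```python
import json

def roundtrip(body: bytes) -> bytes:
    # Encode as set_cached does, decode as get_cached does
    payload = json.dumps({'body': body.decode('latin-1')})
    return json.loads(payload)['body'].encode('latin-1')

# Every possible byte value survives intact
assert roundtrip(bytes(range(256))) == bytes(range(256))
```

A base64 encoding would also work and is more conventional, at the cost of ~33% size overhead; latin-1 keeps the payload the same length as the body.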
def is_fresh(resp: CachedResponse) -> bool:
    age = time.time() - resp.cached_at
    return age < resp.ttl

def is_stale_usable(resp: CachedResponse) -> bool:
    """True if within the stale-while-revalidate window."""
    age = time.time() - resp.cached_at
    return age < resp.ttl + resp.stale_ttl

def serve_request(cache_key: str, fetch_origin) -> CachedResponse:
    """
    Cache hit flow:
    1. Fresh → serve cached
    2. Stale but within stale_ttl → serve cached + background revalidation
    3. Expired completely → fetch from origin (stampede protection applies)
    """
    cached = get_cached(cache_key)
    if cached and is_fresh(cached):
        return cached
    if cached and is_stale_usable(cached):
        # Serve stale immediately; trigger async background revalidation
        _trigger_background_revalidation(cache_key, fetch_origin)
        return cached
    # Cache miss or fully expired → fetch from origin with stampede protection
    return _fetch_with_lock(cache_key, fetch_origin)
def _fetch_with_lock(cache_key: str, fetch_origin) -> CachedResponse:
    """
    Only one request fetches from origin when the cache is empty.
    Others wait on the lock and then read the just-populated cache.
    """
    lock_key = f"cdn_lock:{cache_key}"
    lock = r.set(lock_key, '1', nx=True, ex=5)  # 5-second lock
    if lock:
        try:
            resp = fetch_origin()
            set_cached(cache_key, resp)
            return resp
        finally:
            r.delete(lock_key)
    else:
        # Wait for the lock holder to populate the cache
        for _ in range(50):  # max 5 seconds
            time.sleep(0.1)
            cached = get_cached(cache_key)
            if cached:
                return cached
        # Fallback: fetch directly if the lock holder failed
        return fetch_origin()
Purge API
def purge_by_tag(tag: str, submitted_by: int) -> int:
    """
    Purge all cached objects associated with a tag.
    Used when a product is updated: purging tag "product:SKU-123" invalidates
    all cached pages that include that product (PDP, search results, category pages).
    """
    purge_id = db.fetchone("""
        INSERT INTO CachePurgeRequest (purge_type, purge_target, submitted_by, status)
        VALUES ('tag', %s, %s, 'pending') RETURNING purge_id
    """, (tag, submitted_by))['purge_id']
    # Broadcast to all edge nodes via pub/sub
    purge_event = json.dumps({'purge_id': purge_id, 'type': 'tag', 'target': tag})
    redis_pubsub.publish('cdn_purge', purge_event)
    return purge_id

# Edge node purge handler (each edge node subscribes to the cdn_purge channel
# and acknowledges after applying the purge locally)
def handle_purge_event(event: dict):
    if event['type'] == 'tag':
        tag = event['target']
        cache_keys = r.smembers(f"cdntag:{tag}")
        for key in cache_keys:
            r.delete(f"cdn:{key.decode()}")
        r.delete(f"cdntag:{tag}")
    elif event['type'] == 'url':
        # Must rebuild the key with the same rule used at cache time; a URL
        # cached under Vary headers has one key per variant, so all variants
        # need to be enumerated here
        r.delete(f"cdn:{build_cache_key(event['target'], {}, {})}")
    elif event['type'] == 'prefix':
        # Scan and delete matching keys — expensive, use sparingly.
        # Keys are hashes, so prefix matching requires the original URL
        # stored alongside each entry
        for key in r.scan_iter(match="cdn:*", count=200):
            pass
    elif event['type'] == 'all':
        r.flushdb()  # nuclear option — full cache wipe
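Completing the loop on CachePurgeRequest.edge_acks: after handle_purge_event, each edge publishes an ack, and the control plane counts acks until every edge has answered. A self-contained sketch with an in-memory dict standing in for the database row (the function names and idempotent-ack behavior are assumptions, not fixed by the schema):

```python
# purge_id -> {'acks': set of edge ids, 'total': int, 'status': str}
purge_state = {}

def start_purge(purge_id: int, total_edges: int):
    # Mirrors setting total_edges and status='propagating' on the DB row
    purge_state[purge_id] = {'acks': set(), 'total': total_edges,
                             'status': 'propagating'}

def record_ack(purge_id: int, edge_id: str) -> str:
    """Idempotent: a re-delivered ack from the same edge counts once.
    Returns the purge status after recording the ack."""
    state = purge_state[purge_id]
    state['acks'].add(edge_id)
    if len(state['acks']) >= state['total']:
        state['status'] = 'complete'    # would also set completed_at = NOW()
    return state['status']
```

Tracking acks as a set of edge ids (rather than a bare counter) is what makes redelivered pub/sub messages safe; the edge_acks column would then be len(acks).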
Key Design Decisions
- Tag-based purge over URL purge: when a product’s price changes, you can’t enumerate all URLs that show that product (search results page, category page, PDP, homepage featured section). Tag “product:SKU-123” is attached at cache time to every response that reads from that product. A single tag purge invalidates all of them atomically.
- stale-while-revalidate: serving stale content while asynchronously fetching fresh content eliminates the “cold cache” latency spike on TTL expiry. The user gets a fast (slightly stale) response; the background refresh populates fresh content for the next request. Cache-Control: max-age=60, stale-while-revalidate=30 means: fresh for 60s, serve stale for 30s more while revalidating.
- Stampede protection via Redis lock: when a popular cached object expires, thousands of requests simultaneously miss and all attempt to fetch from origin — a thundering herd. The nx=True Redis SET (set if not exists) ensures only one request fetches; others wait and read the just-populated cache. 5-second lock TTL prevents deadlock if the fetcher crashes.
- Purge propagation via pub/sub: broadcast purge events to all edge nodes via Redis pub/sub or a message queue. Track acks in CachePurgeRequest.edge_acks — a purge is “complete” when all edges acknowledge. Alert if propagation exceeds 30 seconds.
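The stale-while-revalidate semantics above map directly onto the ttl and stale_ttl fields of CacheRule. A minimal directive parser illustrating that mapping (a sketch, not a full Cache-Control grammar; the function name is illustrative):

```python
def parse_cache_windows(cache_control: str) -> tuple:
    """Return (ttl, stale_ttl) in seconds from a Cache-Control header.
    (0, 0) means: do not cache at the CDN."""
    ttl, stale_ttl = 0, 0
    for directive in cache_control.split(','):
        directive = directive.strip().lower()
        if directive.startswith('max-age='):
            ttl = int(directive.split('=', 1)[1])
        elif directive.startswith('stale-while-revalidate='):
            stale_ttl = int(directive.split('=', 1)[1])
        elif directive in ('no-store', 'private'):
            return (0, 0)   # never cacheable by a shared cache
    return (ttl, stale_ttl)
```

So "max-age=60, stale-while-revalidate=30" yields (60, 30): fresh for 60 seconds, then 30 more seconds of stale-but-servable while a background revalidation runs.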
Frequently Asked Questions
How do you choose the right TTL for different content types?
TTL is a trade-off: a longer TTL means lower origin load and faster responses; a shorter TTL means fresher content and more purge flexibility. Practical guidelines by content type: (1) static assets with a content hash in the URL (app.a3f4b2.js, logo.c1d9e.png): cache forever (max-age=31536000, immutable) — the URL changes whenever the content changes, so stale content is impossible; (2) API responses for public, slowly changing data (product catalog, pricing): 60–300 seconds — short enough that a price change propagates within minutes; (3) user-personalized content (shopping cart, user profile): Cache-Control: private, no-store — must not be cached by the CDN at all; (4) HTML pages with embedded version numbers: 60 seconds with stale-while-revalidate=60 — users get fast loads, and updates roll out within 2 minutes; (5) images, fonts, CSS: 1 year with a content hash, 7 days without.
How does a CDN decide which edge node to serve a request from?
Anycast routing: the CDN announces the same IP address prefixes from many points of presence worldwide via BGP. The CDN's domain resolves to an anycast IP, and BGP routing (not DNS) carries each packet to the topologically closest edge node — the one with the fewest network hops from the user's ISP. The selection happens at the network layer, before the first TCP packet reaches the CDN. The alternative is GeoDNS: the authoritative DNS server returns different IP addresses based on the requester's IP geolocation. Anycast is generally preferred because failover does not wait on DNS TTL expiry (BGP re-routes automatically), while GeoDNS re-routing only takes effect once cached DNS records expire. CDNs like Cloudflare use anycast; older CDNs used GeoDNS.
How do you handle cache poisoning attacks on a CDN?
In a cache poisoning attack, an attacker crafts a request that causes the CDN to cache a malicious response, which is then served to all subsequent users. Common vectors: (1) HTTP header injection — if the cache key omits a header that influences the response, an attacker can inject a malicious value via that header; (2) unkeyed inputs — query parameters stripped during normalization that still affect the response body. Mitigations: (1) strict cache key construction: include in the key every input that affects the response; (2) normalize and validate inputs before caching: reject requests with unusual header values; (3) Content-Security-Policy headers on cached responses limit the damage from injected content; (4) use Vary explicitly — a Vary: Accept-Language response header makes the CDN create separate cache entries per language, preventing cross-language poisoning; (5) never cache 4xx responses except 404 (brief TTL) and 410 (long TTL for permanently gone resources).
How does stale-while-revalidate improve perceived performance without sacrificing freshness?
Standard TTL-based caching has a "cold miss" problem: when a popular object expires, all in-flight requests simultaneously miss the cache and hammer the origin. stale-while-revalidate splits the object's lifetime into two windows: the fresh window (serve cached) and the stale window (serve cached while fetching fresh content in the background). Example: Cache-Control: max-age=60, stale-while-revalidate=300. For the first 60 seconds, serve cached (fresh). For seconds 61–360, serve cached (stale) and send one background revalidation request to the origin. After 360 seconds, it is a full cache miss with a synchronous origin fetch. Result: users never wait for an origin round-trip except on the very first request and rare full-expiry misses. Background revalidation keeps the cache near-fresh (usually under 1 second stale for popular objects with many requests), and origin load drops dramatically compared with plain TTL expiry.
How do you purge a CDN cache during a hotfix deployment without a full cache wipe?
A full cache wipe is the nuclear option: it instantly makes every cached object stale, sending a thundering herd to the origin. For a hotfix, purge surgically: (1) URL purge: if you know exactly which URLs changed, purge those specific URLs — most CDNs offer an API endpoint for this (Cloudflare: POST /zones/{zone_id}/purge_cache with {"files": [url1, url2]}); (2) tag purge: if the hotfix affects a category of objects tagged at cache time ("component:header", "product:SKU-123"), purge by tag — most enterprise CDNs (Fastly, Akamai, Cloudflare Enterprise) support Surrogate-Key or Cache-Tag headers; (3) versioned asset URLs: put a version or content hash in the filename so each deploy is served fresh without any purge at all. If none of these apply, temporarily set a short TTL (30 seconds), wait for natural expiry, then restore the original TTL.
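The TTL guidelines in the first FAQ answer can be expressed as a lookup from content class to Cache-Control header. The class names and the 120-second public-API value are illustrative choices within the ranges stated above:

```python
TTL_POLICY = {
    'hashed_static':   'public, max-age=31536000, immutable',  # content hash in URL
    'public_api':      'public, max-age=120',                  # within the 60-300s range
    'personalized':    'private, no-store',                    # never CDN-cached
    'html':            'public, max-age=60, stale-while-revalidate=60',
    'unhashed_static': 'public, max-age=604800',               # 7 days
}

def cache_control_for(content_class: str) -> str:
    # Unknown content defaults to not caching: the safe failure mode,
    # since over-caching is a correctness bug and under-caching only costs latency
    return TTL_POLICY.get(content_class, 'private, no-store')
```

These headers would typically be populated into CacheRule.cache_control by an admin tool rather than hard-coded at the edge.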