CDN Cache System Low-Level Design: Cache Keys, TTL Management, Tag Purge, and Stampede Protection

CDN Cache System: Low-Level Design

A CDN cache sits between origin servers and end users, serving cached copies of static and dynamic content from edge nodes geographically close to the requester. The critical design challenges are cache key construction (what makes two requests cacheable as one), TTL management, purge propagation across hundreds of edge nodes, and handling the stampede when a popular cached object expires. This design covers the cache rule model, cache key construction, the edge storage layer with stale-while-revalidate, stampede protection, and the purge API.

Core Data Model (Origin-Side Cache Manifest)

CREATE TABLE CacheRule (
    rule_id        SERIAL PRIMARY KEY,
    url_pattern    VARCHAR(500) NOT NULL,  -- "/api/v1/products/*", "/static/**"
    cache_ttl      INT NOT NULL,           -- seconds; 0 = do not cache
    stale_ttl      INT NOT NULL DEFAULT 0, -- serve stale while revalidating (stale-while-revalidate)
    vary_headers   TEXT[],                 -- ["Accept-Language", "Accept-Encoding"]
    cache_control  VARCHAR(200),           -- raw Cache-Control header to send downstream
    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE CachePurgeRequest (
    purge_id       BIGSERIAL PRIMARY KEY,
    purge_type     VARCHAR(20) NOT NULL,  -- url, tag, prefix, all
    purge_target   VARCHAR(1000) NOT NULL, -- exact URL, tag name, or prefix
    submitted_by   BIGINT,
    status         VARCHAR(20) NOT NULL DEFAULT 'pending',  -- pending, propagating, complete, failed
    edge_acks      INT NOT NULL DEFAULT 0,
    total_edges    INT NOT NULL DEFAULT 0,
    submitted_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at   TIMESTAMPTZ
);

CREATE TABLE CacheTag (
    tag_id         BIGSERIAL PRIMARY KEY,
    cache_key      VARCHAR(1000) NOT NULL,
    tag            VARCHAR(200) NOT NULL,   -- "product:SKU-123", "category:electronics"
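    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- PostgreSQL has no inline INDEX clause in CREATE TABLE; create the index separately
CREATE INDEX ON CacheTag(tag);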

CREATE INDEX ON CachePurgeRequest(status, submitted_at);
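
The schema does not say how a request path resolves to a CacheRule row. The following is a minimal sketch, assuming that `*` matches within a single path segment, that `**` matches across segments, and that the longest matching pattern wins. The schema comment hints at these semantics but does not define them. The helper names `pattern_to_regex`, `match_rule`, and `cache_control_for` are hypothetical, and mapping `cache_ttl = 0` to `no-store` is likewise an assumption.

```python
import re
from typing import Optional

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a url_pattern into a compiled regex.
    Assumed semantics: '**' spans path segments, '*' stays within one."""
    out = []
    # The capturing group keeps the wildcards as separate list items
    for part in re.split(r"(\*\*|\*)", pattern):
        if part == "**":
            out.append(".*")
        elif part == "*":
            out.append("[^/]*")
        else:
            out.append(re.escape(part))
    return re.compile("^" + "".join(out) + "$")

def match_rule(path: str, rules: list) -> Optional[dict]:
    """Return the matching rule with the longest pattern, or None."""
    hits = [r for r in rules if pattern_to_regex(r["url_pattern"]).match(path)]
    return max(hits, key=lambda r: len(r["url_pattern"]), default=None)

def cache_control_for(rule: dict) -> str:
    """Derive a Cache-Control value from cache_ttl / stale_ttl unless the
    rule carries an explicit cache_control override."""
    if rule.get("cache_control"):
        return rule["cache_control"]
    if rule["cache_ttl"] == 0:
        return "no-store"  # assumption: 'do not cache' maps to no-store
    value = f"max-age={rule['cache_ttl']}"
    if rule.get("stale_ttl"):
        value += f", stale-while-revalidate={rule['stale_ttl']}"
    return value

rules = [
    {"url_pattern": "/api/v1/products/*", "cache_ttl": 60, "stale_ttl": 30},
    {"url_pattern": "/static/**", "cache_ttl": 86400, "stale_ttl": 0},
]
rule = match_rule("/api/v1/products/42", rules)
header = cache_control_for(rule)
```

Compiling the regex on every lookup keeps the sketch short; a real control layer would likely compile each pattern once when rules are loaded.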

Cache Key Construction

import hashlib, urllib.parse
from typing import Optional

def build_cache_key(url: str, vary_headers: dict, rule: dict) -> str:
    """
    Cache key = normalized URL + sorted vary headers.
    Must be deterministic: same request always maps to the same key.
    """
    # Normalize URL: sort query params to treat ?b=1&a=2 same as ?a=2&b=1
    parsed = urllib.parse.urlparse(url)
    params = sorted(urllib.parse.parse_qsl(parsed.query))
    normalized_url = parsed._replace(query=urllib.parse.urlencode(params)).geturl()

    # Include only the Vary headers specified in the cache rule
    vary_values = []
    for header in sorted(rule.get('vary_headers') or []):
        value = vary_headers.get(header.lower(), '')
        vary_values.append(f"{header}={value}")

    raw_key = normalized_url + '|' + '|'.join(vary_values)
    # Hash long keys to a fixed-length key for storage
    return hashlib.sha256(raw_key.encode()).hexdigest()

# Example:
# key = build_cache_key(
#     "https://api.example.com/products?sort=price&category=shoes",
#     {"accept-language": "en-US", "accept-encoding": "gzip"},
#     {"vary_headers": ["Accept-Language"]}
# )
# → sha256("https://api.example.com/products?category=shoes&sort=price|Accept-Language=en-US")

Cache Storage Layer (Edge Node)

import redis, time, json
from dataclasses import dataclass
from typing import Optional

r = redis.Redis(decode_responses=False)  # binary for body storage

@dataclass
class CachedResponse:
    status_code: int
    headers: dict
    body: bytes
    cached_at: float
    ttl: int
    stale_ttl: int
    tags: list

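def get_cached(cache_key: str) -> Optional[CachedResponse]:
    data = r.get(f"cdn:{cache_key}")
    if not data:
        return None
    obj = json.loads(data)
    # set_cached stores the body as a latin-1 string; restore the raw bytes
    obj['body'] = obj['body'].encode('latin-1')
    return CachedResponse(**obj)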

def set_cached(cache_key: str, resp: CachedResponse):
    # TTL = cache_ttl + stale_ttl so we can serve stale during revalidation
    total_ttl = resp.ttl + resp.stale_ttl
    payload = json.dumps({
        'status_code': resp.status_code,
        'headers': resp.headers,
        'body': resp.body.decode('latin-1'),  # preserve binary bytes
        'cached_at': resp.cached_at,
        'ttl': resp.ttl,
        'stale_ttl': resp.stale_ttl,
        'tags': resp.tags,
    })
    r.setex(f"cdn:{cache_key}", total_ttl, payload)
    # Index by tags for tag-based purge
    for tag in resp.tags:
        r.sadd(f"cdntag:{tag}", cache_key)
        r.expire(f"cdntag:{tag}", total_ttl + 3600)

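def is_fresh(resp: CachedResponse) -> bool:
    age = time.time() - resp.cached_at
    return age < resp.ttl

def is_stale_usable(resp: CachedResponse) -> bool:
    """True if within stale-while-revalidate window."""
    age = time.time() - resp.cached_at
    return age < resp.ttl + resp.stale_ttl

# NOTE: serve_request is an assumed name for this entry point.
def serve_request(cache_key: str, fetch_origin) -> CachedResponse: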
    """
    Cache hit flow:
    1. Fresh → serve cached
    2. Stale but within stale_ttl → serve cached + background revalidation
    3. Expired completely → fetch from origin (stampede protection applies)
    """
    cached = get_cached(cache_key)

    if cached and is_fresh(cached):
        return cached

    if cached and is_stale_usable(cached):
        # Serve stale immediately; trigger async background revalidation
        _trigger_background_revalidation(cache_key, fetch_origin)
        return cached

    # Cache miss or fully expired → fetch from origin with stampede protection
    return _fetch_with_lock(cache_key, fetch_origin)

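import uuid

# Compare-and-delete: release the lock only if this request still owns it.
_RELEASE_LOCK = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""

def _fetch_with_lock(cache_key: str, fetch_origin) -> CachedResponse:
    """
    Only one request fetches from origin when cache is empty.
    Others wait on the lock and then read the just-populated cache.
    """
    lock_key = f"cdn_lock:{cache_key}"
    token = uuid.uuid4().hex  # identifies this lock holder
    lock = r.set(lock_key, token, nx=True, ex=5)  # 5-second lock

    if lock:
        try:
            resp = fetch_origin()
            set_cached(cache_key, resp)
            return resp
        finally:
            # A plain DELETE could remove a lock that expired during a slow
            # fetch and was then re-acquired by another request.
            r.eval(_RELEASE_LOCK, 1, lock_key, token)
    else:
        # Wait for the lock holder to populate cache
        for _ in range(50):  # max 5 seconds
            time.sleep(0.1)
            cached = get_cached(cache_key)
            if cached:
                return cached
        # Fallback: fetch directly if lock holder failed
        return fetch_origin()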
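
The request flow above calls `_trigger_background_revalidation`, which this section never defines. One possible shape is sketched below, assuming a thread-per-refresh model and de-duplication within a single process. The function name, the `store` parameter (which would be `set_cached` on an edge node), and the module-level `_inflight` set are all hypothetical.

```python
import threading
from typing import Callable, Optional

_inflight = set()                   # cache keys with a refresh in progress
_inflight_guard = threading.Lock()  # protects _inflight

def trigger_background_revalidation(
    cache_key: str,
    fetch_origin: Callable[[], object],
    store: Callable[[str, object], None],
) -> Optional[threading.Thread]:
    """Refresh cache_key in a daemon thread.
    Returns None if a refresh for the same key is already running."""
    with _inflight_guard:
        if cache_key in _inflight:
            return None
        _inflight.add(cache_key)

    def _worker() -> None:
        try:
            store(cache_key, fetch_origin())
        except Exception:
            # Origin failed: keep serving stale; a later request retries
            pass
        finally:
            with _inflight_guard:
                _inflight.discard(cache_key)

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread

# Usage with a plain dict standing in for the cache
demo_store = {}
worker = trigger_background_revalidation(
    "k1", lambda: "fresh", demo_store.__setitem__
)
worker.join(2)
```

This only de-duplicates refreshes inside one process. Across processes or nodes, the same `SET ... NX` lock used for stampede protection could guard the refresh.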

Purge API

def purge_by_tag(tag: str, submitted_by: int) -> int:
    """
    Purge all cached objects associated with a tag.
    Used when a product is updated: purge tag "product:SKU-123" invalidates
    all cached pages that include that product (PDP, search results, category pages).
    """
    purge_id = db.fetchone("""
        INSERT INTO CachePurgeRequest (purge_type, purge_target, submitted_by, status)
        VALUES ('tag', %s, %s, 'pending') RETURNING purge_id
    """, (tag, submitted_by))['purge_id']

    # Broadcast to all edge nodes via pub/sub
    purge_event = json.dumps({'purge_id': purge_id, 'type': 'tag', 'target': tag})
    redis_pubsub.publish('cdn_purge', purge_event)

    return purge_id

# Edge node purge handler (each edge node subscribes to cdn_purge channel)
def handle_purge_event(event: dict):
    if event['type'] == 'tag':
        tag = event['target']
        cache_keys = r.smembers(f"cdntag:{tag}")
        for key in cache_keys:
            r.delete(f"cdn:{key.decode()}")
        r.delete(f"cdntag:{tag}")
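    elif event['type'] == 'url':
        # Matches only the entry cached under a rule with no Vary headers;
        # variants keyed on Vary values hash to different keys and survive.
        r.delete(f"cdn:{build_cache_key(event['target'], {}, {})}")
    elif event['type'] == 'prefix':
        # Scan and delete matching keys: expensive, use sparingly
        for key in r.scan_iter(match="cdn:*", count=200):
            # Not implemented: keys are SHA-256 hashes, so the original URL
            # would need to be stored alongside each entry to match a prefix
            pass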
    elif event['type'] == 'all':
        r.flushdb()  # nuclear option — full cache wipe
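
The prefix branch above is a no-op because hashed keys cannot be matched against a URL prefix. One way to close that gap is sketched here, assuming the edge keeps a side index from cache key to normalized URL. `UrlIndexedCache` is a hypothetical in-memory stand-in, not part of the design above; on an edge node the index could live in Redis next to the `cdn:` entries.

```python
from typing import Dict, List

class UrlIndexedCache:
    """In-memory stand-in for an edge store that remembers each entry's URL."""

    def __init__(self) -> None:
        self._entries: Dict[str, bytes] = {}  # cache_key -> payload
        self._urls: Dict[str, str] = {}       # cache_key -> normalized URL

    def put(self, cache_key: str, url: str, payload: bytes) -> None:
        self._entries[cache_key] = payload
        self._urls[cache_key] = url

    def purge_prefix(self, prefix: str) -> List[str]:
        """Delete every entry whose URL starts with prefix; return purged keys."""
        doomed = [k for k, u in self._urls.items() if u.startswith(prefix)]
        for key in doomed:
            del self._entries[key]
            del self._urls[key]
        return doomed

    def __contains__(self, cache_key: str) -> bool:
        return cache_key in self._entries

cache = UrlIndexedCache()
cache.put("a1", "https://example.com/static/app.js", b"js")
cache.put("b2", "https://example.com/static/site.css", b"css")
cache.put("c3", "https://example.com/api/v1/products", b"{}")
purged = cache.purge_prefix("https://example.com/static/")
```

The scan is linear in the number of cached entries, which matches the "expensive, use sparingly" caveat in the handler.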

Key Design Decisions

  • Tag-based purge over URL purge: when a product’s price changes, you can’t enumerate all URLs that show that product (search results page, category page, PDP, homepage featured section). Tag “product:SKU-123” is attached at cache time to every response that reads from that product. A single tag purge request then invalidates all of them. The deletes run key by key on each edge node, so the purge is not atomic across the fleet; it completes once every edge has processed the event.
  • stale-while-revalidate: serving stale content while asynchronously fetching fresh content eliminates the “cold cache” latency spike on TTL expiry. The user gets a fast (slightly stale) response; the background refresh populates fresh content for the next request. Cache-Control: max-age=60, stale-while-revalidate=30 means: fresh for 60s, serve stale for 30s more while revalidating.
  • Stampede protection via Redis lock: when a popular cached object expires, thousands of requests simultaneously miss and all attempt to fetch from origin — a thundering herd. The nx=True Redis SET (set if not exists) ensures only one request fetches; others wait and read the just-populated cache. 5-second lock TTL prevents deadlock if the fetcher crashes.
  • Purge propagation via pub/sub: broadcast purge events to all edge nodes via Redis pub/sub or a message queue. Track acks in CachePurgeRequest.edge_acks — a purge is “complete” when all edges acknowledge. Alert if propagation exceeds 30 seconds.
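
The ack tracking described in the last bullet can be sketched as a small state machine over the CachePurgeRequest columns. This is a sketch under stated assumptions: each edge acknowledges exactly once (duplicate acks are not filtered), and the state lives in memory here rather than in the database. `PurgeStatus`, `record_ack`, and `is_overdue` are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Optional

PROPAGATION_ALERT_SECONDS = 30  # alert threshold from the bullet above

@dataclass
class PurgeStatus:
    purge_id: int
    total_edges: int
    submitted_at: float
    edge_acks: int = 0
    status: str = "pending"
    completed_at: Optional[float] = None

def record_ack(p: PurgeStatus, now: Optional[float] = None) -> PurgeStatus:
    """Count one edge acknowledgement and advance the status."""
    if p.status == "complete":
        return p  # ignore late acks
    now = time.time() if now is None else now
    p.edge_acks += 1
    if p.edge_acks >= p.total_edges:
        p.status, p.completed_at = "complete", now
    else:
        p.status = "propagating"
    return p

def is_overdue(p: PurgeStatus, now: Optional[float] = None) -> bool:
    """True if the purge is still incomplete past the alert threshold."""
    now = time.time() if now is None else now
    return (
        p.status != "complete"
        and now - p.submitted_at > PROPAGATION_ALERT_SECONDS
    )

purge = PurgeStatus(purge_id=1, total_edges=3, submitted_at=100.0)
record_ack(purge, now=101.0)
```

In the database-backed version, the increment and status change would need to happen in one UPDATE statement so concurrent acks are not lost.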

(2) Tag purge: if your hotfix affects a category of objects tagged at cache time ("component:header", "product:SKU-123"), purge by tag. Most enterprise CDNs (Fastly, Akamai, Cloudflare Enterprise) support Surrogate-Key or Cache-Tag headers. (3) Cache-Control: no-cache with deploy hash: change the asset URL (version in filename) so new deploys are automatically served without purge. For emergency situations where none of these work: set a short TTL (30s) temporarily, wait for natural expiry, then restore the original TTL.
