Cache Warming Low-Level Design: Pre-warm Strategies, Stampede Prevention, and Scale-out Warm-up

Cache warming ensures a cache is pre-populated with frequently accessed data before users experience cache misses. Without warming, a fresh cache deployment causes a “cold start” storm — every request misses the cache and hits the database simultaneously, potentially causing cascading failures. Core challenges: identifying what to warm and in what order, avoiding thundering herds during deploys, warming new cache nodes when scaling out, and balancing pre-warm time against deployment velocity.

Core Data Model

-- Track cache warming job state
CREATE TABLE CacheWarmJob (
    job_id      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    cache_type  TEXT NOT NULL,      -- 'product_catalog', 'user_profile', 'feed'
    status      TEXT NOT NULL DEFAULT 'pending',  -- 'pending','running','done','failed'
    total_keys  INT,
    warmed_keys INT NOT NULL DEFAULT 0,
    started_at  TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Cache access frequency log (for identifying hot keys)
-- In practice: derived from Redis MONITOR sampling or application metrics
CREATE TABLE HotKeyLog (
    key_pattern TEXT NOT NULL,
    access_count BIGINT NOT NULL,
    sampled_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
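
HotKeyLog rows accumulate across sampling windows, so choosing warm targets means aggregating counts per pattern and taking the top N. A pure-Python sketch of that ranking step (the helper name is illustrative):

```python
from collections import Counter

def top_key_patterns(rows: list[tuple[str, int]], n: int = 10) -> list[str]:
    """Aggregate (key_pattern, access_count) samples from HotKeyLog and
    return the n hottest patterns, in the order they should be warmed."""
    totals = Counter()
    for pattern, count in rows:
        totals[pattern] += count
    return [pattern for pattern, _ in totals.most_common(n)]
```

Warming in this order means an interrupted job still leaves the highest-value keys cached.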

Strategy 1: Eager Pre-warm on Deploy

import json
import redis
import psycopg2

r = redis.Redis(host='redis', decode_responses=True)

def warm_product_catalog(conn) -> int:
    """
    Pre-warm the top 10,000 most-viewed products before traffic is routed
    to a new cache instance or after a cache flush.
    Returns the number of keys warmed.
    """
    with conn.cursor() as cur:
        # Query top products by view count from analytics (or access logs)
        cur.execute("""
            SELECT product_id, name, price_cents, description, image_url, category_id
            FROM Product
            WHERE is_active = TRUE
            ORDER BY view_count_30d DESC
            LIMIT 10000
        """)
        products = cur.fetchall()

    pipeline = r.pipeline(transaction=False)  # non-transactional for throughput
    count = 0
    for pid, name, price, desc, img, cat in products:
        key = f"product:{pid}"
        value = json.dumps({
            "product_id": str(pid), "name": name, "price_cents": price,
            "description": desc, "image_url": img, "category_id": str(cat)
        })
        pipeline.setex(key, 3600, value)  # 1-hour TTL
        count += 1
        if count % 500 == 0:
            pipeline.execute()  # flush every 500 to avoid a huge pipeline buffer
            pipeline = r.pipeline(transaction=False)

    pipeline.execute()  # flush the final partial batch
    return count

def warm_user_sessions(conn, user_ids: list[str]) -> int:
    """
    Warm session and profile data for recently active users.
    Run before deploying a new cache server in a region.
    """
    import json

    with conn.cursor() as cur:
        cur.execute("""
            SELECT u.user_id, u.name, u.email, u.subscription_tier, s.session_token
            FROM User u
            JOIN UserSession s ON u.user_id = s.user_id
            WHERE u.user_id = ANY(%s)
              AND s.expires_at > NOW()
        """, (user_ids,))
        rows = cur.fetchall()

    pipeline = r.pipeline(transaction=False)
    for uid, name, email, tier, token in rows:
        pipeline.setex(f"user:profile:{uid}", 1800,
                       json.dumps({"user_id": str(uid), "name": name,
                                   "email": email, "tier": tier}))
        # Warm the session lookup too, so token validation also hits cache
        pipeline.setex(f"session:{token}", 1800, str(uid))
    pipeline.execute()
    return len(rows)
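
The warm functions above can checkpoint progress into CacheWarmJob (the warmed_keys counter), so an interrupted run resumes instead of restarting from key zero. A minimal pure-Python sketch of the resume logic (the helper name and batch size are illustrative):

```python
from typing import Iterator, Sequence

def remaining_batches(keys: Sequence[str], warmed_keys: int,
                      batch_size: int = 500) -> Iterator[list[str]]:
    """Yield batches of keys still to warm, resuming from a checkpoint.

    `warmed_keys` is the count persisted in CacheWarmJob.warmed_keys.
    After writing each batch, the caller should advance that counter so
    a crashed or timed-out job can pick up where it stopped.
    """
    for start in range(warmed_keys, len(keys), batch_size):
        yield list(keys[start:start + batch_size])
```

Because keys are passed in descending access-frequency order, the checkpoint also guarantees the hottest keys were warmed first.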

Strategy 2: Lazy Warming with Stampede Prevention

import time
import uuid

LOCK_TTL_SEC = 10  # hold the lock for at most 10 seconds while fetching from DB

# Release the lock only if we still own it (compare-and-delete).
# A plain DELETE could remove another process's lock if our fetch outlived
# LOCK_TTL_SEC, the lock expired, and someone else re-acquired it.
_RELEASE_LUA = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""

def get_with_stampede_prevention(conn, key: str, fetch_fn, ttl: int = 3600):
    """
    Cache-aside read with lock-based stampede prevention.
    Only one process fetches from the DB on a cache miss; others wait briefly.
    """
    cached = r.get(key)
    if cached:
        return cached

    # Try to acquire a lock to be the sole fetcher
    lock_key = f"lock:{key}"
    lock_token = str(uuid.uuid4())
    acquired = r.set(lock_key, lock_token, nx=True, ex=LOCK_TTL_SEC)

    if acquired:
        try:
            value = fetch_fn(conn, key)
            if value:
                r.setex(key, ttl, value)
            return value
        finally:
            r.eval(_RELEASE_LUA, 1, lock_key, lock_token)
    else:
        # Another process is fetching — wait briefly and retry
        for _ in range(20):  # wait up to 2 seconds (20 × 100 ms)
            time.sleep(0.1)
            cached = r.get(key)
            if cached:
                return cached
        # Fallback: fetch directly without caching (degraded mode)
        return fetch_fn(conn, key)

def get_with_probabilistic_early_refresh(key: str, fetch_fn, ttl: int = 3600,
                                         beta: float = 1.0):
    """
    XFetch / probabilistic early expiration:
    randomly recompute the value before the TTL expires so that clients do not
    all converge on the same expiry moment. Higher beta refreshes earlier.
    Stores {"_value", "_computed_at", "_delta"} together, where _delta is how
    long the last recomputation took (slower values refresh earlier).
    """
    import json, math, random

    raw = r.get(key)
    now = time.time()
    if raw:
        data = json.loads(raw)
        expiry = data["_computed_at"] + ttl
        delta = data.get("_delta", 0.0)
        # XFetch condition: refresh if now - delta * beta * log(random()) >= expiry.
        # log(random()) is negative, so the left side exceeds `now` by a random
        # margin that grows with delta and beta as expiry approaches.
        if now - delta * beta * math.log(random.random()) < expiry:
            return data["_value"]

    # Cache miss, or a probabilistic early refresh: recompute and store
    start = time.time()
    fresh = fetch_fn(key)
    delta = time.time() - start
    r.setex(key, ttl, json.dumps(
        {"_value": fresh, "_computed_at": time.time(), "_delta": delta}))
    return fresh
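
Under XFetch, the probability that a single read triggers an early refresh is exp(-remaining / (delta × beta)): near zero while the value is fresh, rising to 1 at expiry. A small standalone sketch (function names are illustrative) that isolates the refresh decision so it can be reasoned about without Redis:

```python
import math
import random

def should_refresh(remaining_ttl: float, delta: float, beta: float = 1.0) -> bool:
    """XFetch decision: refresh early if -delta * beta * log(random())
    meets or exceeds the remaining TTL."""
    return -delta * beta * math.log(random.random()) >= remaining_ttl

def refresh_probability(remaining_ttl: float, delta: float,
                        beta: float = 1.0, trials: int = 20000) -> float:
    """Estimate P(refresh) = exp(-remaining / (delta * beta)) by simulation."""
    hits = sum(should_refresh(remaining_ttl, delta, beta) for _ in range(trials))
    return hits / trials
```

At remaining_ttl equal to delta × beta the refresh probability is about e⁻¹ ≈ 0.37, and far from expiry it is effectively zero, which is why recomputations spread out instead of spiking.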

New Node Warm-up During Scale-out

def warm_new_cache_node(old_node_host: str, new_node_host: str,
                        key_pattern: str = "*", sample_rate: float = 0.1) -> int:
    """
    When adding a new Redis node, copy a sample of hot keys from an existing node.
    A full copy (MIGRATE) is too slow for large caches — copy only a sample.
    Note: GET/SETEX handles string values only; use DUMP/RESTORE for other types.
    """
    import random

    old_r = redis.Redis(host=old_node_host, decode_responses=True)
    new_r = redis.Redis(host=new_node_host, decode_responses=True)

    count = 0
    cursor = 0
    while True:
        cursor, keys = old_r.scan(cursor, match=key_pattern, count=1000)
        sampled = [k for k in keys if random.random() < sample_rate]
        for key in sampled:
            ttl = old_r.ttl(key)
            if ttl <= 0:
                continue  # skip missing keys (-2) and keys with no TTL (-1)
            value = old_r.get(key)
            if value:
                new_r.setex(key, ttl, value)
                count += 1
        if cursor == 0:
            break
    return count
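
Before routing full traffic to the freshly warmed node, its hit rate should match the production baseline. The counters come from Redis INFO stats; a small pure helper (the function name is illustrative) that would consume the dict returned by redis-py's r.info('stats'):

```python
def cache_hit_rate(stats: dict) -> float:
    """Compute the cache hit rate from Redis INFO stats fields.

    `stats` is the dict returned by redis-py's r.info('stats'), which
    contains the keyspace_hits and keyspace_misses counters.
    Returns 0.0 when no reads have been recorded yet.
    """
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else 0.0
```

Gate the traffic ramp on this value: increase the new node's share only while the hit rate stays at or above the target (e.g., 90%+).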

Key Interview Points

  • Thundering herd on cold start: If 1,000 processes simultaneously notice a cache miss for the same key, all 1,000 query the database at once. The lock-based prevention (only one process fetches, others wait) reduces this to 1 database query. The probabilistic early refresh (XFetch) solves the same problem by refreshing slightly before TTL expiry — spreading the refresh load over time rather than concentrating it at expiry.
  • What to warm: Not all keys are worth warming. Focus on: (1) top N% of keys by access frequency (Pareto: 20% of keys serve 80% of traffic); (2) keys with long computation time (JOIN-heavy queries, ML inference results); (3) keys with no fallback (cache miss = user-facing error). Identify hot keys from Redis MONITOR sampling, access logs, or application metrics (cache miss counter per key prefix).
  • Warm before traffic shift: In blue-green deployments, warm the green cache before switching DNS/load balancer. Measure cache hit rate on the green instance with canary traffic before full cutover. Target hit rate: same as production baseline (typically 95%+). Never cut over to a cold cache during peak traffic hours.
  • TTL staggering prevents mass expiry: If all product catalog entries are warmed at T=0 with TTL=3600s, they all expire at T=3600, causing a thundering herd. Add random jitter: TTL = base_ttl + random.randint(0, base_ttl // 5). This distributes expiry over a 20% window, preventing synchronized expiry storms.
  • Warm order matters: Warm in descending access frequency order — the most critical keys first. If the warming job is interrupted (crash, timeout), the cache is at least partially warm with the highest-value keys. Use a CacheWarmJob table to checkpoint progress and resume from where it stopped.
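
The TTL-staggering rule above reduces to a one-line helper; a sketch with the same 20% default window as the bullet (the name and spread parameter are illustrative):

```python
import random

def jittered_ttl(base_ttl: int, spread: float = 0.2) -> int:
    """Return base_ttl plus up to spread * base_ttl of random jitter,
    so keys warmed at the same moment do not all expire together."""
    return base_ttl + random.randint(0, int(base_ttl * spread))
```

Used as r.setex(key, jittered_ttl(3600), value), this spreads the expiry of keys warmed at T=0 across a 12-minute window instead of a single instant.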
