Search autocomplete — the suggestions that appear as you type into Google, Amazon, or YouTube — is a deceptively interesting system design problem. It looks like a simple lookup, but at scale it requires careful data structure choices, a two-tier update pipeline, aggressive caching, and thoughtful ranking. Expect this question at companies where search is core to the product.
Step 1: Clarify Requirements
- What triggers suggestions? Every keystroke (as-you-type) or on submit?
- How many suggestions? Typically 5–10.
- Ranking: Most popular globally? Personalized to the user? Location-aware?
- Freshness: How quickly must trending searches appear? (Real-time vs. hourly vs. daily)
- Scale: How many queries per second?
- Languages / Unicode: ASCII only, or full international support?
Assume: 10B searches/day on Google → ~115K QPS. Each keystroke triggers a suggestion request; average search is 5 keystrokes → 575K autocomplete requests/sec. Return top 5 suggestions ranked by global query frequency. Updates are near-real-time (trending searches appear within ~15 minutes). English only for this design.
Step 2: Back-of-Envelope
Autocomplete QPS: 575,000 requests/sec (peak ~1.5M/sec)
Latency requirement: <100ms per suggestion (users notice anything beyond a ~100ms delay)
Data size:
English vocabulary: ~170,000 words
Common search queries: ~1 billion unique queries (long tail)
Top 10M queries cover ~90% of traffic
Avg query length: 20 chars
10M queries × 20 chars = 200MB → fits in RAM
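These estimates are easy to sanity-check in a few lines (a quick sketch using the assumptions above):

```python
# Back-of-envelope check for the numbers above.
searches_per_day = 10_000_000_000            # 10B searches/day
search_qps = searches_per_day / 86_400       # seconds per day
keystrokes_per_search = 5
autocomplete_qps = search_qps * keystrokes_per_search

top_queries = 10_000_000                     # top 10M queries
avg_query_len = 20                           # chars, ~1 byte each
raw_bytes = top_queries * avg_query_len

print(round(search_qps))                     # ~115,741 search QPS
print(round(autocomplete_qps))               # ~578,704 autocomplete QPS
print(raw_bytes // 1_000_000, "MB")          # 200 MB -> fits in RAM
```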
Step 3: Core Data Structure — The Trie
A trie (prefix tree) is the classic data structure for autocomplete. Each node represents a character; each path from the root to a node marked as terminal spells out a complete query. To find suggestions for the prefix “se”, traverse to the “s” node, then the “e” node, then enumerate all descendants.
root
├── s
│   └── e
│       ├── a ── r ── c ── h   (search: freq=95000)
│       └── n ── d             (send: freq=12000)
└── p
    └── y ── t ── h ── o ── n  (python: freq=9800)
Basic lookup:
def autocomplete(prefix, top_k=5):
    node = trie_root
    for char in prefix:
        if char not in node.children:
            return []  # no matches
        node = node.children[char]
    # DFS from current node, collect all complete queries
    return get_top_k(node, top_k)
Problem: a DFS over all descendants to find the top-K is O(n) in the size of the subtree below the prefix node — potentially millions of nodes for a common prefix like “s”. Far too slow at 575K QPS.
Optimized Trie: Store Top-K at Each Node
Store the top-K results directly at each trie node. When a query’s frequency is updated, propagate changes up the tree.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_suggestions = []  # cached top-5 for this prefix

def autocomplete(prefix):
    node = traverse_to(prefix)  # walk the trie, one child hop per char
    if node is None:
        return []  # prefix not in trie
    return node.top_suggestions  # O(prefix_length) — fast
Lookup is now O(prefix length) — constant time for practical purposes. The trade-off: more memory (each node stores a list), and updates are more expensive (must update top-K at every ancestor node when a query’s frequency changes).
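A minimal sketch of that update path, assuming frequencies only ever increase (a decreased frequency that should evict a query from some node's top-K would need a subtree re-scan, which the batch rebuild described later handles):

```python
import heapq

class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_suggestions = []  # list of (freq, query), highest first
        self.freq = 0              # >0 marks a complete query

def update_frequency(root, query, freq, k=5):
    """Set a query's frequency and refresh the cached top-K
    on every ancestor node along the query's path."""
    path = [root]
    node = root
    for ch in query:
        node = node.children.setdefault(ch, TrieNode())
        path.append(node)
    node.freq = freq
    for ancestor in path:
        # Merge the updated query into this node's cached candidates.
        merged = {q: f for f, q in ancestor.top_suggestions}
        merged[query] = freq
        ancestor.top_suggestions = heapq.nlargest(
            k, ((f, q) for q, f in merged.items()))

def autocomplete(root, prefix):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [q for _, q in node.top_suggestions]
```

For example, after inserting "search" (95000) and "send" (12000), `autocomplete(root, "se")` returns `["search", "send"]` straight from the cached list, with no subtree traversal.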
Step 4: Trie vs. Alternative Approaches
The trie is the textbook answer, but production systems often use different approaches:
Prefix Hash Table
Pre-compute suggestions for every possible prefix and store them in a hash table (or Redis):
hash["s"] = ["search", "spotify", "snapchat", "samsung", "slack"]
hash["se"] = ["search", "send", "setup", "settings", "search engine"]
hash["sea"] = ["search", "seattle", "seaworld", "seal", "seafood"]
Lookup is O(1) — just a hash table get. The downside is storage: every prefix of every query must be materialized. For 10M top queries with average length 20 chars → ~200M prefix entries. At ~200 bytes each → ~40GB, which fits in a Redis cluster. Many real systems favor this approach at scale because the lookup is simpler and faster than tree traversal.
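The offline build step for this table can be sketched in a few lines (query names and counts are illustrative):

```python
import heapq
from collections import defaultdict

def build_prefix_table(query_freqs, k=5, max_prefix_len=20):
    """Pre-compute top-K suggestions for every prefix of every query."""
    candidates = defaultdict(list)  # prefix -> list of (freq, query)
    for query, freq in query_freqs.items():
        for i in range(1, min(len(query), max_prefix_len) + 1):
            candidates[query[:i]].append((freq, query))
    # Keep only the top-K per prefix, highest frequency first.
    return {p: [q for _, q in heapq.nlargest(k, cands)]
            for p, cands in candidates.items()}

freqs = {"search": 95000, "send": 12000, "settings": 30000, "python": 9800}
table = build_prefix_table(freqs)
print(table["se"])  # ['search', 'settings', 'send']
```

In production each entry would be written to Redis (e.g. key `ac:se`) rather than kept in a Python dict, but the pre-computation logic is the same.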
Elasticsearch / Search Index
Elasticsearch supports prefix queries and edge n-gram tokenization natively. Easier to operate and query with complex ranking logic (boost by recency + frequency + personalization). Trade-off: higher latency (~20–50ms) than an in-memory trie or Redis hash (~1–5ms).
Interview answer: Trie for the algorithm explanation; Redis prefix hash or Elasticsearch for the production implementation. State both and explain the trade-off.
Step 5: Two-Tier Update Pipeline
Query frequencies change constantly — trending searches must appear quickly. But rebuilding the entire trie on every query is impossible at 575K QPS.
Solution: real-time + batch pipeline.
Search queries
↓
Kafka "query_stream" topic
├─ Real-time aggregator (Apache Flink or Spark Streaming)
│ Aggregates query counts in 15-minute windows
│ Updates top-K in Redis prefix cache
│ Handles trending queries quickly
│
└─ Batch aggregator (daily Spark job)
Recomputes global frequency rankings over 30 days
Rebuilds the full trie from scratch
Pushes new trie to all autocomplete servers
Handles long-term frequency accuracy
The real-time pipeline handles trending (breaking news, viral events appearing in suggestions within minutes). The batch pipeline handles accuracy for the long tail and prevents noisy real-time data from permanently skewing rankings.
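Conceptually, the real-time side is a tumbling-window count. A minimal in-process stand-in for what the Flink job does (window handling and flushing greatly simplified):

```python
from collections import Counter

class TumblingWindowAggregator:
    """Count queries per fixed window; emit the window's counts when
    an event arrives past the window boundary (simplified sketch)."""
    def __init__(self, window_seconds=900):   # 900s = 15 minutes
        self.window_seconds = window_seconds
        self.window_start = None
        self.counts = Counter()

    def ingest(self, query, timestamp):
        if self.window_start is None:
            self.window_start = timestamp - (timestamp % self.window_seconds)
        closed = None
        if timestamp >= self.window_start + self.window_seconds:
            closed = dict(self.counts)        # window complete: flush counts
            self.counts = Counter()
            self.window_start = timestamp - (timestamp % self.window_seconds)
        self.counts[query] += 1
        return closed                         # None while the window is open
```

The flushed counts are what the real job would merge into the Redis prefix cache; a real streaming engine also handles late events, checkpointing, and parallelism, which this sketch ignores.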
Step 6: Caching Layer
At 575K RPS, even a Redis lookup needs caching at the application layer. The key insight: most autocomplete requests are for common prefixes, and they repeat constantly.
def autocomplete(prefix):
    # 1. Check local in-process LRU cache (per autocomplete server)
    result = local_cache.get(prefix)
    if result:
        return result  # sub-millisecond, no network hop

    # 2. Check Redis prefix cache
    result = redis.get(f"ac:{prefix}")
    if result:
        local_cache.set(prefix, result, ttl=60)
        return result

    # 3. Fallback: query trie service
    result = trie_service.query(prefix)
    redis.set(f"ac:{prefix}", result, ex=300)  # 5-minute TTL
    local_cache.set(prefix, result, ttl=60)
    return result
A two-level cache: in-process LRU (handles repeated keystrokes from the same server with no network call) and Redis (shared across all servers). The top-1000 prefixes account for ~80% of traffic — these stay hot in cache indefinitely.
Step 7: Handling Scale — Sharding the Trie
A single trie for 10M+ queries is ~10GB in memory. At 575K QPS, one server can’t handle it. Shard by prefix:
Shard A: prefixes starting with a–m
Shard B: prefixes starting with n–z
More fine-grained by first 2 characters:
Shard 1: aa–am
Shard 2: an–az
...
(26² = 676 possible shards → route to ~100 physical shards)
A routing layer maps the prefix to the correct shard. Each shard is replicated 3× for availability. Use consistent hashing to distribute prefix ranges across shards.
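The routing layer itself is a small function. A sketch using plain hash-modulo for brevity (production would use consistent hashing, as noted above, so adding or removing shards remaps only a fraction of prefix ranges):

```python
import hashlib

NUM_SHARDS = 100  # physical shards, per the estimate above

def shard_for_prefix(prefix):
    """Route a prefix to a shard by its first two characters, so all
    prefixes sharing those characters land on the same shard."""
    key = prefix[:2].lower()
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every "se..." lookup hits the same shard:
assert shard_for_prefix("se") == shard_for_prefix("search")
```

Hashing the two-character bucket (rather than the whole prefix) keeps a shard's data contiguous: one shard owns the entire subtree for each bucket it is assigned.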
Step 8: Personalization and Ranking
Global frequency is a baseline. Production systems layer on:
- Personal history: Queries the user has searched before rank higher.
- Location: “weather” → “weather [user’s city]” as top suggestion.
- Recency: Recent queries get a boost (exponential decay on frequency).
- Safe search filtering: Filter adult content for family accounts.
Personalization is typically applied as a re-ranking step on the server after fetching the global top-K.
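A sketch of that re-ranking step; the boost weight is illustrative, not a known production value:

```python
def rerank(suggestions, user_history, personal_boost=4.0):
    """Re-rank globally ranked (query, global_freq) pairs: queries the
    user has searched before get their score multiplied by a boost."""
    def score(item):
        query, freq = item
        return freq * (personal_boost if query in user_history else 1.0)
    return [q for q, _ in sorted(suggestions, key=score, reverse=True)]

top = [("search", 95000), ("settings", 30000), ("send", 12000)]
print(rerank(top, user_history={"settings"}))
# ['settings', 'search', 'send'] — personal history outranks global frequency
```

Location and recency signals would slot into the same `score` function as additional multiplicative or decayed terms.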
High-Level Architecture
Client (browser/app)
↓ (debounced — only fire after 100ms pause between keystrokes)
CDN / Edge Cache (caches top prefixes geographically)
↓ cache miss
Load Balancer
↓
Autocomplete Service (stateless, ~50 instances)
├─ In-process LRU cache (top prefixes)
├─ Redis cluster (prefix → suggestion list)
└─ Trie Service (sharded, for cache misses)
↑ updated by
Realtime Aggregator (Flink, 15-min windows)
↑
Kafka "query_stream"
↑
Search Service (publishes every query)
Follow-up Questions
Q: How do you handle typos / fuzzy matching?
Exact prefix matching first. For zero results, fall back to fuzzy matching using edit distance (Levenshtein). At scale, pre-compute common misspellings and store them as aliases in the trie. Elasticsearch’s fuzzy query handles this natively.
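A sketch of the fallback path using plain Levenshtein distance (in production this comparison would run only against a small candidate set, such as the top queries, never the full corpus):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def fuzzy_fallback(prefix, candidate_queries, max_distance=1, k=5):
    """When the exact prefix match is empty, suggest queries whose own
    prefix is within max_distance edits of what the user typed."""
    matches = []
    for q in candidate_queries:
        # An insertion/deletion shifts the prefix boundary by one char,
        # so compare against candidate prefixes of nearby lengths too.
        best = min(edit_distance(prefix, q[:n])
                   for n in range(max(0, len(prefix) - max_distance),
                                  len(prefix) + max_distance + 1))
        if best <= max_distance:
            matches.append(q)
    return matches[:k]

print(fuzzy_fallback("serch", ["search", "python", "send"]))  # ['search']
```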
Q: How do you filter offensive / illegal queries from suggestions?
Maintain a blocklist of banned terms. Apply at the ranking layer before returning results. Blocklist updates are pushed to all autocomplete servers via config management (no redeploy needed).
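The filtering step itself is simple; a sketch with illustrative blocklist entries:

```python
BLOCKLIST = {"badterm", "another banned phrase"}  # illustrative entries

def filter_suggestions(suggestions):
    """Drop any suggestion containing a blocked term (case-insensitive
    substring match). Runs after ranking, before the response is sent."""
    return [s for s in suggestions
            if not any(banned in s.lower() for banned in BLOCKLIST)]
```

Substring matching is deliberately aggressive; a real system would likely combine exact-term, phrase, and ML-based classifiers.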
Q: How does client-side debouncing help?
Without debouncing, typing “search” fires 6 requests (s, se, sea, sear, searc, search) in under a second. Debouncing delays the request until the user pauses typing (typically 100–200ms). This reduces load by ~60% while being imperceptible to the user.
Q: How do you scale to 1M QPS?
Serve the top-10,000 prefixes from CDN edge caches globally. These cover ~95% of all requests without hitting your servers at all.
Summary
Autocomplete at scale uses an optimized trie or prefix hash table (stored in Redis) for O(1) lookups. A two-tier pipeline handles updates: real-time aggregation (Flink) for trending queries within minutes, batch (Spark) for accurate long-term frequency data. Two levels of caching — in-process LRU and Redis — absorb the read load. Shard the trie by prefix for horizontal scale. The top 10,000 prefixes can be CDN-cached globally, absorbing the vast majority of traffic at the edge.
Related System Design Topics
- Caching Strategies — cache-aside pattern for serving top-K suggestions
- Message Queues — async update pipeline for indexing new queries
- Consistent Hashing — distributing the trie across suggestion servers
- Database Sharding — partitioning the analytics log by query prefix
- Load Balancing — spreading autocomplete traffic across trie replicas