Search autocomplete — the suggestions that appear as you type into Google, Amazon, or YouTube — is a deceptively interesting system design problem. It looks like a simple lookup, but at scale it requires careful data structure choices, a two-tier update pipeline, aggressive caching, and thoughtful ranking. Expect this question at companies where search is core to the product.
Step 1: Clarify Requirements
- What triggers suggestions? Every keystroke (as-you-type) or on submit?
- How many suggestions? Typically 5–10.
- Ranking: Most popular globally? Personalized to the user? Location-aware?
- Freshness: How quickly must trending searches appear? (Real-time vs. hourly vs. daily)
- Scale: How many queries per second?
- Languages / Unicode: ASCII only, or full international support?
Assume: 10B searches/day on Google → ~115K QPS. Each keystroke triggers a suggestion request; average search is 5 keystrokes → 575K autocomplete requests/sec. Return top 5 suggestions ranked by global query frequency. Updates are near-real-time (trending searches appear within ~15 minutes). English only for this design.
Step 2: Back-of-Envelope
Autocomplete QPS: 575,000 requests/sec (peak ~1.5M/sec)
Latency requirement: <100ms per suggestion (users notice anything beyond a ~100ms delay)
Data size:
English vocabulary: ~170,000 words
Common search queries: ~1 billion unique queries (long tail)
Top 10M queries cover ~90% of traffic
Avg query length: 20 chars
10M queries × 20 chars = 200MB → fits in RAM
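These estimates are easy to sanity-check in a few lines (a quick sketch using the assumptions above):

```python
# Back-of-envelope check for the numbers above.
searches_per_day = 10_000_000_000            # 10B searches/day
search_qps = searches_per_day / 86_400       # seconds per day
keystrokes_per_search = 5
autocomplete_qps = search_qps * keystrokes_per_search

top_queries = 10_000_000                     # top 10M queries
avg_query_len = 20                           # chars, ~1 byte each
raw_bytes = top_queries * avg_query_len

print(round(search_qps))                     # ~115,741 search QPS
print(round(autocomplete_qps))               # ~578,704 autocomplete QPS
print(raw_bytes // 1_000_000, "MB")          # 200 MB -> fits in RAM
```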
Step 3: Core Data Structure — The Trie
A trie (prefix tree) is the classic data structure for autocomplete. Each node represents a character; each path from the root to a node marked as terminal spells out a complete query. To find suggestions for the prefix “se”, traverse to the “s” node, then the “e” node, then enumerate all descendants.
root
├── s
│   └── e
│       ├── a ── r ── c ── h   (search: freq=95000)
│       └── n ── d             (send: freq=12000)
└── p
    └── y ── t ── h ── o ── n  (python: freq=9800)
Basic lookup:
def autocomplete(prefix, top_k=5):
    node = trie_root
    for char in prefix:
        if char not in node.children:
            return []  # no matches
        node = node.children[char]
    # DFS from current node, collect all complete queries
    return get_top_k(node, top_k)
Problem: a DFS over all descendants to find the top-K is O(n) in the size of the subtree below the prefix node — potentially millions of nodes for a common prefix like “s”. Far too slow at 575K QPS.
Optimized Trie: Store Top-K at Each Node
Store the top-K results directly at each trie node. When a query’s frequency is updated, propagate changes up the tree.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_suggestions = []  # cached top-5 for this prefix

def autocomplete(prefix):
    node = traverse_to(prefix)  # walk the trie, one child hop per char
    if node is None:
        return []  # prefix not in trie
    return node.top_suggestions  # O(prefix_length) — fast
Lookup is now O(prefix length) — constant time for practical purposes. The trade-off: more memory (each node stores a list), and updates are more expensive (must update top-K at every ancestor node when a query’s frequency changes).
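A minimal sketch of that update path, assuming frequencies only ever increase (a decreased frequency that should evict a query from some node's top-K would need a subtree re-scan, which the batch rebuild described later handles):

```python
import heapq

class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_suggestions = []  # list of (freq, query), highest first
        self.freq = 0              # >0 marks a complete query

def update_frequency(root, query, freq, k=5):
    """Set a query's frequency and refresh the cached top-K
    on every ancestor node along the query's path."""
    path = [root]
    node = root
    for ch in query:
        node = node.children.setdefault(ch, TrieNode())
        path.append(node)
    node.freq = freq
    for ancestor in path:
        # Merge the updated query into this node's cached candidates.
        merged = {q: f for f, q in ancestor.top_suggestions}
        merged[query] = freq
        ancestor.top_suggestions = heapq.nlargest(
            k, ((f, q) for q, f in merged.items()))

def autocomplete(root, prefix):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [q for _, q in node.top_suggestions]
```

For example, after inserting "search" (95000) and "send" (12000), `autocomplete(root, "se")` returns `["search", "send"]` straight from the cached list, with no subtree traversal.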
Step 4: Trie vs. Alternative Approaches
The trie is the textbook answer, but production systems often use different approaches:
Prefix Hash Table
Pre-compute suggestions for every possible prefix and store them in a hash table (or Redis):
hash["s"] = ["search", "spotify", "snapchat", "samsung", "slack"]
hash["se"] = ["search", "send", "setup", "settings", "search engine"]
hash["sea"] = ["search", "seattle", "seaworld", "seal", "seafood"]
Lookup is O(1) — just a hash table get. The downside is storage: every prefix of every query must be materialized. For 10M top queries with average length 20 chars → ~200M prefix entries. At ~200 bytes each → ~40GB, which fits in a Redis cluster. Many real systems favor this approach at scale because the lookup is simpler and faster than tree traversal.
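The offline build step for this table can be sketched in a few lines (query names and counts are illustrative):

```python
import heapq
from collections import defaultdict

def build_prefix_table(query_freqs, k=5, max_prefix_len=20):
    """Pre-compute top-K suggestions for every prefix of every query."""
    candidates = defaultdict(list)  # prefix -> list of (freq, query)
    for query, freq in query_freqs.items():
        for i in range(1, min(len(query), max_prefix_len) + 1):
            candidates[query[:i]].append((freq, query))
    # Keep only the top-K per prefix, highest frequency first.
    return {p: [q for _, q in heapq.nlargest(k, cands)]
            for p, cands in candidates.items()}

freqs = {"search": 95000, "send": 12000, "settings": 30000, "python": 9800}
table = build_prefix_table(freqs)
print(table["se"])  # ['search', 'settings', 'send']
```

In production each entry would be written to Redis (e.g. key `ac:se`) rather than kept in a Python dict, but the pre-computation logic is the same.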
Elasticsearch / Search Index
Elasticsearch supports prefix queries and edge n-gram tokenization natively. Easier to operate and query with complex ranking logic (boost by recency + frequency + personalization). Trade-off: higher latency (~20–50ms) than an in-memory trie or Redis hash (~1–5ms).
Interview answer: Trie for the algorithm explanation; Redis prefix hash or Elasticsearch for the production implementation. State both and explain the trade-off.
Step 5: Two-Tier Update Pipeline
Query frequencies change constantly — trending searches must appear quickly. But rebuilding the entire trie on every query is impossible at 575K QPS.
Solution: real-time + batch pipeline.
Search queries
↓
Kafka "query_stream" topic
├─ Real-time aggregator (Apache Flink or Spark Streaming)
│ Aggregates query counts in 15-minute windows
│ Updates top-K in Redis prefix cache
│ Handles trending queries quickly
│
└─ Batch aggregator (daily Spark job)
Recomputes global frequency rankings over 30 days
Rebuilds the full trie from scratch
Pushes new trie to all autocomplete servers
Handles long-term frequency accuracy
The real-time pipeline handles trending (breaking news, viral events appearing in suggestions within minutes). The batch pipeline handles accuracy for the long tail and prevents noisy real-time data from permanently skewing rankings.
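Conceptually, the real-time side is a tumbling-window count. A minimal in-process stand-in for what the Flink job does (window handling and flushing greatly simplified):

```python
from collections import Counter

class TumblingWindowAggregator:
    """Count queries per fixed window; emit the window's counts when
    an event arrives past the window boundary (simplified sketch)."""
    def __init__(self, window_seconds=900):   # 900s = 15 minutes
        self.window_seconds = window_seconds
        self.window_start = None
        self.counts = Counter()

    def ingest(self, query, timestamp):
        if self.window_start is None:
            self.window_start = timestamp - (timestamp % self.window_seconds)
        closed = None
        if timestamp >= self.window_start + self.window_seconds:
            closed = dict(self.counts)        # window complete: flush counts
            self.counts = Counter()
            self.window_start = timestamp - (timestamp % self.window_seconds)
        self.counts[query] += 1
        return closed                         # None while the window is open
```

The flushed counts are what the real job would merge into the Redis prefix cache; a real streaming engine also handles late events, checkpointing, and parallelism, which this sketch ignores.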
Step 6: Caching Layer
At 575K RPS, even a Redis lookup needs caching at the application layer. The key insight: most autocomplete requests are for common prefixes, and they repeat constantly.
def autocomplete(prefix):
    # 1. Check local in-process LRU cache (per autocomplete server)
    result = local_cache.get(prefix)
    if result:
        return result  # sub-millisecond, no network hop

    # 2. Check Redis prefix cache
    result = redis.get(f"ac:{prefix}")
    if result:
        local_cache.set(prefix, result, ttl=60)
        return result

    # 3. Fallback: query trie service
    result = trie_service.query(prefix)
    redis.set(f"ac:{prefix}", result, ex=300)  # 5-minute TTL
    local_cache.set(prefix, result, ttl=60)
    return result
A two-level cache: in-process LRU (handles repeated keystrokes from the same server with no network call) and Redis (shared across all servers). The top-1000 prefixes account for ~80% of traffic — these stay hot in cache indefinitely.
Step 7: Handling Scale — Sharding the Trie
A single trie for 10M+ queries is ~10GB in memory. At 575K QPS, one server can’t handle it. Shard by prefix:
Shard A: prefixes starting with a–m
Shard B: prefixes starting with n–z
More fine-grained by first 2 characters:
Shard 1: aa–am
Shard 2: an–az
...
(26² = 676 possible shards → route to ~100 physical shards)
A routing layer maps the prefix to the correct shard. Each shard is replicated 3× for availability. Use consistent hashing to distribute prefix ranges across shards.
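The routing layer itself is a small function. A sketch using plain hash-modulo for brevity (production would use consistent hashing, as noted above, so adding or removing shards remaps only a fraction of prefix ranges):

```python
import hashlib

NUM_SHARDS = 100  # physical shards, per the estimate above

def shard_for_prefix(prefix):
    """Route a prefix to a shard by its first two characters, so all
    prefixes sharing those characters land on the same shard."""
    key = prefix[:2].lower()
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every "se..." lookup hits the same shard:
assert shard_for_prefix("se") == shard_for_prefix("search")
```

Hashing the two-character bucket (rather than the whole prefix) keeps a shard's data contiguous: one shard owns the entire subtree for each bucket it is assigned.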
Step 8: Personalization and Ranking
Global frequency is a baseline. Production systems layer on:
- Personal history: Queries the user has searched before rank higher.
- Location: “weather” → “weather [user’s city]” as top suggestion.
- Recency: Recent queries get a boost (exponential decay on frequency).
- Safe search filtering: Filter adult content for family accounts.
Personalization is typically applied as a re-ranking step on the server after fetching the global top-K.
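A sketch of that re-ranking step; the boost weight is illustrative, not a known production value:

```python
def rerank(suggestions, user_history, personal_boost=4.0):
    """Re-rank globally ranked (query, global_freq) pairs: queries the
    user has searched before get their score multiplied by a boost."""
    def score(item):
        query, freq = item
        return freq * (personal_boost if query in user_history else 1.0)
    return [q for q, _ in sorted(suggestions, key=score, reverse=True)]

top = [("search", 95000), ("settings", 30000), ("send", 12000)]
print(rerank(top, user_history={"settings"}))
# ['settings', 'search', 'send'] — personal history outranks global frequency
```

Location and recency signals would slot into the same `score` function as additional multiplicative or decayed terms.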
High-Level Architecture
Client (browser/app)
↓ (debounced — only fire after 100ms pause between keystrokes)
CDN / Edge Cache (caches top prefixes geographically)
↓ cache miss
Load Balancer
↓
Autocomplete Service (stateless, ~50 instances)
├─ In-process LRU cache (top prefixes)
├─ Redis cluster (prefix → suggestion list)
└─ Trie Service (sharded, for cache misses)
↑ updated by
Realtime Aggregator (Flink, 15-min windows)
↑
Kafka "query_stream"
↑
Search Service (publishes every query)
Follow-up Questions
Q: How do you handle typos / fuzzy matching?
Exact prefix matching first. For zero results, fall back to fuzzy matching using edit distance (Levenshtein). At scale, pre-compute common misspellings and store them as aliases in the trie. Elasticsearch’s fuzzy query handles this natively.
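A sketch of the fallback path using plain Levenshtein distance (in production this comparison would run only against a small candidate set, such as the top queries, never the full corpus):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def fuzzy_fallback(prefix, candidate_queries, max_distance=1, k=5):
    """When the exact prefix match is empty, suggest queries whose own
    prefix is within max_distance edits of what the user typed."""
    matches = []
    for q in candidate_queries:
        # An insertion/deletion shifts the prefix boundary by one char,
        # so compare against candidate prefixes of nearby lengths too.
        best = min(edit_distance(prefix, q[:n])
                   for n in range(max(0, len(prefix) - max_distance),
                                  len(prefix) + max_distance + 1))
        if best <= max_distance:
            matches.append(q)
    return matches[:k]

print(fuzzy_fallback("serch", ["search", "python", "send"]))  # ['search']
```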
Q: How do you filter offensive / illegal queries from suggestions?
Maintain a blocklist of banned terms. Apply at the ranking layer before returning results. Blocklist updates are pushed to all autocomplete servers via config management (no redeploy needed).
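The filtering step itself is simple; a sketch with illustrative blocklist entries:

```python
BLOCKLIST = {"badterm", "another banned phrase"}  # illustrative entries

def filter_suggestions(suggestions):
    """Drop any suggestion containing a blocked term (case-insensitive
    substring match). Runs after ranking, before the response is sent."""
    return [s for s in suggestions
            if not any(banned in s.lower() for banned in BLOCKLIST)]
```

Substring matching is deliberately aggressive; a real system would likely combine exact-term, phrase, and ML-based classifiers.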
Q: How does client-side debouncing help?
Without debouncing, typing “search” fires 6 requests (s, se, sea, sear, searc, search) in under a second. Debouncing delays the request until the user pauses typing (typically 100–200ms). This reduces load by ~60% while being imperceptible to the user.
Q: How do you scale to 1M QPS?
Serve the top-10,000 prefixes from CDN edge caches globally. These cover ~95% of all requests without hitting your servers at all.
Summary
Autocomplete at scale uses an optimized trie or prefix hash table (stored in Redis) for O(1) lookups. A two-tier pipeline handles updates: real-time aggregation (Flink) for trending queries within minutes, batch (Spark) for accurate long-term frequency data. Two levels of caching — in-process LRU and Redis — absorb the read load. Shard the trie by prefix for horizontal scale. The top 10,000 prefixes can be CDN-cached globally, absorbing the vast majority of traffic at the edge.
Related System Design Topics
- Caching Strategies — cache-aside pattern for serving top-K suggestions
- Message Queues — async update pipeline for indexing new queries
- Consistent Hashing — distributing the trie across suggestion servers
- Database Sharding — partitioning the analytics log by query prefix
- Load Balancing — spreading autocomplete traffic across trie replicas