Typeahead search (also called autocomplete) is one of the most latency-sensitive features in a modern product. Users expect suggestions to appear in under 100 ms, which means results must be pre-computed, cached aggressively, and sharded carefully. This post walks through a production-grade low-level design.
Requirements
- Return top-K (default 5) ranked suggestions for a given prefix in <100 ms p99.
- Handle 10,000 QPS at peak with personalized and global suggestions.
- Support at least 500 million distinct phrases in the global index.
- Personalization: weight the user’s own recent queries 2× over global frequency.
- Minimum prefix length: 2 characters (shorter inputs return nothing).
Core Data Structure: Trie with Embedded Top-K
A plain trie lets you find all words sharing a prefix, but traversing all descendants to rank them is O(n) per query — too slow. Instead, each trie node caches its own top-K list, maintained at index-build time.
class TrieNode:
    children: Map[char, TrieNode]
    top_k: List[SearchSuggestion]  # precomputed, sorted by score desc
    # top_k is updated during batch rebuild, not on every query
Query time becomes O(prefix_length) — you walk the trie to the prefix node, then read its top_k list directly. No traversal of descendants needed.
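A minimal sketch of that lookup path, assuming the node layout above (the `suggest` function and the dataclass fields are illustrative, not a fixed API):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SearchSuggestion:
    phrase: str
    score: float

@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    top_k: List[SearchSuggestion] = field(default_factory=list)  # sorted by score desc

def suggest(root: TrieNode, prefix: str, k: int = 5) -> List[SearchSuggestion]:
    """Walk the trie to the prefix node and return its cached top-K.

    O(len(prefix)): no descendant traversal happens at query time.
    """
    if len(prefix) < 2:           # minimum prefix length from the requirements
        return []
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:          # no indexed phrase shares this prefix
            return []
    return node.top_k[:k]
```

Note that the only per-query work beyond the walk is slicing an already-sorted list.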
Data Models
SearchSuggestion:
    phrase: String        # e.g. "machine learning"
    score: Float          # global frequency score (normalized 0–1)
    type: Enum            # GLOBAL | TRENDING | PERSONAL

UserQueryHistory:
    user_id: UUID
    phrase: String
    queried_at: Timestamp
    count: Int            # how many times this user typed this phrase
API
GET /suggest?q={prefix}&limit={k}&user_id={uid}

Response 200:
[
    {"phrase": "machine learning", "score": 0.92},
    {"phrase": "machine learning interview", "score": 0.87},
    ...
]
The user_id is optional. When absent, only global results are returned.
Ranking Formula
final_score(phrase, user) =
    global_frequency_score(phrase)           # base: log(query_count) / log(max_count)
    + recency_boost(phrase)                  # +0.1 if phrase trended in last 24 h
    + personalization_boost(phrase, user)    # user_query_count * 2 * personal_weight
Personal weight is typically 0.15 — enough to surface a user’s own history without completely overriding global popularity.
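Putting the formula and the 0.15 weight together, a direct sketch (the helper signatures are illustrative; the real system would pull `query_count` and trending flags from the aggregation pipeline):

```python
import math

PERSONAL_WEIGHT = 0.15   # surfaces a user's history without overriding popularity
RECENCY_BOOST = 0.1      # flat boost for phrases that trended in the last 24 h

def global_frequency_score(query_count: int, max_count: int) -> float:
    # Base term: log(query_count) / log(max_count), normalized to 0-1.
    if query_count < 1 or max_count <= 1:
        return 0.0
    return math.log(query_count) / math.log(max_count)

def final_score(query_count: int, max_count: int,
                trending_24h: bool, user_query_count: int) -> float:
    score = global_frequency_score(query_count, max_count)
    if trending_24h:
        score += RECENCY_BOOST
    # Personalization: the user's own query count, weighted 2x per the requirements.
    score += user_query_count * 2 * PERSONAL_WEIGHT
    return score
```

With these constants, one prior personal query adds 0.3, enough to outrank a globally popular but unvisited phrase only when the global scores are close.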
Caching Strategy
- Global prefix cache: Redis hash. Key = suggest:{prefix}, value = serialized top-K list. TTL = 60 seconds. Warmed on trie rebuild.
- Per-user personalized cache: Redis hash. Key = suggest:{user_id}:{prefix}, TTL = 30 seconds. Written lazily on first personalized request for a prefix.
- Fallback: if personalized cache misses, merge global top-K with user's recent history in-process and cache the result.
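The read path implied by those three bullets can be sketched as follows. This uses an in-memory TTL cache as a stand-in for Redis, and `merge_with_history` is an assumed hook for the in-process merge step, not a named component of the design:

```python
import json
import time
from typing import Callable, Dict, List, Optional, Tuple

class TTLCache:
    """Stand-in for Redis: string keys, serialized values, per-key TTL."""
    def __init__(self) -> None:
        self._store: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        hit = self._store.get(key)
        if hit is None or hit[1] < time.monotonic():
            return None          # miss or expired
        return hit[0]

    def setex(self, key: str, ttl: int, value: str) -> None:
        self._store[key] = (value, time.monotonic() + ttl)

def get_suggestions(cache: TTLCache, prefix: str, user_id: Optional[str],
                    merge_with_history: Callable[[str, str], List[dict]]) -> List[dict]:
    # 1. Personalized cache: key = suggest:{user_id}:{prefix}, 30 s TTL.
    if user_id:
        cached = cache.get(f"suggest:{user_id}:{prefix}")
        if cached is not None:
            return json.loads(cached)
    # 2. Global cache: key = suggest:{prefix}, 60 s TTL.
    cached = cache.get(f"suggest:{prefix}")
    if user_id is None:
        return json.loads(cached) if cached is not None else []
    # 3. Fallback: merge global top-K with user history, then cache lazily.
    merged = merge_with_history(prefix, user_id)
    cache.setex(f"suggest:{user_id}:{prefix}", 30, json.dumps(merged))
    return merged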
Trie Update Schedule
Updating the trie on every query would cause write contention across hundreds of nodes. Instead:
- Hourly batch rebuild: Spark job reads query logs from the last 24 h, aggregates phrase frequencies, rebuilds the full trie, and pushes it to all trie servers atomically via a blue/green swap.
- 5-minute hot-phrase update: A lighter job re-scores the top 10,000 phrases by recent velocity and patches only those nodes, avoiding a full rebuild.
Sharding
A single trie for 500 M phrases won’t fit in one machine’s memory (~50 GB+). Shard by the first character of the prefix (26 shards for a–z, plus one for digits/other). The query router hashes the first character and routes to the correct shard. For hot letters (e.g., “s”), further split by the first two characters using consistent hashing.
Edge Cases
- Profanity filter: Deny-list applied at index-build time; flagged phrases never enter the trie.
- Trending terms: Injected into top-K lists out-of-band by a trend-detection service that monitors query velocity spikes.
- Cold start for new users: No personal history — serve global results only.
- Unicode / CJK: Tokenize by character rather than byte; each CJK character is a valid prefix unit.
{“@context”:”https://schema.org”,”@type”:”FAQPage”,”mainEntity”:[{“@type”:”Question”,”name”:”How do you store autocomplete suggestions efficiently for fast prefix lookup?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Use a trie where each node stores the top-K (typically 5-10) suggestions for all queries passing through it. On insertion, propagate the new phrase up the trie and evict the lowest-scoring suggestion from each node's top-K list if it overflows. This means a lookup for any prefix requires only a single trie traversal to the prefix node – no further subtree scan needed. The top-K at each node is maintained as a min-heap. Storage: for a vocabulary of 1M phrases with average 8-character length, the trie has ~5M nodes; with top-10 at each node, total memory is manageable at ~500MB. Shard by first letter or consistent hash of prefix across multiple servers.”}},{“@type”:”Question”,”name”:”How do you rank autocomplete suggestions?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Ranking score = global_frequency_score + recency_boost + personalization_bonus. global_frequency_score: log(1 + query_count) normalized to 0-1 range. recency_boost: decay function on recent query counts (last 24h weighted 3x, last week 1.5x). personalization_bonus: if the user has typed or clicked this phrase before, add a fixed bonus (e.g., +0.3). Clamp total score to [0, 1]. Store scores in the trie nodes. Rebuild scores hourly from query logs using a batch job. For trending queries (last 1h spike > 5x baseline), apply a trending multiplier. Present personalized results first if available, fall back to global ranking.”}},{“@type”:”Question”,”name”:”How do you cache autocomplete responses at scale?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Two-level cache: (1) Global prefix cache: Redis sorted set per prefix, key=suggest:{prefix}, members=phrases, scores=ranking score, TTL=60s. Covers 80%+ of traffic since popular prefixes repeat constantly. 
(2) Per-user personalized cache: Redis hash per user, key=usersuggest:{user_id}:{prefix}, value=JSON list, TTL=5 minutes. Only cache if the user has enough query history to make personalization meaningful. CDN-level caching is not practical for personalized results. For cold-start prefixes not in cache, query the trie service directly and populate the cache. Evict stale entries on trie rebuild.”}},{“@type”:”Question”,”name”:”How do you update the trie with new queries in real time?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Do not update the trie on every query – at 10K QPS that would cause write contention. Instead: stream all search queries to Kafka. A batch consumer aggregates query counts over 5-minute windows and writes to a QueryFrequency table. A scheduler runs a trie rebuild job every hour: reads all phrases above a frequency threshold, computes updated scores, and builds a new trie in memory. The new trie is atomically swapped with the live trie (pointer swap, old trie garbage collected). For truly trending queries (spike detection), a real-time path runs every 5 minutes on a hot-phrases subset, patching only the changed nodes rather than full rebuild.”}},{“@type”:”Question”,”name”:”How do you handle typeahead search at 10,000 QPS with sub-100ms latency?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Architecture: stateless suggestion API servers behind a load balancer. Each server holds the full trie shard for its assigned prefix range in memory. Request routing: hash the prefix to determine the shard. Response: trie lookup is O(L) where L is prefix length – typically <10ms. Bottleneck is network + serialization. Optimizations: keep the trie in memory (not disk), use a lightweight serialization format (MessagePack, not JSON) for internal service calls, return only phrase strings and scores (no metadata). Deploy multiple replicas of each shard for redundancy. Total infrastructure: 8 shards x 3 replicas = 24 trie servers, each holding ~60MB of trie data. 
Load balancer routes by prefix hash to the correct shard.”}}]}
Google system design interviews cover search autocomplete at scale. See common questions for Google interview: search autocomplete and typeahead system design.
LinkedIn system design covers typeahead and people search. Review design patterns for LinkedIn interview: typeahead search and people search system design.
Uber system design includes location autocomplete and search. See design patterns for Uber interview: location search and autocomplete system design.