Typeahead search (also called autocomplete) is one of the most latency-sensitive features in a modern product. Users expect suggestions to appear in under 100 ms, which means results must be pre-computed, cached aggressively, and sharded carefully. This post walks through a production-grade low-level design.
Requirements
- Return top-K (default 5) ranked suggestions for a given prefix in <100 ms p99.
- Handle 10,000 QPS at peak with personalized and global suggestions.
- Support at least 500 million distinct phrases in the global index.
- Personalization: weight the user’s own recent queries 2× over global frequency.
- Minimum prefix length: 2 characters (shorter inputs return nothing).
Core Data Structure: Trie with Embedded Top-K
A plain trie lets you find all words sharing a prefix, but traversing all descendants to rank them is O(n) per query — too slow. Instead, each trie node caches its own top-K list, maintained at index-build time.
class TrieNode:
    children: Map[char, TrieNode]
    top_k: List[SearchSuggestion]  # precomputed, sorted by score desc
    # top_k is updated during batch rebuild, not on every query
Query time becomes O(prefix_length) — you walk the trie to the prefix node, then read its top_k list directly. No traversal of descendants needed.
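A minimal sketch of that lookup path, assuming the node layout above (the `suggest` function and the dataclass fields are illustrative, not a fixed API):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SearchSuggestion:
    phrase: str
    score: float

@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    top_k: List[SearchSuggestion] = field(default_factory=list)  # sorted by score desc

def suggest(root: TrieNode, prefix: str, k: int = 5) -> List[SearchSuggestion]:
    """Walk the trie to the prefix node and return its cached top-K.

    O(len(prefix)): no descendant traversal happens at query time.
    """
    if len(prefix) < 2:           # minimum prefix length from the requirements
        return []
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:          # no indexed phrase shares this prefix
            return []
    return node.top_k[:k]
```

Note that the only per-query work beyond the walk is slicing an already-sorted list.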
Data Models
SearchSuggestion:
    phrase: String        # e.g. "machine learning"
    score: Float          # global frequency score (normalized 0–1)
    type: Enum            # GLOBAL | TRENDING | PERSONAL

UserQueryHistory:
    user_id: UUID
    phrase: String
    queried_at: Timestamp
    count: Int            # how many times this user typed this phrase
API
GET /suggest?q={prefix}&limit={k}&user_id={uid}

Response 200:
[
    {"phrase": "machine learning", "score": 0.92},
    {"phrase": "machine learning interview", "score": 0.87},
    ...
]
The user_id is optional. When absent, only global results are returned.
Ranking Formula
final_score(phrase, user) =
    global_frequency_score(phrase)           # base: log(query_count) / log(max_count)
    + recency_boost(phrase)                  # +0.1 if phrase trended in last 24 h
    + personalization_boost(phrase, user)    # user_query_count * 2 * personal_weight
Personal weight is typically 0.15 — enough to surface a user’s own history without completely overriding global popularity.
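Putting the formula and the 0.15 weight together, a direct sketch (the helper signatures are illustrative; the real system would pull `query_count` and trending flags from the aggregation pipeline):

```python
import math

PERSONAL_WEIGHT = 0.15   # surfaces a user's history without overriding popularity
RECENCY_BOOST = 0.1      # flat boost for phrases that trended in the last 24 h

def global_frequency_score(query_count: int, max_count: int) -> float:
    # Base term: log(query_count) / log(max_count), normalized to 0-1.
    if query_count < 1 or max_count <= 1:
        return 0.0
    return math.log(query_count) / math.log(max_count)

def final_score(query_count: int, max_count: int,
                trending_24h: bool, user_query_count: int) -> float:
    score = global_frequency_score(query_count, max_count)
    if trending_24h:
        score += RECENCY_BOOST
    # Personalization: the user's own query count, weighted 2x per the requirements.
    score += user_query_count * 2 * PERSONAL_WEIGHT
    return score
```

With these constants, one prior personal query adds 0.3, enough to outrank a globally popular but unvisited phrase only when the global scores are close.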
Caching Strategy
- Global prefix cache: Redis hash. Key = suggest:{prefix}, value = serialized top-K list. TTL = 60 seconds. Warmed on trie rebuild.
- Per-user personalized cache: Redis hash. Key = suggest:{user_id}:{prefix}, TTL = 30 seconds. Written lazily on first personalized request for a prefix.
- Fallback: if personalized cache misses, merge global top-K with user's recent history in-process and cache the result.
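The read path implied by those three bullets can be sketched as follows. This uses an in-memory TTL cache as a stand-in for Redis, and `merge_with_history` is an assumed hook for the in-process merge step, not a named component of the design:

```python
import json
import time
from typing import Callable, Dict, List, Optional, Tuple

class TTLCache:
    """Stand-in for Redis: string keys, serialized values, per-key TTL."""
    def __init__(self) -> None:
        self._store: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        hit = self._store.get(key)
        if hit is None or hit[1] < time.monotonic():
            return None          # miss or expired
        return hit[0]

    def setex(self, key: str, ttl: int, value: str) -> None:
        self._store[key] = (value, time.monotonic() + ttl)

def get_suggestions(cache: TTLCache, prefix: str, user_id: Optional[str],
                    merge_with_history: Callable[[str, str], List[dict]]) -> List[dict]:
    # 1. Personalized cache: key = suggest:{user_id}:{prefix}, 30 s TTL.
    if user_id:
        cached = cache.get(f"suggest:{user_id}:{prefix}")
        if cached is not None:
            return json.loads(cached)
    # 2. Global cache: key = suggest:{prefix}, 60 s TTL.
    cached = cache.get(f"suggest:{prefix}")
    if user_id is None:
        return json.loads(cached) if cached is not None else []
    # 3. Fallback: merge global top-K with user history, then cache lazily.
    merged = merge_with_history(prefix, user_id)
    cache.setex(f"suggest:{user_id}:{prefix}", 30, json.dumps(merged))
    return merged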
Trie Update Schedule
Updating the trie on every query would cause write contention across hundreds of nodes. Instead:
- Hourly batch rebuild: Spark job reads query logs from the last 24 h, aggregates phrase frequencies, rebuilds the full trie, and pushes it to all trie servers atomically via a blue/green swap.
- 5-minute hot-phrase update: A lighter job re-scores the top 10,000 phrases by recent velocity and patches only those nodes, avoiding a full rebuild.
Sharding
A single trie for 500 M phrases won’t fit in one machine’s memory (~50 GB+). Shard by the first character of the prefix (26 shards for a–z, plus one for digits/other). The query router hashes the first character and routes to the correct shard. For hot letters (e.g., “s”), further split by the first two characters using consistent hashing.
Edge Cases
- Profanity filter: Deny-list applied at index-build time; flagged phrases never enter the trie.
- Trending terms: Injected into top-K lists out-of-band by a trend-detection service that monitors query velocity spikes.
- Cold start for new users: No personal history — serve global results only.
- Unicode / CJK: Tokenize by character rather than byte; each CJK character is a valid prefix unit.
{“@context”:”https://schema.org”,”@type”:”FAQPage”,”mainEntity”:[{“@type”:”Question”,”name”:”How do you store autocomplete suggestions efficiently for fast prefix lookup?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Use a trie where each node stores the top-K (typically 5-10) suggestions for all queries passing through it. On insertion, propagate the new phrase up the trie and evict the lowest-scoring suggestion from each node's top-K list if it overflows. This means a lookup for any prefix requires only a single trie traversal to the prefix node – no further subtree scan needed. The top-K at each node is maintained as a min-heap. Storage: for a vocabulary of 1M phrases with average 8-character length, the trie has ~5M nodes; with top-10 at each node, total memory is manageable at ~500MB. Shard by first letter or consistent hash of prefix across multiple servers.”}},{“@type”:”Question”,”name”:”How do you rank autocomplete suggestions?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Ranking score = global_frequency_score + recency_boost + personalization_bonus. global_frequency_score: log(1 + query_count) normalized to 0-1 range. recency_boost: decay function on recent query counts (last 24h weighted 3x, last week 1.5x). personalization_bonus: if the user has typed or clicked this phrase before, add a fixed bonus (e.g., +0.3). Clamp total score to [0, 1]. Store scores in the trie nodes. Rebuild scores hourly from query logs using a batch job. For trending queries (last 1h spike > 5x baseline), apply a trending multiplier. Present personalized results first if available, fall back to global ranking.”}},{“@type”:”Question”,”name”:”How do you cache autocomplete responses at scale?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Two-level cache: (1) Global prefix cache: Redis sorted set per prefix, key=suggest:{prefix}, members=phrases, scores=ranking score, TTL=60s. Covers 80%+ of traffic since popular prefixes repeat constantly. 
(2) Per-user personalized cache: Redis hash per user, key=usersuggest:{user_id}:{prefix}, value=JSON list, TTL=5 minutes. Only cache if the user has enough query history to make personalization meaningful. CDN-level caching is not practical for personalized results. For cold-start prefixes not in cache, query the trie service directly and populate the cache. Evict stale entries on trie rebuild.”}},{“@type”:”Question”,”name”:”How do you update the trie with new queries in real time?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Do not update the trie on every query – at 10K QPS that would cause write contention. Instead: stream all search queries to Kafka. A batch consumer aggregates query counts over 5-minute windows and writes to a QueryFrequency table. A scheduler runs a trie rebuild job every hour: reads all phrases above a frequency threshold, computes updated scores, and builds a new trie in memory. The new trie is atomically swapped with the live trie (pointer swap, old trie garbage collected). For truly trending queries (spike detection), a real-time path runs every 5 minutes on a hot-phrases subset, patching only the changed nodes rather than full rebuild.”}},{“@type”:”Question”,”name”:”How do you handle typeahead search at 10,000 QPS with sub-100ms latency?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Architecture: stateless suggestion API servers behind a load balancer. Each server holds the full trie shard for its assigned prefix range in memory. Request routing: hash the prefix to determine the shard. Response: trie lookup is O(L) where L is prefix length – typically <10ms. Bottleneck is network + serialization. Optimizations: keep the trie in memory (not disk), use a lightweight serialization format (MessagePack, not JSON) for internal service calls, return only phrase strings and scores (no metadata). Deploy multiple replicas of each shard for redundancy. Total infrastructure: 8 shards x 3 replicas = 24 trie servers, each holding ~60MB of trie data. 
Load balancer routes by prefix hash to the correct shard.”}}]}
Google system design interviews cover search autocomplete at scale. See common questions for Google interview: search autocomplete and typeahead system design.
LinkedIn system design covers typeahead and people search. Review design patterns for LinkedIn interview: typeahead search and people search system design.
Uber system design includes location autocomplete and search. See design patterns for Uber interview: location search and autocomplete system design.