Search autocomplete (typeahead) suggests query completions as the user types, reducing keystrokes and guiding users toward popular or relevant searches. Google processes billions of autocomplete requests per day with sub-50ms latency. This guide covers the architecture of a production autocomplete system — from data collection to real-time suggestion serving — a frequently asked system design interview question.
Requirements and Scale
Functional: as the user types each character, return the top 5-10 matching suggestions ranked by relevance. Support prefix matching (typing “how to” returns “how to tie a tie”, “how to cook rice”). Non-functional: latency under 50ms (suggestions must appear faster than the user types), high availability (autocomplete is on the critical path of every search), and scale to 100,000+ requests per second. Data sources: (1) Historical query logs — the most-searched queries weighted by frequency. “weather” is suggested before “weatherization” because millions more people search for it. (2) Trending queries — recently spiking queries (breaking news, viral events). “election results” surges on election day. (3) Personalized history — the user’s own past searches. (4) Entity data — product names, city names, and celebrity names from a knowledge base. Estimation: 5 billion searches per day, with an average of 4 characters typed before selecting a suggestion = 20 billion autocomplete requests per day ≈ 230,000 requests per second. Each request must return in under 50ms.
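The traffic estimate works out as follows (a minimal arithmetic sketch using the numbers stated above):

```python
# Back-of-the-envelope traffic estimate from the stated assumptions.
searches_per_day = 5_000_000_000
chars_per_query = 4  # average keystrokes typed before selecting a suggestion

requests_per_day = searches_per_day * chars_per_query      # 20 billion
requests_per_second = requests_per_day / 86_400            # seconds per day

print(f"{requests_per_day:,} requests/day, about {requests_per_second:,.0f} req/s")
```

The exact figure is about 231,000 requests per second at peak-average, which is why the text rounds to 230,000.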
Trie-Based Approach
A trie (prefix tree) is the natural data structure for prefix matching. Each node represents a character; the path from root to a node represents a prefix. Store the top-K suggestions at each node (pre-computed). Lookup: traverse the trie following the input prefix. At the final node, return the stored top-K suggestions. Time: O(L) where L is the prefix length — independent of the total number of queries. Building the trie: (1) Aggregate query logs to compute frequency per query. (2) Insert each query into the trie. (3) At each node, compute the top-K suggestions from all queries passing through that node (using a heap or sorting). Store these suggestions at the node. Updating: do not rebuild the entire trie for every new query. Use offline batch updates: periodically (hourly or daily) rebuild the trie from updated query logs. For trending queries, maintain a separate small trie updated in real time. Merge results from both tries at query time. Memory: a trie for 100 million unique queries uses approximately 10-50 GB depending on implementation. This fits in memory on a single high-memory server, or can be sharded by prefix (a-m on shard 1, n-z on shard 2).
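The build-and-lookup steps above can be sketched as follows. This is a minimal in-memory version; the class and method names are illustrative, and a production trie would use a compressed representation to hit the memory figures cited.

```python
import heapq

class TrieNode:
    __slots__ = ("children", "top_k")
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.top_k = []     # pre-computed [(frequency, query)] for this prefix

class AutocompleteTrie:
    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k

    def build(self, query_freqs):
        """query_freqs: dict of query -> aggregated frequency from the logs."""
        for query, freq in query_freqs.items():
            node = self.root
            for ch in query:
                node = node.children.setdefault(ch, TrieNode())
                node.top_k.append((freq, query))  # query passes through this node
        # Trim each node's candidate list down to the K most frequent.
        stack = [self.root]
        while stack:
            node = stack.pop()
            node.top_k = heapq.nlargest(self.k, node.top_k)
            stack.extend(node.children.values())

    def suggest(self, prefix):
        """O(L) lookup: walk the prefix, return the pre-computed top-K."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [query for _, query in node.top_k]
```

Usage: `AutocompleteTrie(k=5).build({"how to tie a tie": 100, ...})`, then `suggest("how t")` returns the stored suggestions without scanning the full query set.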
Elasticsearch-Based Approach
For simpler implementation, use Elasticsearch with a completion suggester or edge ngram analyzer. Completion suggester: Elasticsearch builds an in-memory FST (Finite State Transducer) optimized for prefix lookups. Index queries with their weights. At query time, the completion suggester returns the top-K matches in under 5ms. Edge ngram approach: index each query with an edge ngram analyzer that tokenizes “how to cook” into [“h”, “ho”, “how”, “how “, “how t”, “how to”, …]. A prefix search becomes a standard term query. Less memory-efficient than the completion suggester but supports fuzzy matching and more complex scoring. Advantages over a custom trie: Elasticsearch handles distribution, replication, and failover. No custom sharding logic. Built-in relevance scoring (BM25 + custom boosting). Easy to update (index new documents). Disadvantage: slightly higher latency than an in-memory trie (10-30ms vs 1-5ms). For most applications, Elasticsearch is sufficient. Build a custom trie only when you need sub-5ms latency at extreme scale (Google, Bing).
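As a sketch, the request bodies for the completion-suggester approach look like the following (shown as Python dicts matching the Elasticsearch REST API; the field name `suggest` and the weights are illustrative assumptions):

```python
# Index mapping: a "completion" field builds the in-memory FST.
mapping = {
    "mappings": {
        "properties": {
            "suggest": {"type": "completion"}
        }
    }
}

# Indexing a query with a weight (its aggregated search frequency).
doc = {"suggest": {"input": "how to cook rice", "weight": 80}}

# Prefix lookup: returns the top-K completions ordered by weight.
suggest_query = {
    "suggest": {
        "query-suggest": {
            "prefix": "how t",
            "completion": {"field": "suggest", "size": 10}
        }
    }
}
```

These bodies would be sent via the standard index-creation, document-index, and `_search` endpoints respectively.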
Ranking Suggestions
Not all prefix matches are equally relevant. Ranking factors: (1) Query frequency — more popular queries rank higher. “python tutorial” ranks above “python terrarium” for prefix “python t”. (2) Recency — recently trending queries get a boost. Use an exponential decay: score = frequency * decay^(days_since_last_search). Recent queries have higher scores. (3) Personalization — the user’s own search history. If the user frequently searches for “python pandas,” boost “python pandas” for prefix “python p” for that specific user. Store recent queries per user in Redis. (4) Freshness for trending — detect query frequency spikes using a sliding window. If “earthquake” searches increased 10x in the last hour, boost it. (5) Query quality — filter out offensive, misspelled, or low-quality queries. Maintain a blocklist. Run queries through a spell-checker before adding to the suggestion index. Combined score: base_score = log(frequency) * recency_decay + trending_boost + personalization_boost. Return the top-K by combined score. The ranking model can be as simple as the formula above or as complex as an ML model trained on click-through data (which suggestion did the user select?).
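The combined-score formula can be written directly. A minimal sketch: the decay base of 0.95 and the example boost values are illustrative choices, not from the formula itself.

```python
import math

def suggestion_score(frequency, days_since_last_search,
                     trending_boost=0.0, personalization_boost=0.0,
                     decay=0.95):
    """base_score = log(frequency) * recency_decay + trending + personalization."""
    recency_decay = decay ** days_since_last_search
    return (math.log(frequency) * recency_decay
            + trending_boost + personalization_boost)

def rank_top_k(candidates, k=5):
    """candidates: list of (query, score_kwargs); returns top-k queries."""
    scored = [(suggestion_score(**kwargs), query) for query, kwargs in candidates]
    return [query for _, query in sorted(scored, reverse=True)[:k]]
```

For example, a low-frequency but sharply trending query can outrank a high-frequency stale one once its trending boost exceeds the frequency gap on the log scale.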
Client-Side Optimization
The client (browser or mobile app) must minimize requests and latency: (1) Debouncing — do not send a request for every keystroke. Wait 100-200ms after the user stops typing before sending the request. If the user types “how” in 150ms, send one request for “how” instead of three requests for “h”, “ho”, “how”. This reduces request volume by 60-80%. (2) Client-side caching — cache suggestions by prefix. If the user types “how” and gets suggestions, then types “how t”, the client can filter the cached “how” results locally for prefixes that start with “how t” (if the cached results include them). Only send a new request if the cached results are insufficient. (3) Pre-fetching — after receiving suggestions for “how”, pre-fetch suggestions for “how ” (with a space) in the background. When the user types the space, suggestions appear instantly. (4) Request cancellation — if the user types faster than the response arrives, cancel the in-flight request and send a new one for the latest prefix. Stale responses are useless. (5) Minimum prefix length — do not send autocomplete requests for prefixes shorter than 2-3 characters. Single-character prefixes return too many generic suggestions and waste server resources.
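The debouncing logic above can be sketched as follows. Shown in Python for consistency with the other examples, though a real client would implement this in JavaScript or native mobile code; the class name is illustrative.

```python
import threading

class Debouncer:
    """Fire the request only after the user has stopped typing for `wait` seconds."""
    def __init__(self, wait, send_request):
        self.wait = wait                  # e.g. 0.1-0.2s, per the guidance above
        self.send_request = send_request  # callback taking the prefix
        self._timer = None
        self._lock = threading.Lock()

    def on_keystroke(self, prefix):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()  # newer keystroke: drop the pending request
            self._timer = threading.Timer(self.wait, self.send_request,
                                          args=(prefix,))
            self._timer.start()
```

Typing “h”, “ho”, “how” within the wait window cancels the first two pending timers, so only one request for “how” is actually sent.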
System Architecture
Components: (1) Query collection service — logs every search query with timestamp and user ID to Kafka. (2) Aggregation pipeline — a Spark/Flink job aggregates query logs: computes per-query frequency, detects trending queries, and builds the suggestion dataset. Runs hourly for frequency updates, every 5 minutes for trending detection. (3) Suggestion index — the trie or Elasticsearch index serving autocomplete queries. Updated from the aggregation pipeline output. For a trie: build a new trie offline and swap atomically (blue-green). For Elasticsearch: index new documents incrementally. (4) Serving layer — stateless API servers that receive prefix queries, look up the suggestion index, apply personalization (merge with the user’s recent queries from Redis), and return ranked suggestions. (5) CDN/edge caching — cache popular prefix responses at the CDN edge. “how to” is searched millions of times; cache the response. Personalized suggestions cannot be cached at the CDN (they differ per user), but non-personalized base suggestions can. Sharding: shard the suggestion index by prefix range. Route requests to the correct shard based on the first 1-2 characters. Each shard handles a subset of the alphabet.
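Prefix-range routing can be sketched as below. The two-shard split and shard names are illustrative; a production system would balance ranges by traffic (the “h” and “s” prefixes carry far more load than “x” or “z”), not by alphabet position.

```python
def shard_for_prefix(prefix, shards):
    """Route an autocomplete request to a shard by its first character."""
    first = prefix[0].lower()
    for shard_id, (lo, hi) in shards.items():
        if lo <= first <= hi:
            return shard_id
    return "default"  # digits, punctuation, non-Latin scripts

# Illustrative two-shard layout from the text: a-m and n-z.
shards = {"shard-1": ("a", "m"), "shard-2": ("n", "z")}
```

The serving layer runs this lookup before forwarding the request, so each shard holds only its slice of the trie or index.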