Search autocomplete is a latency-critical, high-read-volume system that combines data structures, probabilistic algorithms, and multi-layer caching to return relevant suggestions within 200ms of a keypress. This design goes deeper than surface-level trie discussion into the production engineering that makes it work.
Prefix Tree at Scale
A trie is the canonical data structure for prefix lookup, but naive in-memory tries don’t scale to billions of queries. The production approach: each trie node stores the top-K (e.g., top-100) suggestions with their scores rather than the full subtree. This caps memory per node and makes retrieval O(prefix_length) instead of requiring traversal of the full subtree. The trie is serialized as a compact binary format (not JSON) — each node encoded as a fixed-size record with child pointers and an offset into a suggestion list array. The entire structure is memory-mapped, enabling fast random access without deserialization overhead. Sharding by 2-character prefix distributes the trie across machines: queries for "ap…" route to the shard owning the "ap" prefix range. Each shard is served from a dedicated in-memory cache (Redis or a custom service), with the full trie snapshot loaded on startup from object storage and updated incrementally as query data arrives.
Real-Time Trending
Trending queries must be detected in near-real-time and promoted above baseline frequency ranking. A Count-Min Sketch (CMS) tracks query frequency in a sliding 1-hour window with O(1) update and O(1) query time, using a fixed memory footprint regardless of query volume. The CMS is a 2D array of counters; each query hashes to one cell per row and increments it. Frequency estimate is the minimum across all rows (hence "min"), which bounds overcounting from hash collisions. Query events are consumed from a Kafka topic by a streaming processor (Flink or Spark Streaming). The top-K trending queries are maintained in a min-heap of size K, updated on each event: if the estimated frequency exceeds the heap minimum, the old minimum is evicted and the new query inserted. The resulting trending list is pushed to the autocomplete serving layer every 30–60 seconds, where trending queries receive a score boost that can surface them above higher-frequency but stale suggestions.
Spell Correction
Spell correction uses Levenshtein edit distance to find the closest valid query in the dictionary. The DP recurrence computes edit distance between two strings in O(m*n) time where m and n are string lengths. Running this against a full dictionary on every keystroke is infeasible; the approach is to first retrieve a small candidate set and then compute exact distance only against candidates. A BK-tree (Burkhard-Keller tree) indexes the dictionary using edit distance as the metric; queries for all terms within edit distance 2 execute in O(log N) average time. SymSpell is an alternative: pre-compute all deletions up to max edit distance for every dictionary term (stored in a hash map), then look up deletions of the query to find candidates in near-O(1). In practice, restricting to edit distance 1–2 covers the vast majority of typos while keeping candidate sets small. The corrected query is presented as a "Did you mean X?" suggestion ranked below exact prefix matches but above no-result fallbacks.
Query Segmentation
Multi-word queries require segmentation before trie lookup. The input string is tokenized and each token is matched independently in the trie, then suggestions are combined and re-ranked. The challenge is entity recognition: "new york times" should be treated as a single entity, not three separate tokens. An n-gram language model trained on the query log scores candidate segment boundaries — it assigns probability to sequences of tokens, and the segmentation that maximizes joint probability is selected using dynamic programming. Common named entities (cities, brands, people) are pre-loaded into an entity dictionary that overrides the language model when a known entity is detected. This allows "new york" to be looked up as a single prefix entry in the trie while still segmenting ambiguous inputs correctly.
Contextual Ranking
Raw frequency ranking produces globally popular suggestions, but relevance improves significantly with context signals. Location bias: a user in Seattle searching "pike" should see "pike place market" above "pike river." City-level location (from IP geolocation or GPS) re-weights suggestions associated with local entities. Session context: terms searched earlier in the same session are treated as evidence of intent — if the user searched "running shoes" then starts typing "ad," "adidas" ranks above "adobe." This is implemented as a session-scoped score multiplier applied on top of the base suggestion scores. Category context: if the user is browsing a shoe category page, suggestions from the shoe query corpus are boosted. All contextual signals are applied as multiplicative score adjustments server-side, after retrieving the base candidate set from cache, keeping the cache layer context-free and shareable across users.
Mobile vs Desktop Differences
Mobile autocomplete differs from desktop in several ways driven by UX constraints and network conditions. Suggestion list size: 5 items on mobile (limited screen space) vs 10 on desktop. Touch targets must be larger (minimum 44px height per Apple HIG), requiring a different rendering component. Voice input integration: mobile surfaces a microphone icon that activates speech-to-text; the transcribed query is sent through the same autocomplete pipeline. For offline support or high-latency networks, an on-device ML model serves suggestions using a quantized trie compressed to fit within a 5–10 MB model file. The on-device model covers the top-100K queries (which represent ~80% of query volume); long-tail queries fall back to server-side lookup when connectivity is available. Debounce interval on mobile is slightly longer (200ms vs 150ms on desktop) to account for the higher cost of mobile network round trips.
Caching Strategy
The caching architecture has three layers. The prefix cache is the core: a Redis ZSET (sorted set) per 2-character prefix stores the top-100 suggestion candidates with their scores as the sort key. TTL is 5 minutes; trending updates invalidate and repopulate affected prefix keys. The CDN cache sits in front of the autocomplete API for anonymous (non-personalized) requests: popular prefix responses (e.g., "th," "wh," "ho") are cached at the edge with a 1-minute TTL, serving the majority of anonymous traffic without hitting origin. The personalization layer is added server-side on top of the cached base suggestions: after retrieving the base list from Redis or CDN, the server applies user-specific score adjustments (history, location, session) and trims to the final result size. This layering keeps the expensive personalization computation out of the cache and ensures the cache hit rate remains high (>95% for the prefix layer).
Latency Budget
The p99 target from keypress to suggestions rendered is 200ms. Breaking this down: client-side debounce absorbs 150ms (requests only fire after 150ms of no new input). DNS + TCP + TLS handshake costs ~50ms on the first request but drops to ~0ms on subsequent requests over the same keep-alive HTTP/2 connection (which persists for the session duration). Server processing — prefix cache lookup, contextual re-ranking, serialization — targets 10ms at p99. Network round-trip on a good connection is 10–30ms. Rendering the suggestion dropdown is < 5ms with a pre-built DOM template. The math: 0ms (connection reuse) + 10ms (server) + 20ms (RTT) + 5ms (render) = 35ms from request send to suggestions visible, well under the 200ms budget. The debounce is the dominant term, intentionally so — it prevents request storms on fast typists and gives the network layer time to batch.
See also: Coinbase Interview Guide