Low Level Design: Search Autocomplete Deep Dive

Search autocomplete is a latency-critical, high-read-volume system that combines data structures, probabilistic algorithms, and multi-layer caching to return relevant suggestions within 200ms of a keypress. This design goes deeper than surface-level trie discussion into the production engineering that makes it work.

Prefix Tree at Scale

A trie is the canonical data structure for prefix lookup, but naive in-memory tries don’t scale to billions of queries. The production approach: each trie node stores the top-K (e.g., top-100) suggestions with their scores rather than the full subtree. This caps memory per node and makes retrieval O(prefix_length) instead of requiring traversal of the full subtree. The trie is serialized as a compact binary format (not JSON) — each node encoded as a fixed-size record with child pointers and an offset into a suggestion list array. The entire structure is memory-mapped, enabling fast random access without deserialization overhead. Sharding by 2-character prefix distributes the trie across machines: queries for "ap…" route to the shard owning the "ap" prefix range. Each shard is served from a dedicated in-memory cache (Redis or a custom service), with the full trie snapshot loaded on startup from object storage and updated incrementally as query data arrives.
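The per-node top-K idea can be illustrated with a short in-memory sketch. This is not the memory-mapped binary format described above, just the core invariant: every node along an inserted query's path keeps only its K highest-scoring suggestions, so a lookup is a descent plus a read, never a subtree traversal. `K = 3` here is an illustrative value standing in for the production top-100.

```python
import heapq

K = 3  # illustrative; production uses ~100 per node

class TrieNode:
    """Each node caches the top-K (score, query) pairs for its prefix,
    so lookup never traverses the subtree below it."""
    def __init__(self):
        self.children = {}
        self.top_k = []  # min-heap; smallest cached score sits at index 0

    def offer(self, score, query):
        # Keep only the K highest-scoring suggestions for this prefix.
        entry = (score, query)
        if len(self.top_k) < K:
            heapq.heappush(self.top_k, entry)
        elif entry > self.top_k[0]:
            heapq.heapreplace(self.top_k, entry)

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, query, score):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.offer(score, query)  # propagate along the whole path

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:  # O(len(prefix)) descent, independent of corpus size
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for _, q in sorted(node.top_k, reverse=True)]

trie = Trie()
for q, s in [("apple", 90), ("app store", 80), ("apply", 40), ("appliance", 20)]:
    trie.insert(q, s)
print(trie.suggest("app"))  # the 3 highest-scoring "app" queries, best first
```

The update cost is the price of this layout: inserting or re-scoring one query touches every node on its path, which is why production systems rebuild the structure offline and ship snapshots rather than mutating it per event.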

Real-Time Trending

Trending queries must be detected in near-real-time and promoted above baseline frequency ranking. A Count-Min Sketch (CMS) tracks query frequency in a sliding 1-hour window with O(1) update and O(1) query time, using a fixed memory footprint regardless of query volume. The CMS is a 2D array of counters; each query hashes to one cell per row and increments it. Frequency estimate is the minimum across all rows (hence "min"), which bounds overcounting from hash collisions. Query events are consumed from a Kafka topic by a streaming processor (Flink or Spark Streaming). The top-K trending queries are maintained in a min-heap of size K, updated on each event: if the estimated frequency exceeds the heap minimum, the old minimum is evicted and the new query inserted. The resulting trending list is pushed to the autocomplete serving layer every 30–60 seconds, where trending queries receive a score boost that can surface them above higher-frequency but stale suggestions.
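A minimal sketch of the two pieces just described: a Count-Min Sketch for frequency estimates and a size-K min-heap for the trending list. The width/depth values are illustrative, the hash is derived from `blake2b` for determinism, and the heap-refresh path is deliberately naive (a production stream processor would use an indexed heap or space-saving structure rather than rebuilding).

```python
import hashlib
import heapq

class CountMinSketch:
    """depth rows of width counters; estimate = min over rows,
    which bounds the overcount from hash collisions."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for i in range(self.depth):
            # One independent-ish hash per row, via a row-salted digest.
            digest = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield i, int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.rows[row][col] += 1

    def estimate(self, item):
        return min(self.rows[row][col] for row, col in self._buckets(item))

def top_k_trending(events, k=2):
    """Consume a stream of query events, maintain the top-k by estimated count."""
    cms = CountMinSketch()
    heap, members = [], set()
    for q in events:
        cms.add(q)
        est = cms.estimate(q)
        if q in members:
            # Already tracked: refresh stored estimates (naive rebuild for clarity).
            heap = [(cms.estimate(x), x) for _, x in heap]
            heapq.heapify(heap)
        elif len(heap) < k:
            heapq.heappush(heap, (est, q))
            members.add(q)
        elif est > heap[0][0]:
            # Beats the current minimum: evict it, admit the newcomer.
            _, evicted = heapq.heapreplace(heap, (est, q))
            members.discard(evicted)
            members.add(q)
    return sorted(heap, reverse=True)

events = ["apple"] * 5 + ["apex"] * 3 + ["apron"]
print(top_k_trending(events, k=2))  # "apple" and "apex" survive; "apron" never qualifies
```

The sliding-window behavior from the text is obtained by rotating sketches (e.g. one per 10-minute bucket, summing the most recent six), which this sketch omits.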

Spell Correction

Spell correction uses Levenshtein edit distance to find the closest valid query in the dictionary. The DP recurrence computes edit distance between two strings in O(m*n) time where m and n are string lengths. Running this against a full dictionary on every keystroke is infeasible; the approach is to first retrieve a small candidate set and then compute exact distance only against candidates. A BK-tree (Burkhard-Keller tree) indexes the dictionary using edit distance as the metric; the triangle inequality prunes most branches, so a search for all terms within edit distance 2 visits only a small fraction of the tree — strongly sublinear in practice, though not a guaranteed O(log N). SymSpell is an alternative: pre-compute all deletions up to max edit distance for every dictionary term (stored in a hash map), then look up deletions of the query to find candidates in near-O(1). In practice, restricting to edit distance 1–2 covers the vast majority of typos while keeping candidate sets small. The corrected query is presented as a "Did you mean X?" suggestion ranked below exact prefix matches but above no-result fallbacks.
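The SymSpell idea is compact enough to sketch end to end: index every dictionary term under all of its k-deletion variants, retrieve candidates by intersecting deletion sets, then verify with the exact DP. The toy dictionary is hypothetical; a real index holds millions of terms.

```python
def levenshtein(a, b):
    """Classic DP edit distance, O(len(a) * len(b)), one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete from a
                           cur[j - 1] + 1,               # insert into a
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

def deletions(term, k=1):
    """All variants of `term` with up to k characters removed (SymSpell keys)."""
    out, frontier = {term}, {term}
    for _ in range(k):
        frontier = {t[:i] + t[i + 1:] for t in frontier for i in range(len(t))}
        out |= frontier
    return out

def build_index(dictionary, k=1):
    """Map every deletion variant to the dictionary terms that produce it."""
    index = {}
    for term in dictionary:
        for d in deletions(term, k):
            index.setdefault(d, set()).add(term)
    return index

def correct(query, index, k=1):
    # Candidates share at least one deletion variant with the query;
    # the exact DP then filters false positives.
    candidates = set()
    for d in deletions(query, k):
        candidates |= index.get(d, set())
    return sorted(c for c in candidates if levenshtein(query, c) <= k)

idx = build_index(["apple", "maple", "ample"], k=1)
print(correct("aple", idx))  # every term one edit away from the typo
```

The memory/latency trade is visible here: the index stores O(L^k) variants per term of length L, which is exactly the footprint SymSpell pays for its near-O(1) lookups.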

Query Segmentation

Multi-word queries require segmentation before trie lookup. The input string is tokenized and each token is matched independently in the trie, then suggestions are combined and re-ranked. The challenge is entity recognition: "new york times" should be treated as a single entity, not three separate tokens. An n-gram language model trained on the query log scores candidate segment boundaries — it assigns probability to sequences of tokens, and the segmentation that maximizes joint probability is selected using dynamic programming. Common named entities (cities, brands, people) are pre-loaded into an entity dictionary that overrides the language model when a known entity is detected. This allows "new york" to be looked up as a single prefix entry in the trie while still segmenting ambiguous inputs correctly.

Contextual Ranking

Raw frequency ranking produces globally popular suggestions, but relevance improves significantly with context signals. Location bias: a user in Seattle searching "pike" should see "pike place market" above "pike river." City-level location (from IP geolocation or GPS) re-weights suggestions associated with local entities. Session context: terms searched earlier in the same session are treated as evidence of intent — if the user searched "running shoes" then starts typing "ad," "adidas" ranks above "adobe." This is implemented as a session-scoped score multiplier applied on top of the base suggestion scores. Category context: if the user is browsing a shoe category page, suggestions from the shoe query corpus are boosted. All contextual signals are applied as multiplicative score adjustments server-side, after retrieving the base candidate set from cache, keeping the cache layer context-free and shareable across users.
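The multiplicative re-ranking step might look like the sketch below. The candidate tuples, category tags, and the boost factor are assumed for illustration; the key property from the text is that the cached base scores are never modified, so the cache stays shareable while the per-user adjustment happens at request time.

```python
def rerank(candidates, context, boost=3.0):
    """Apply multiplicative context boosts on top of cached base scores.

    candidates: (suggestion, base_score, categories) from the shared prefix cache,
                with categories assumed to be precomputed tags per suggestion.
    context:    category tags derived server-side from session history,
                location, and the current page (assumed upstream signals).
    """
    def adjusted(c):
        suggestion, base_score, categories = c
        # Any overlap with the context set earns the multiplier; the cached
        # base_score itself is left untouched.
        return base_score * (boost if categories & context else 1.0)

    ranked = sorted(candidates, key=adjusted, reverse=True)
    return [suggestion for suggestion, _, _ in ranked]

base = [("adobe",  100, {"software"}),
        ("adidas",  80, {"shoes", "apparel"})]
# Session contained "running shoes": the shoe-tagged suggestion wins.
print(rerank(base, context={"shoes"}))
# No context: globally popular order is preserved.
print(rerank(base, context=set()))
```

Because the adjustment is a pure function of the cached list plus per-request signals, it costs a few microseconds per candidate and needs no extra cache entries per user.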

Mobile vs Desktop Differences

Mobile autocomplete differs from desktop in several ways driven by UX constraints and network conditions. Suggestion list size: 5 items on mobile (limited screen space) vs 10 on desktop. Touch targets must be larger (minimum 44px height per Apple HIG), requiring a different rendering component. Voice input integration: mobile surfaces a microphone icon that activates speech-to-text; the transcribed query is sent through the same autocomplete pipeline. For offline support or high-latency networks, an on-device model serves suggestions from a quantized trie compressed to fit within a 5–10 MB file. The on-device model covers the top-100K queries (which represent ~80% of query volume); long-tail queries fall back to server-side lookup when connectivity is available. Debounce interval on mobile is slightly longer (200ms vs 150ms on desktop) to account for the higher cost of mobile network round trips.

Caching Strategy

The caching architecture has three layers. The prefix cache is the core: a Redis ZSET (sorted set) per 2-character prefix stores the top-100 suggestion candidates with their scores as the sort key. TTL is 5 minutes; trending updates invalidate and repopulate affected prefix keys. The CDN cache sits in front of the autocomplete API for anonymous (non-personalized) requests: popular prefix responses (e.g., "th," "wh," "ho") are cached at the edge with a 1-minute TTL, serving the majority of anonymous traffic without hitting origin. The personalization layer is added server-side on top of the cached base suggestions: after retrieving the base list from Redis or CDN, the server applies user-specific score adjustments (history, location, session) and trims to the final result size. This layering keeps the expensive personalization computation out of the cache and ensures the cache hit rate remains high (>95% for the prefix layer).
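The prefix-cache layer can be sketched as an in-process stand-in. A real deployment would use Redis sorted-set commands (ZADD to populate, a descending range read to fetch) with the TTL handled by Redis itself; this version only mirrors the key scheme (`prefix:<2 chars>`), the top-N trim, the TTL check, and the invalidation path used by trending updates.

```python
import time

class PrefixCache:
    """In-process stand-in for the Redis ZSET layer: one sorted candidate
    set per 2-character prefix key, with a TTL and explicit invalidation."""
    def __init__(self, ttl=300, top_n=100):
        self.ttl, self.top_n = ttl, top_n
        self.store = {}  # "prefix:ap" -> (expires_at, {suggestion: score})

    def _key(self, prefix):
        return f"prefix:{prefix[:2]}"

    def populate(self, prefix, scored):
        """Write the top-N candidates for a 2-char prefix (ZADD equivalent)."""
        members = dict(sorted(scored.items(), key=lambda kv: -kv[1])[:self.top_n])
        self.store[self._key(prefix)] = (time.time() + self.ttl, members)

    def get(self, prefix, limit=10):
        """Return suggestions for a (possibly longer) prefix, best score first.
        None signals a miss/expiry: the caller rebuilds from the trie shard."""
        entry = self.store.get(self._key(prefix))
        if entry is None or entry[0] < time.time():
            return None
        _, members = entry
        hits = [(s, sc) for s, sc in members.items() if s.startswith(prefix)]
        return [s for s, _ in sorted(hits, key=lambda kv: -kv[1])[:limit]]

    def invalidate(self, prefix):
        """Trending update path: drop the key so the next read repopulates."""
        self.store.pop(self._key(prefix), None)

cache = PrefixCache()
cache.populate("ap", {"apple": 90, "app store": 80, "apricot": 10})
print(cache.get("app"))  # filtered to entries extending the longer prefix
```

Serving longer prefixes ("app", "appl") by filtering the 2-char key's candidate set is what keeps the keyspace small: one key per 2-char prefix rather than one per typed prefix.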

Latency Budget

The p99 target from keypress to suggestions rendered is 200ms. Breaking this down: client-side debounce absorbs 150ms (requests only fire after 150ms of no new input). DNS + TCP + TLS handshake costs ~50ms on the first request but drops to ~0ms on subsequent requests over the same keep-alive HTTP/2 connection (which persists for the session duration). Server processing — prefix cache lookup, contextual re-ranking, serialization — targets 10ms at p99. Network round-trip on a good connection is 10–30ms. Rendering the suggestion dropdown is < 5ms with a pre-built DOM template. The math: 0ms (connection reuse) + 10ms (server) + 20ms (RTT) + 5ms (render) = 35ms from request send to suggestions visible, well under the 200ms budget. The debounce is intentionally the dominant term: it prevents request storms from fast typists and ensures that only settled prefixes generate requests at all.


