Low Level Design: Search Typeahead System

Search typeahead (autocomplete) is one of the most latency-sensitive features in any search product. Users expect suggestions to appear within 100ms of each keystroke. At Google or Bing scale, that means serving billions of prefix queries per day, updating suggestions in near-real-time as query trends shift, and personalizing results per user — all while maintaining sub-100ms p99 latency. This post walks through the complete system design, from data structures to distributed serving.

Requirements

Functional requirements: given a prefix string typed by a user, return the top K (typically 5–10) query suggestions ranked by relevance (primarily global query frequency, optionally blended with personalization). Suggestions must reflect recent query trends — a viral news story should surface in suggestions within minutes to hours.

Non-functional requirements: p99 latency under 100ms at the client (which means under 50ms at the server, leaving budget for the network round-trip). Availability must be extremely high — typeahead failure degrades UX but shouldn’t block search itself. The system must scale to tens of thousands of requests per second per datacenter. The write path (ingesting new queries) can tolerate much higher latency than the read path — seconds to minutes of lag is acceptable.

Trie with Cached Top-K

The naive data structure for prefix lookup is a trie: a tree where each edge represents a character, each node represents a prefix, and nodes marked as terminal represent complete queries (a query can be a prefix of a longer one, so terminals are not only leaves). Traversal to a node gives all queries with that prefix. The problem: to find top-K suggestions, you’d traverse the entire subtree below a node to find the highest-frequency queries — O(subtree size) per query. At the root, that’s the entire dataset. The optimization: store the top-K suggestions directly at each node. Any prefix lookup becomes O(prefix length) — just traverse down the trie and return the cached list at the destination node. The trade-off: updates are expensive. Inserting a new high-frequency query requires updating the top-K list at every ancestor node up to the root. With millions of unique queries, continuous in-place updates are prohibitively expensive. This is why the trie is rebuilt from batch aggregations rather than updated incrementally.
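A minimal sketch of this structure, assuming frequencies are already known at rebuild time (the names `TrieNode` and `K`, and the build-from-scratch approach, are illustrative):

```python
import heapq
from collections import defaultdict

K = 5  # suggestions cached per node (illustrative)

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.top_k = []  # cached (frequency, query) pairs, best first after trimming

def build_trie(query_frequencies):
    """Build a trie in which every node caches the top-K queries beneath it."""
    root = TrieNode()
    for query, freq in query_frequencies.items():
        node = root
        for ch in query:
            node = node.children[ch]
            node.top_k.append((freq, query))  # candidate at every ancestor prefix
    def trim(node):
        node.top_k = heapq.nlargest(K, node.top_k)
        for child in node.children.values():
            trim(child)
    trim(root)
    return root

def suggest(root, prefix):
    """O(len(prefix)) lookup: walk down the trie and return the cached list."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [query for _, query in node.top_k]

# Usage:
# root = build_trie({"python": 900, "pytorch": 400, "pycharm": 150})
# suggest(root, "py")  -> ['python', 'pytorch', 'pycharm']
```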

Offline Aggregation Pipeline

The foundation of suggestion rankings is a batch aggregation pipeline. Raw query logs are written to a distributed log (Kafka) and periodically dumped to object storage (S3/GCS). A Hadoop or Spark job runs daily (or every few hours) over the trailing 7-day window of query logs, computing query frequency counts. The job then generates the prefix-to-top-K mapping: for every prefix of every query, record the query as a candidate suggestion with its frequency weight. After aggregating across all queries, each prefix has a ranked list of top-K candidates. This mapping is serialized and pushed to the serving layer. The key insight: the batch job doesn’t need to build an in-memory trie — it can produce a flat prefix→top-K dictionary that’s more amenable to distributed storage and fast key lookup. The trie is conceptually useful for understanding the problem, but flat key-value storage is often more practical at scale.
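Here is a rough sketch of that aggregation step in plain Python, standing in for the Spark/Hadoop job; the function and parameter names are illustrative:

```python
import heapq
from collections import Counter, defaultdict

def aggregate_prefix_top_k(query_log_lines, k=10, max_prefix_len=20):
    """Compute query frequencies, then emit a flat prefix -> top-k mapping."""
    # Step 1: frequency counts over the trailing window of query logs.
    freq = Counter(query_log_lines)

    # Step 2: for every prefix of every query, record (frequency, query).
    candidates = defaultdict(list)
    for query, count in freq.items():
        for length in range(1, min(len(query), max_prefix_len) + 1):
            candidates[query[:length]].append((count, query))

    # Step 3: keep only the k highest-frequency candidates per prefix.
    return {
        prefix: [q for _, q in heapq.nlargest(k, cands)]
        for prefix, cands in candidates.items()
    }

# Usage:
# mapping = aggregate_prefix_top_k(["weather today", "weather today", "web"])
# mapping["we"]  -> ['weather today', 'web']
```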

Data Serving Layer

The prefix→top-K mapping from the batch pipeline is loaded into a distributed cache or key-value store for low-latency serving. Redis is a natural choice: the entire mapping for most sites fits in memory (a few GB), prefix lookups are O(1) hash table lookups, and Redis Cluster provides horizontal scaling. For larger deployments, custom in-memory stores backed by a persistent layer (RocksDB or similar) provide more control. The serving layer is read-heavy and read-optimized: updates happen periodically (hourly or daily from the batch pipeline), while reads happen thousands of times per second. Replication factor should be high (3–5) to distribute read load. Geographic replication (active-active across datacenters) keeps latency low for global users. Updates are pushed to all replicas via a change propagation mechanism — either a pub/sub system or a versioned snapshot load process.
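One possible shape of the serving path, sketched with redis-py and one sorted set per prefix; the key scheme (`ta:<version>:<prefix>`) and the versioned-snapshot cutover are assumptions, not a prescribed layout:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_mapping(prefix_top_k, version):
    """Push the batch output into Redis; one sorted set per prefix."""
    pipe = r.pipeline(transaction=False)
    for prefix, suggestions in prefix_top_k.items():
        key = f"ta:{version}:{prefix}"  # versioned keys allow a clean cutover
        # Score by rank so a reverse range returns suggestions best-first.
        pipe.zadd(key, {q: len(suggestions) - i for i, q in enumerate(suggestions)})
    pipe.execute()
    r.set("ta:current_version", version)  # flip readers to the new snapshot

def suggest(prefix, k=10):
    version = r.get("ta:current_version")
    return r.zrevrange(f"ta:{version}:{prefix}", 0, k - 1)
```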

Real-Time Updates

Batch pipelines have latency measured in hours. For trending queries (breaking news, viral events), this is too slow. Real-time updates blend short-window signals into the rankings. The architecture: a stream processor (Flink, Spark Streaming, or a custom service) consumes the query log from Kafka in near-real-time, computing query frequencies over a short trailing window (e.g., last 15 minutes). These short-window counts are used to compute frequency boosts for trending queries. At serving time, the top-K from the offline pipeline is blended with trending boosts: suggestions that are surging in the short window get promoted in the rankings. This two-signal approach prevents trending noise from completely overriding stable long-term rankings while still surfacing genuinely trending queries quickly. The blend weight can be tuned — heavier weight on trending during news events, lighter weight for stable query patterns.
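A sketch of the serving-time blend, assuming both signals have been normalized to a comparable scale (for example, queries per minute over their respective windows); the linear blend and the `alpha` weight are illustrative:

```python
def blended_score(offline_rate, trending_rate, alpha=0.3):
    """Blend the long-window rate (batch pipeline) with the short-window
    rate (stream processor). alpha is the tunable weight on trending."""
    return (1 - alpha) * offline_rate + alpha * trending_rate

def rerank_with_trending(offline_candidates, trending_rates, alpha=0.3, k=10):
    """Re-rank the offline top-K for a prefix using short-window boosts.

    offline_candidates: list of (query, offline_rate) from the batch mapping
    trending_rates:     dict of query -> short-window rate from the stream job
    """
    scored = [
        (blended_score(rate, trending_rates.get(query, 0.0), alpha), query)
        for query, rate in offline_candidates
    ]
    scored.sort(reverse=True)
    return [query for _, query in scored[:k]]
```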

Prefix Caching Strategy

Not all prefixes are equally common. Query length distributions are heavily skewed: most queries are short, so short prefixes (1–5 characters) receive the vast majority of traffic. Prefixes of length 1 and 2 collectively cover a huge fraction of all typeahead requests. The caching strategy: pre-compute and cache all prefixes up to length L (e.g., L = 5 or L = 8). These cover ~90–95% of traffic. Longer prefixes are served via live trie traversal or a less-aggressively-cached path. CDN-level caching is effective for short prefixes: the top-10 suggestions for "ap" or "the" don’t change minute-to-minute, so they can be cached at edge nodes with a TTL of a few minutes. This pushes a large fraction of typeahead traffic to the CDN layer, dramatically reducing load on origin servers. Cache invalidation is managed via TTL expiry — stale suggestions for a few minutes are acceptable.
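A toy sketch of the two-path split, with an in-process TTL cache standing in for the CDN/edge layer; `fetch_from_store` is a hypothetical stand-in for the origin lookup and the cutoff of 5 characters is illustrative:

```python
import time

MAX_CACHED_PREFIX_LEN = 5   # the cutoff L discussed above (illustrative)
CACHE_TTL_SECONDS = 300     # short prefixes tolerate a few minutes of staleness

_cache = {}  # prefix -> (expiry_timestamp, suggestions)

def fetch_from_store(prefix):
    """Hypothetical origin lookup (e.g., the Redis path sketched earlier)."""
    return []

def get_suggestions(prefix):
    """Serve short prefixes from a TTL cache; send long-tail prefixes to origin."""
    if len(prefix) > MAX_CACHED_PREFIX_LEN:
        return fetch_from_store(prefix)      # long tail: no aggressive caching
    entry = _cache.get(prefix)
    if entry and entry[0] > time.time():
        return entry[1]                      # cache hit, still fresh
    suggestions = fetch_from_store(prefix)
    _cache[prefix] = (time.time() + CACHE_TTL_SECONDS, suggestions)
    return suggestions
```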

Sharding by Prefix

The prefix→top-K dataset must be sharded across multiple nodes at scale. The natural sharding key is the prefix itself. A common approach: shard by the first N characters of the prefix (e.g., N = 2 or N = 3). All prefixes starting with "ap" go to shard A, all prefixes starting with "ba" go to shard B, etc. This gives predictable routing — the typeahead service hashes the first N characters of the query to determine which shard to query. Hot shard problem: some character prefixes are far more popular than others ("th" vs. "xz"). Mitigation: use consistent hashing rather than range-based sharding so prefixes are spread more evenly across shards, and use virtual nodes to over-partition and rebalance. Even so, a single very hot prefix ("th") still maps to one shard; read replicas behind each shard and the caching layer above absorb that read load. Shard failures degrade suggestions for the affected prefix range but don’t take down the entire service.
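A sketch of the routing side, assuming consistent hashing with virtual nodes over the first two characters of the prefix (the shard names and the MD5-based hash are illustrative choices):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent-hash ring over shard names, with virtual nodes to spread
    prefixes across physical shards more evenly than range partitioning."""

    def __init__(self, shards, virtual_nodes=64):
        self._ring = []  # sorted list of (hash, shard)
        for shard in shards:
            for v in range(virtual_nodes):
                self._ring.append((self._hash(f"{shard}#{v}"), shard))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, prefix, routing_chars=2):
        """Route by the first N characters of the prefix."""
        key = self._hash(prefix[:routing_chars])
        idx = bisect.bisect(self._keys, key) % len(self._ring)
        return self._ring[idx][1]

# Usage:
# ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
# ring.shard_for("thermostat")  -> one of the shard names
```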

Personalization

Global top-K rankings ignore individual user context. A user who frequently searches for Python documentation should see Python-related suggestions when they type "py", not just the globally most common "py"-prefixed queries. Personalization blends global rankings with user-specific signals. User query history is stored per-user (in a user session store or a dedicated user signal service). At serving time, for a given prefix, candidates from the user’s query history that match the prefix are scored and blended with global candidates. The blend can be simple: boost user history matches by a fixed multiplier, re-rank, take top-K from the combined list. More sophisticated approaches use a lightweight ML model trained to predict CTR given (user, prefix, candidate) features. Privacy consideration: query history is sensitive. Most systems store only a bounded history (e.g., last 30 days), apply on-device processing where possible, and provide explicit history clearing.
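A sketch of the simple multiplier blend described above; the boost value and scoring scheme are assumptions rather than a fixed recipe:

```python
def personalized_top_k(prefix, global_candidates, user_history, boost=10.0, k=10):
    """Blend global top-K candidates with the user's own prefix-matching history.

    global_candidates: list of (query, global_frequency) from the serving layer
    user_history:      dict of query -> times this user issued it recently
    boost:             fixed multiplier applied to history matches (illustrative)
    """
    scores = {query: float(freq) for query, freq in global_candidates}
    for query, count in user_history.items():
        if query.startswith(prefix):
            # History matches get multiplied up; queries seen only in the
            # user's history enter with a baseline of 1.0 before the boost.
            scores[query] = scores.get(query, 1.0) * boost * count
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [query for query, _ in ranked[:k]]
```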

Spell Correction and Fuzzy Matching

Users make typos. A typeahead system that only handles exact prefix matches provides poor suggestions for mistyped queries. Spell correction approaches: (1) Edit distance: for a typed prefix that doesn’t match any known prefix, find the closest valid prefix by Levenshtein distance. Expensive for large dictionaries — BK-trees or SymSpell provide efficient nearest-neighbor search. (2) Phonetic matching: normalize the typed prefix phonetically (Soundex, Metaphone) and look up phonetically similar known prefixes. (3) Learned correction: train a sequence-to-sequence model on (mistyped prefix, correct prefix) pairs from query logs — queries that were typed one way and then corrected give natural training signal. In practice, most production systems handle the most common typos with a curated correction dictionary and fall back to Levenshtein distance for unknown typos. Spell correction candidates are ranked alongside normal prefix candidates and may be labeled with "Did you mean?" UI treatment.
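A brute-force sketch of that fallback path (curated dictionary first, edit distance second); a production system would replace the linear scan over known prefixes with a BK-tree or SymSpell index as noted above:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct_prefix(typed, known_prefixes, corrections=None, max_distance=2):
    """Curated-dictionary first, Levenshtein fallback for unknown typos."""
    corrections = corrections or {}
    if typed in known_prefixes:
        return typed
    if typed in corrections:                 # common typos handled directly
        return corrections[typed]
    best = min(known_prefixes, key=lambda p: levenshtein(typed, p), default=None)
    if best is not None and levenshtein(typed, best) <= max_distance:
        return best
    return typed                             # give up: keep the user's input
```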

Client-Side Optimizations

The client plays a critical role in typeahead performance and UX. Debouncing: don’t send a request on every keystroke. Wait for a short pause (e.g., 100–200ms) before firing the request. This reduces request volume by 3–5x without meaningfully degrading perceived responsiveness. Client-side prefix cache: store responses in a local dictionary keyed by prefix. If the user types "app" and the client has already fetched suggestions for "ap", it can render those immediately while the "app" request is in flight, then update when the "app" response arrives. Prefix extension optimization: if the client has a cached response for prefix P and the user extends to P+c, and all candidates in P’s response also start with P+c, the cached response is still valid — no network request needed. Request cancellation: if the user types quickly and multiple requests are in flight, cancel older requests when a newer one is sent. This prevents stale responses from overwriting fresher results.
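A sketch of the client-side cache and the prefix-extension check, written in Python for consistency with the other examples (the real implementation would live in browser JavaScript); class and method names are illustrative:

```python
class ClientPrefixCache:
    """Local prefix -> suggestions cache with the prefix-extension shortcut."""

    def __init__(self):
        self._cache = {}  # prefix -> list of suggestions

    def store(self, prefix, suggestions):
        self._cache[prefix] = suggestions

    def lookup(self, prefix):
        """Return (suggestions to render now, whether a network request is needed)."""
        if prefix in self._cache:
            return self._cache[prefix], False       # exact hit: no request
        # Find the longest cached prefix that the typed string extends.
        for cached in sorted(self._cache, key=len, reverse=True):
            if prefix.startswith(cached):
                suggestions = self._cache[cached]
                narrowed = [s for s in suggestions if s.startswith(prefix)]
                if len(narrowed) == len(suggestions):
                    return narrowed, False          # extension still fully valid
                return narrowed, True               # render now, refresh in flight
        return [], True                             # cold prefix: must fetch
```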
