Search Suggest Low-Level Design: Trie Index, Personalization, and Sub-100ms Latency

Search Suggest System Design Overview

Search suggest (autocomplete) is one of the most latency-sensitive features in any product. Users expect candidate completions to appear within 100ms of every keystroke. At scale, the system must serve billions of prefix queries per day, rank candidates by global frequency and personal relevance, and stay current with trending queries without requiring a full index rebuild.

Requirements

Functional Requirements

  • Return up to 10 ranked completions for any prefix of 2 or more characters.
  • Rank results by a blend of global query frequency and personalized user history.
  • Surface trending queries that have spiked in the last hour.
  • Support multiple locales with locale-specific trie indexes.
  • Filter adult or banned queries before returning results.

Non-Functional Requirements

  • P99 latency under 100ms end-to-end including network.
  • Handle 50,000 prefix queries per second at peak.
  • Index freshness: new high-frequency queries appear in suggest within 15 minutes.
  • 99.99% availability; suggest degradation must never block the main search box.

Data Model

The core data structures are:

  • Trie nodes: stored in Redis as hashes keyed by prefix string. Each node stores top-K completions as a sorted list of (query_string, global_score) pairs pre-computed offline.
  • Query frequency table: (query_string, locale, frequency, last_updated) in Postgres, updated by a Spark aggregation job every 10 minutes.
  • Personal history index: (user_id, query_string, last_searched_at, search_count) in Cassandra, written on every committed search.
  • Blocklist: a Redis SET of banned query strings, refreshed every 5 minutes.
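A minimal sketch of how a trie node could be laid out under this data model. The key format (`suggest:v42:…`), the version number, and the JSON encoding are illustrative assumptions, not from the source:

```python
import json

def node_key(version: int, locale: str, prefix: str) -> str:
    # Hypothetical versioned key layout: one Redis key per trie node,
    # so a rebuild can write a fresh version before flipping readers to it.
    return f"suggest:v{version}:{locale}:{prefix}"

def serialize_node(completions):
    """Encode a node's top-K completions as a JSON list of
    (query_string, global_score) pairs, sorted by descending score."""
    top = sorted(completions, key=lambda c: c[1], reverse=True)
    return json.dumps(top)

key = node_key(42, "en-US", "wea")
value = serialize_node([("weather radar", 7.4), ("weather", 9.1)])
```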

Core Algorithms

Trie Index Construction

An offline Spark job runs every 10 minutes over the raw query log. It counts query frequency per locale, applies a minimum frequency threshold (default 100 searches per day), and strips blocked terms. For each surviving query, it inserts the query into a prefix trie. Each trie node stores the top-50 completions by global_score, computed as log(frequency) * recency_decay where recency_decay halves every 7 days. The resulting trie is serialized to Redis using a pipeline of HSET commands, one per node, written under a versioned key prefix; a pointer key is then updated in a single command so readers flip to the new snapshot atomically (a per-key RENAME cannot swap a multi-key snapshot in one step).
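The construction step above can be sketched in a few lines of Python. This is a single-machine stand-in for the Spark job, under the stated parameters (100-search minimum, top-50 per node, 7-day score half-life); the function names are illustrative:

```python
import math
from collections import defaultdict

TOP_K = 50
MIN_DAILY_FREQ = 100  # minimum-frequency threshold from the job config

def global_score(frequency: int, age_days: float) -> float:
    # recency_decay halves every 7 days
    recency_decay = 0.5 ** (age_days / 7.0)
    return math.log(frequency) * recency_decay

def build_trie(query_counts, blocklist=frozenset()):
    """query_counts: {query: (frequency, age_days)}.
    Returns {prefix: top-K [(query, score)]} for prefixes of 2+ chars,
    matching the 2-character minimum in the functional requirements."""
    trie = defaultdict(list)
    for query, (freq, age) in query_counts.items():
        if freq < MIN_DAILY_FREQ or query in blocklist:
            continue
        score = global_score(freq, age)
        for i in range(2, len(query) + 1):
            trie[query[:i]].append((query, score))
    # keep only the top-K completions per node, best score first
    return {p: sorted(c, key=lambda x: -x[1])[:TOP_K] for p, c in trie.items()}
```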

Frequency-Weighted Ranking

When a prefix query arrives, the service fetches the top-50 completions from the trie node for that prefix and re-ranks them by a blended score: 0.7 * global_score + 0.3 * personal_score. The personal score is computed from the user's Cassandra history: queries matching the prefix are retrieved, scored by recency (exponential decay, 14-day half-life) and search count, and merged into the candidate list. The final top 10 are returned.
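The blend described above can be sketched as follows; the log-based personal score is an assumption for illustration (the source specifies only recency decay with a 14-day half-life and search count):

```python
import math

BLEND_GLOBAL, BLEND_PERSONAL = 0.7, 0.3
PERSONAL_HALF_LIFE_DAYS = 14.0

def personal_score(search_count: int, days_since_last: float) -> float:
    # exponential recency decay with a 14-day half-life,
    # dampened search count (log) as an assumed scoring choice
    decay = 0.5 ** (days_since_last / PERSONAL_HALF_LIFE_DAYS)
    return math.log(1 + search_count) * decay

def blend(candidates, history, limit=10):
    """candidates: [(query, global_score)] from the trie node;
    history: {query: (search_count, days_since_last)} matching the prefix."""
    merged = {q: BLEND_GLOBAL * g for q, g in candidates}
    for q, (count, days) in history.items():
        merged[q] = merged.get(q, 0.0) + BLEND_PERSONAL * personal_score(count, days)
    ranked = sorted(merged.items(), key=lambda x: -x[1])
    return [q for q, _ in ranked[:limit]]
```

A personal-history match can overtake a globally stronger candidate, or surface a query absent from the trie node entirely, which is the point of merging rather than filtering.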

Trending Query Injection

A separate trending pipeline counts query frequency in 5-minute tumbling windows, using ZINCRBY on a per-prefix Redis sorted set keyed by query. Queries whose frequency in the current window exceeds 3x their rolling 24-hour baseline are flagged as trending. The suggest service reads the trending set for the current prefix and injects trending results at positions 2 and 3, displacing lower-ranked global candidates. Trending flags expire after 60 minutes.
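The spike test and position-2/3 injection can be sketched as below; the dedup rule (a trending query already at rank 1 stays put) is an assumption about tie handling, not from the source:

```python
def is_trending(window_count: int, baseline_avg: float) -> bool:
    """Flag a query whose 5-minute window count exceeds 3x its
    rolling 24-hour baseline."""
    return window_count > 3 * baseline_avg

def inject_trending(ranked, trending, limit=10):
    """Insert up to two trending queries at positions 2 and 3
    (1-indexed), displacing lower-ranked global candidates."""
    picks = [q for q in trending if q != ranked[0]][:2] if ranked else trending[:2]
    rest = [q for q in ranked if q not in picks]
    out = rest[:1] + picks + rest[1:]
    return out[:limit]
```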

Scalability

The trie is partitioned by locale. Each locale trie fits in a Redis cluster with 32 shards; a lookup is routed to the locale's cluster and then sharded by consistent hash on the prefix (hashing on locale alone would pin an entire locale to one shard). Read replicas absorb the bulk of lookup traffic. The suggest API tier is stateless and scales horizontally behind a load balancer. Browser-side caching (Cache-Control: max-age=5) handles repeated identical prefixes within a single search session, reducing origin hits by roughly 40%.

A warm-up job pre-populates the CDN edge cache for the top 10,000 prefixes per locale every time the trie is rebuilt, ensuring cold-start latency spikes do not reach the origin.

API Design

GET /v1/suggest

  • Query params: q (prefix, required), locale (default en-US), limit (default 10, max 20), session_id (for deduplication)
  • Response: JSON array of suggestion objects: query_string, display_text, score, is_trending
  • Headers: Cache-Control: max-age=5, Vary: Accept-Language

The endpoint returns an empty array (never an error) if the trie lookup fails, ensuring the UI degrades gracefully. A fallback path queries a pre-built static list of the top 1,000 global queries filtered by prefix as a last resort.
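The last-resort fallback path amounts to a prefix filter over a static list; a minimal sketch, assuming case-insensitive matching and the 2-character prefix minimum from the requirements:

```python
def fallback_suggest(prefix, static_top_queries, limit=10):
    """Last-resort path: filter a pre-built static list of top global
    queries by prefix. Returns [] rather than raising, so the UI
    degrades gracefully when the trie lookup fails."""
    if len(prefix) < 2:
        return []
    p = prefix.lower()
    return [q for q in static_top_queries if q.lower().startswith(p)][:limit]
```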

Sub-100ms Latency Breakdown

The latency budget is: 20ms DNS and TLS at edge, 5ms routing to origin, 15ms for the storage step, in which the 10ms Redis trie lookup runs in parallel with the 15ms Cassandra personal re-ranking, 10ms JSON serialization, leaving roughly 50ms of headroom for network variance. Personal re-ranking runs concurrently with the trie lookup using a Go goroutine; results are merged after both complete or after a 30ms timeout, whichever comes first. If personal history times out, global ranking is used alone.
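The concurrent-fetch-with-timeout pattern can be sketched in Python with a thread pool (the source uses a Go goroutine; this is an equivalent illustration, and the merge, personal matches first, is a deliberate simplification of the blended re-ranking described earlier):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

PERSONAL_TIMEOUT_S = 0.030  # 30ms budget for the personal-history fetch

def suggest(prefix, fetch_global, fetch_personal):
    """Run the trie lookup and the personal-history fetch concurrently;
    if personal history misses its 30ms budget, fall back to the
    global ranking alone."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        g = pool.submit(fetch_global, prefix)
        p = pool.submit(fetch_personal, prefix)
        ranked = g.result()  # global ranking is always required
        try:
            personal = p.result(timeout=PERSONAL_TIMEOUT_S)
        except FuturesTimeout:
            personal = []  # degrade gracefully to global-only results
    # simplified merge: personal matches first, then remaining global hits
    return personal + [q for q in ranked if q not in personal]
```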

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What trie index structure powers search suggestions?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A compressed trie (radix tree) stores all indexed terms, where each node represents a shared prefix. Each node or leaf stores an ordered list of (score, suggestion) pairs, capped at the top-K results for that prefix. This allows O(P) lookup where P is the length of the prefix typed by the user. The trie is serialized compactly and loaded into memory on suggestion nodes for microsecond lookups."
      }
    },
    {
      "@type": "Question",
      "name": "How are suggestions ranked by frequency?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Each suggestion carries a weight derived from historical query frequency, optionally decayed by recency (exponential time-decay so older queries contribute less). During trie construction, suggestions at each prefix node are sorted by descending weight and only the top-K are retained. Frequency counts are aggregated offline from query logs via a MapReduce or streaming pipeline and used to rebuild or incrementally update the trie."
      }
    },
    {
      "@type": "Question",
      "name": "How do you personalize search suggestions for individual users?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Personalization blends global frequency scores with a per-user signal derived from the user's own query history and click-through data. At query time, the base top-K suggestions from the trie are re-ranked using a lightweight model (e.g., a dot-product score between a user embedding and suggestion embeddings stored in a low-dimensional space). Results are merged and the final list is returned, balancing popularity with personal relevance."
      }
    },
    {
      "@type": "Question",
      "name": "How do you achieve sub-100ms latency for search suggestions?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The trie index lives entirely in memory on each suggestion server, eliminating disk I/O. Requests are served from a CDN edge cache for the most popular prefixes. For cache misses, the backend responds in under 10ms due to the in-memory trie. Client-side debouncing (150-200ms) reduces request rate, and a dedicated low-latency network path (keep-alive HTTP/2 or WebSocket) minimizes round-trip overhead to stay comfortably under 100ms end-to-end."
      }
    }
  ]
}

