Search Suggest System Design Overview
Search suggest (autocomplete) is one of the most latency-sensitive features in any product. Users expect candidate completions to appear within 100ms of every keystroke. At scale, the system must serve billions of prefix queries per day, rank candidates by global frequency and personal relevance, and stay current with trending queries without requiring a full index rebuild.
Requirements
Functional Requirements
- Return up to 10 ranked completions for any prefix of 2 or more characters.
- Rank results by a blend of global query frequency and personalized user history.
- Surface trending queries that have spiked in the last hour.
- Support multiple locales with locale-specific trie indexes.
- Filter adult or banned queries before returning results.
Non-Functional Requirements
- P99 latency under 100ms end-to-end including network.
- Handle 50,000 prefix queries per second at peak.
- Index freshness: new high-frequency queries appear in suggest within 15 minutes.
- 99.99% availability; suggest degradation must never block the main search box.
Data Model
The core data structures are:
- Trie nodes: stored in Redis as hashes keyed by prefix string. Each node stores top-K completions as a sorted list of (query_string, global_score) pairs pre-computed offline.
- Query frequency table: (query_string, locale, frequency, last_updated) in Postgres, updated by a Spark aggregation job every 10 minutes.
- Personal history index: (user_id, query_string, last_searched_at, search_count) in Cassandra, written on every committed search.
- Blocklist: a Redis SET of banned query strings, refreshed every 5 minutes.
Core Algorithms
Trie Index Construction
An offline Spark job runs every 10 minutes over the raw query log. It counts query frequency per locale, applies a minimum frequency threshold (default 100 searches per day), and strips blocked terms. For each surviving query, it inserts the query into a prefix trie. Each trie node stores the top-50 completions by global_score, computed as log(frequency) * recency_decay where recency_decay halves every 7 days. The resulting trie is serialized to Redis using a pipeline of HSET commands, one per node, replacing the previous snapshot atomically via RENAME.
Frequency-Weighted Ranking
When a prefix query arrives, the service fetches the top-50 completions from the trie node for that prefix. It then re-ranks the top 50 by a blended score: 0.7 * global_score + 0.3 * personal_score. Personal score is computed from the users Cassandra history: queries matching the prefix are retrieved, scored by recency (exponential decay, half-life 14 days) and search count, and merged into the candidate list. The final top-10 are returned.
Trending Query Injection
A separate trending pipeline counts query frequency in 5-minute tumbling windows using a Redis ZINCRBY counter per (prefix, query) pair. Queries whose frequency in the current window exceeds 3x their rolling 24-hour baseline are flagged as trending. The suggest service reads the trending set for the current prefix and injects trending results at positions 2 and 3, displacing lower-ranked global candidates. Trending flags expire after 60 minutes.
Scalability
The trie is partitioned by locale. Each locale trie fits in a Redis cluster with 32 shards; prefix lookups are routed by consistent hash on locale. Read replicas absorb the bulk of lookup traffic. The suggest API tier is stateless and scales horizontally behind a load balancer. Browser-side caching (Cache-Control: max-age=5) handles repeated identical prefixes within a single search session, reducing origin hits by roughly 40%.
A warm-up job pre-populates the CDN edge cache for the top 10,000 prefixes per locale every time the trie is rebuilt, ensuring cold-start latency spikes do not reach the origin.
API Design
GET /v1/suggest
- Query params: q (prefix, required), locale (default en-US), limit (default 10, max 20), session_id (for deduplication)
- Response: JSON array of suggestion objects: query_string, display_text, score, is_trending
- Headers: Cache-Control: max-age=5, Vary: Accept-Language
The endpoint returns an empty array (never an error) if the trie lookup fails, ensuring the UI degrades gracefully. A fallback path queries a pre-built static list of the top 1,000 global queries filtered by prefix as a last resort.
Sub-100ms Latency Breakdown
The latency budget is: 20ms DNS and TLS at edge, 5ms routing to origin, 10ms trie lookup from Redis, 15ms personal re-ranking from Cassandra (parallel with trie lookup), 10ms JSON serialization, 40ms remaining for network variance. Personal re-ranking runs concurrently with the trie lookup using a Go goroutine; results are merged after both complete or after a 30ms timeout, whichever comes first. If personal history times out, global ranking is used alone.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering