Search autocomplete is one of the most latency-sensitive features in any product. Users expect suggestions within 100ms of each keystroke. This post walks through the full low-level design: data models, trie vs Redis sorted sets, aggregation pipelines, debouncing, and ranking signals.
## Requirements and Scale
- 10M DAU, peak ~5K queries per second
- p99 suggestion latency: 100ms end-to-end including network
- Return top 5 suggestions per prefix
- New trending queries must surface within 1 hour
- Handle 26+ languages, profanity filtering, geo-specific results
At 10M DAU averaging 5 searches per user per day, that is 50M completed searches/day, or ~580 QPS sustained. Each search generates several keystroke requests, so with 3-5x bursts, peak suggestion traffic approaches the ~5K QPS requirement.
## Data Model

### queries table (persistent store)

```sql
CREATE TABLE queries (
    query_text VARCHAR(255) PRIMARY KEY,
    count      BIGINT DEFAULT 0,
    last_seen  TIMESTAMP,
    INDEX idx_count (count DESC)
);
```
### prefix_suggestions cache (Redis)

One sorted set per prefix; score = ranking signal, member = query text. Keys look like `pfx:sea`, `pfx:sear`, `pfx:searc`, etc.
## Trie with Top-K per Node
A classic trie stores characters at each node. To support fast top-K retrieval, each node also caches the top-5 query completions for that prefix – sorted by score.
```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_k = []  # sorted list of (score, query_text), max 5 entries

class AutocompleteTrie:
    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k

    def insert(self, query: str, score: float):
        # Walk (and create) the path, refreshing each node's cached top-k.
        node = self.root
        for char in query:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
            self._update_top_k(node, query, score)

    def _update_top_k(self, node, query, score):
        # Remove the old entry for this query if present, then re-insert.
        node.top_k = [(s, q) for s, q in node.top_k if q != query]
        node.top_k.append((score, query))
        node.top_k.sort(reverse=True)
        node.top_k = node.top_k[:self.k]

    def search(self, prefix: str):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        return [q for _, q in node.top_k]
```
Why trie writes are expensive: when the count for "search engine" ticks to 1001, every node on the path from root to leaf must refresh its cached top-k. That is O(L) node updates, where L is the query length (up to 100 chars). At tens of millions of queries per day, you cannot afford to do this synchronously on the hot path. Instead, the trie is rebuilt periodically from aggregated data, not updated on every query.
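One way to sketch the periodic rebuild is to skip linked nodes entirely and emit a flat prefix-to-top-k table, which can then be swapped in atomically behind a reference. This is a minimal illustration, not the trie classes above; `rebuild_topk` is a hypothetical helper fed by rows from the queries table:

```python
import heapq
from collections import defaultdict

def rebuild_topk(rows, k=5):
    """Build a fresh prefix -> top-k completions map from aggregated rows.

    rows: iterable of (query_text, score), e.g. from the queries table.
    The finished dict replaces the old one in a single reference swap,
    so readers never observe a half-built structure.
    """
    buckets = defaultdict(list)  # prefix -> [(score, query), ...]
    for query, score in rows:
        for i in range(1, len(query) + 1):
            buckets[query[:i]].append((score, query))
    # Keep only the k highest-scoring completions per prefix.
    return {p: [q for _, q in heapq.nlargest(k, cands)]
            for p, cands in buckets.items()}
```

Because the rebuild touches each (query, prefix) pair exactly once, its cost is bounded by total prefix count rather than by per-query write amplification on the hot path.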
## Redis Sorted Set Approach
Redis sorted sets give you O(log N) insert and O(log N + K) range query. For each prefix of length 1 to len(query), store the query as a member with its score.
```python
# When query "search engine" gets score 1500:
query = "search engine"
for i in range(1, len(query) + 1):
    redis.zadd(f"pfx:{query[:i]}", {query: 1500})

# Retrieve top 5 for prefix "sea":
results = redis.zrevrange("pfx:sea", 0, 4)

# Increment score (for incremental updates):
redis.zincrby("pfx:sea", 1, "search engine")

# Set TTL to avoid stale keys for obscure prefixes:
redis.expire("pfx:sea", 86400 * 7)  # 7 days
```
Memory estimate: Average query = 10 chars = 10 prefixes. Top 10K queries contribute 100K prefix-member entries. Each entry ~50 bytes. Total ~5MB for top 10K queries – very manageable. At top 1M queries: ~500MB, still fine for a Redis instance.
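The estimate above is simple enough to sanity-check in code. A tiny helper (hypothetical, using the same ~50 bytes/entry assumption) makes the arithmetic explicit:

```python
def redis_memory_estimate(unique_queries, avg_prefixes=10, bytes_per_entry=50):
    """Rough sorted-set footprint: one entry per (query, prefix) pair."""
    return unique_queries * avg_prefixes * bytes_per_entry

print(redis_memory_estimate(10_000))     # 5,000,000 bytes ~= 5 MB
print(redis_memory_estimate(1_000_000))  # ~= 500 MB
```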
ZINCRBY vs ZADD: Use ZADD NX to add new entries and ZINCRBY for incremental updates from the aggregation pipeline. Never call ZINCRBY on every raw query – batch first.
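The batch-first rule can be sketched as a pure function that turns one aggregation cycle's deltas into the command stream a caller would replay through a redis-py pipeline. `batch_commands` and its return shape are illustrative assumptions, not a real client API; note that ZINCRBY alone would also create missing members, so the ZADD NX step matters only when you want to seed a new entry at a computed base score:

```python
def batch_commands(deltas, base_scores):
    """Turn one cycle's aggregated deltas into Redis commands.

    deltas: {query_text: count_delta} accumulated since the last cycle
    base_scores: {query_text: seed score} for queries not yet cached
    Returns (command, key, args) tuples to replay via a pipeline.
    """
    cmds = []
    for query, delta in deltas.items():
        for i in range(1, len(query) + 1):
            key = f"pfx:{query[:i]}"
            # ZADD NX seeds new members without clobbering existing scores.
            cmds.append(("ZADD NX", key, {query: base_scores.get(query, 0)}))
            # One ZINCRBY carries the whole batch's delta, not one per raw query.
            cmds.append(("ZINCRBY", key, (delta, query)))
    return cmds
```

Batching this way means a query seen 1,000 times in a cycle costs one increment per prefix instead of 1,000.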
## Query Aggregation Pipeline
Writing to Redis on every raw query would mean ~580 queries/sec, each fanning out to ~10 prefixes, or ~5,800 ZINCRBY calls/sec. Manageable, but unnecessary: batching reduces load and smooths spikes.
```python
# Simplified aggregation pipeline (runs every 5-15 minutes).
# `db` and `redis` are pre-configured client handles.
import math
import time
from collections import Counter

def aggregate_and_update(log_lines):
    # 1. Aggregate raw log lines into per-query deltas.
    counts = Counter()
    for line in log_lines:
        query = line['query'].lower().strip()
        if 2 <= len(query) <= 100:
            counts[query] += 1

    for query, delta in counts.items():
        # 2. Update the persistent store.
        db.execute("""
            INSERT INTO queries (query_text, count, last_seen)
            VALUES (%s, %s, NOW())
            ON DUPLICATE KEY UPDATE
                count = count + %s,
                last_seen = NOW()
        """, (query, delta, delta))

        # 3. Recalculate the decayed score from the DB.
        row = db.fetchone(
            "SELECT count, last_seen FROM queries WHERE query_text = %s",
            (query,))
        age_hours = (time.time() - row['last_seen'].timestamp()) / 3600
        # 7-day time constant: weight falls to ~37% (1/e) after 168 hours.
        score = row['count'] * math.exp(-age_hours / 168)

        # 4. Push the new score to every prefix key in one pipeline.
        pipe = redis.pipeline()
        for i in range(1, len(query) + 1):
            prefix = query[:i]
            pipe.zadd(f"pfx:{prefix}", {query: score})
            pipe.expire(f"pfx:{prefix}", 86400 * 7)
        pipe.execute()
```
For trending queries (must surface within 1 hour), run a separate high-frequency pipeline every 5 minutes on the last 5 minutes of query logs.
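A natural trigger for that high-frequency path is to flag queries whose recent volume runs well ahead of their 7-day rolling average (the 3x ratio below mirrors that heuristic; `detect_trending` and its thresholds are an illustrative sketch):

```python
def detect_trending(hourly_counts, weekly_counts, ratio=3.0, min_count=10):
    """Flag queries whose last-hour count exceeds `ratio` times their
    7-day hourly average. min_count filters one-off noise and typos.

    hourly_counts: {query: count over the last hour}
    weekly_counts: {query: total count over the trailing 7 days}
    """
    trending = []
    for query, hc in hourly_counts.items():
        if hc < min_count:
            continue
        hourly_avg = weekly_counts.get(query, 0) / (7 * 24)
        # Brand-new queries (no history) with real volume also qualify.
        if hourly_avg == 0 or hc > ratio * hourly_avg:
            trending.append(query)
    return trending
```

Flagged queries bypass the normal batch cadence and get their prefix keys refreshed immediately.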
## Frontend Debouncing
Without debouncing, a user typing “search” fires 6 requests: “s”, “se”, “sea”, “sear”, “searc”, “search”. Debouncing waits until keystrokes pause before sending.
```javascript
// JavaScript debounce for typeahead
let debounceTimer = null;
let activeController = null;

function onInput(value) {
  clearTimeout(debounceTimer);
  if (activeController) {
    activeController.abort(); // cancel in-flight request
  }
  debounceTimer = setTimeout(() => {
    if (value.length < 2) return; // mirrors the 2-char minimum used in aggregation
    activeController = new AbortController();
    fetch(`/suggest?q=${encodeURIComponent(value)}`, { signal: activeController.signal })
      .then(r => r.json())
      .then(renderSuggestions)
      .catch(e => { if (e.name !== 'AbortError') console.error(e); });
  }, 200); // 150-300ms is the sweet spot
}
```
200ms debounce reduces requests by ~70% for average typists (5 chars/sec). For slow typists it fires on almost every character; for fast typists it fires only once or twice per word.
## Ranking Signal
Pure frequency ranking surfaces stale queries. A recency-weighted score handles trending topics:
```python
import math

def compute_score(count: int, last_seen_hours_ago: float) -> float:
    # 7-day time constant: a query from 168 hours ago keeps ~37% (1/e)
    # of its weight; after three weeks it retains under 5%.
    recency_weight = math.exp(-last_seen_hours_ago / 168)
    return count * recency_weight

# Personalization bonus: if the user has searched this query before
def compute_personalized_score(base_score, user_query_count):
    return base_score * (1 + 0.3 * min(user_query_count, 5))
```
Personalization is applied client-side or in a thin personalization layer, not in the shared Redis cache. The shared cache stores the global top-5; the client re-ranks by mixing in personal history.
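That client-side re-rank can be sketched as follows. It assumes the personal history has already been fetched from a per-user store; `merge_suggestions` is a hypothetical helper, and the small base score given to personal-only matches is an assumption, not part of the source design:

```python
def merge_suggestions(global_top, personal_history, prefix, boost=0.3, k=5):
    """Re-rank global suggestions using the user's own search history.

    global_top: [(query, global_score), ...] from the shared cache
    personal_history: {query: times this user has searched it}
    Applies the compute_personalized_score-style bonus:
    score * (1 + boost * min(personal_count, 5)).
    """
    scored = {}
    for query, score in global_top:
        bonus = 1 + boost * min(personal_history.get(query, 0), 5)
        scored[query] = score * bonus
    # Personal queries matching the prefix that missed the global top-5
    # get a small base score so they can still surface (assumed heuristic).
    for query, n in personal_history.items():
        if query.startswith(prefix) and query not in scored:
            scored[query] = boost * min(n, 5)
    return [q for q, _ in sorted(scored.items(), key=lambda kv: -kv[1])[:k]]
```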
## Filtering
- Block list: a Redis set `blocked_queries`, checked before returning results. Managed by a content moderation team.
- Country filtering: per-country prefix keys (`pfx:us:sea`, `pfx:de:sea`) or a post-filter layer that removes geo-restricted content.
- Minimum frequency threshold: only promote queries with count > 10 to avoid surfacing typos and one-off searches.
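The block-list and frequency checks compose into one pass over the ranked candidates. A minimal sketch, assuming the blocked set has been mirrored locally (in production you would over-fetch, e.g. the top 10, so filtering still leaves 5 results):

```python
def filter_suggestions(candidates, blocked, min_count=10, k=5):
    """Apply the block list and minimum-frequency threshold.

    candidates: [(query, count), ...] already ranked best-first
    blocked: set of blocked query strings (mirrors the Redis set)
    """
    out = []
    for query, count in candidates:
        # Drop moderated terms and queries below the promotion threshold.
        if query in blocked or count <= min_count:
            continue
        out.append(query)
        if len(out) == k:
            break
    return out
```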
## Architecture

```
Client Browser
      |
      | (debounced HTTP GET /suggest?q=sea)
      v
CDN Edge Cache (cache popular prefixes, 10s TTL)
      |
      | (cache miss)
      v
Suggestion Service (stateless, horizontally scaled)
      |
      v
Redis Cluster (sorted sets per prefix)
      |
      | (async)
      v
Query Log Kafka Topic
      |
      v
Batch Aggregator (runs every 5-15 min)
      |
      v
MySQL queries table --> Redis Cluster (score updates)
```
## Scale Numbers
- Average query length: 10 chars = 10 prefixes generated per query
- 50M queries/day x 10 prefixes = 500M prefix updates/day, or ~5.8K prefix updates/sec if done naively per query
- With batching: aggregate first, then update unique (query, prefix) pairs only – reduces to ~1M unique prefix updates per batch cycle
- CDN caches the top ~1000 prefixes (2-3 chars) – covers ~80% of traffic
- Redis memory for 10M unique queries at 10 prefixes each: ~5GB – one mid-size Redis node
- Read path: CDN hit – <5ms; Redis hit – <10ms; well within 100ms p99
The key insight: the read path (CDN + Redis) is simple and fast. All complexity is pushed to the write path (aggregation pipeline), which runs asynchronously and can tolerate higher latency.