Search Autocomplete System Low-Level Design

Search autocomplete is one of the most latency-sensitive features in any product. Users expect suggestions within 100ms of each keystroke. This post walks through the full low-level design: data models, trie vs Redis sorted sets, aggregation pipelines, debouncing, and ranking signals.

Requirements and Scale

  • 10M DAU, peak ~5K queries per second
  • p99 suggestion latency: 100ms end-to-end including network
  • Return top 5 suggestions per prefix
  • New trending queries must surface within 1 hour
  • Handle 26+ languages, profanity filtering, geo-specific results

At 10M DAU with average 5 searches per user per day, that is 50M queries/day or ~580 QPS sustained with 3-5x peak bursts.

Data Model

queries table (persistent store)

CREATE TABLE queries (
    query_text   VARCHAR(255) PRIMARY KEY,
    count        BIGINT DEFAULT 0,
    last_seen    TIMESTAMP,
    INDEX idx_count (count DESC)
);

prefix_suggestions cache (Redis)
One sorted set per prefix, score = ranking signal, member = query text. Keys look like pfx:sea, pfx:sear, pfx:searc, etc.

Trie with Top-K per Node

A classic trie stores characters at each node. To support fast top-K retrieval, each node also caches the top-5 query completions for that prefix – sorted by score.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_k = []  # sorted list of (score, query_text), max 5 entries

class AutocompleteTrie:
    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k

    def insert(self, query: str, score: float):
        node = self.root
        for char in query:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
            self._update_top_k(node, query, score)

    def _update_top_k(self, node, query, score):
        # Remove old entry for this query if present
        node.top_k = [(s, q) for s, q in node.top_k if q != query]
        node.top_k.append((score, query))
        node.top_k.sort(reverse=True)
        node.top_k = node.top_k[:self.k]

    def search(self, prefix: str):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        return [q for _, q in node.top_k]

Why trie writes are expensive: When query “search engine” gets count 1001, you must update every ancestor node from root to the leaf – that is O(L) node updates where L = query length (up to 100 chars). With 1B queries/day you cannot do this synchronously. The trie is rebuilt periodically from aggregated data, not updated on every query.

Redis Sorted Set Approach

Redis sorted sets give you O(log N) insert and O(log N + K) range query. For each prefix of length 1 to len(query), store the query as a member with its score.

# When query "search engine" gets score 1500:
for i in range(1, len("search engine") + 1):
    prefix = "search engine"[:i]
    redis.zadd(f"pfx:{prefix}", {"search engine": 1500})

# Retrieve top 5 for prefix "sea":
results = redis.zrevrange("pfx:sea", 0, 4)

# Increment score (for incremental updates):
redis.zincrby("pfx:sea", 1, "search engine")

# Set TTL to avoid stale keys for obscure prefixes:
redis.expire("pfx:sea", 86400 * 7)  # 7 days

Memory estimate: Average query = 10 chars = 10 prefixes. Top 10K queries contribute 100K prefix-member entries. Each entry ~50 bytes. Total ~5MB for top 10K queries – very manageable. At top 1M queries: ~500MB, still fine for a Redis instance.

ZINCRBY vs ZADD: Use ZADD NX to add new entries and ZINCRBY for incremental updates from the aggregation pipeline. Never call ZINCRBY on every raw query – batch first.

Query Aggregation Pipeline

Direct write-per-query to Redis would be 580 ZINCRBY calls per second per prefix, multiplied by 10 chars average = 5,800 Redis writes/sec. Manageable but unnecessary. Batching reduces load and smooths spikes.

# Simplified aggregation pipeline (runs every 5-15 minutes)
def aggregate_and_update(log_lines):
    from collections import Counter
    import math, time

    counts = Counter()
    for line in log_lines:
        query = line['query'].lower().strip()
        if 2 <= len(query) <= 100:
            counts[query] += 1

    for query, delta in counts.items():
        # Update persistent store
        db.execute("""
            INSERT INTO queries (query_text, count, last_seen)
            VALUES (%s, %s, NOW())
            ON DUPLICATE KEY UPDATE
                count = count + %s,
                last_seen = NOW()
        """, (query, delta, delta))

        # Recalculate score from DB
        row = db.fetchone("SELECT count, last_seen FROM queries WHERE query_text=%s", (query,))
        age_hours = (time.time() - row['last_seen'].timestamp()) / 3600
        score = row['count'] * math.exp(-age_hours / 168)  # 7-day half-life

        # Update all prefixes in Redis
        pipe = redis.pipeline()
        for i in range(1, len(query) + 1):
            prefix = query[:i]
            pipe.zadd(f"pfx:{prefix}", {query: score})
            pipe.expire(f"pfx:{prefix}", 86400 * 7)
        pipe.execute()

For trending queries (must surface within 1 hour), run a separate high-frequency pipeline every 5 minutes on the last 5 minutes of query logs.

Frontend Debouncing

Without debouncing, a user typing “search” fires 6 requests: “s”, “se”, “sea”, “sear”, “searc”, “search”. Debouncing waits until keystrokes pause before sending.

// JavaScript debounce for typeahead
let debounceTimer = null;
let activeController = null;

function onInput(value) {
    clearTimeout(debounceTimer);
    if (activeController) {
        activeController.abort();  // cancel in-flight request
    }
    debounceTimer = setTimeout(() => {
        if (value.length  r.json())
        .then(renderSuggestions)
        .catch(e => { if (e.name !== 'AbortError') console.error(e); });
    }, 200);  // 150-300ms is the sweet spot
}

200ms debounce reduces requests by ~70% for average typists (5 chars/sec). For slow typists it fires on almost every character; for fast typists it fires only once or twice per word.

Ranking Signal

Pure frequency ranking surfaces stale queries. A recency-weighted score handles trending topics:

import math

def compute_score(count: int, last_seen_hours_ago: float) -> float:
    # 7-day half-life: a query from 168 hours ago has half the weight
    recency_weight = math.exp(-last_seen_hours_ago / 168)
    return count * recency_weight

# Personalization bonus: if user has searched this query before
def compute_personalized_score(base_score, user_query_count):
    return base_score * (1 + 0.3 * min(user_query_count, 5))

Personalization is applied client-side or in a thin personalization layer, not in the shared Redis cache. The shared cache stores the global top-5; the client re-ranks by mixing in personal history.

Filtering

  • Block list: A Redis set blocked_queries checked before returning results. Managed by a content moderation team.
  • Country filtering: Per-country prefix keys (pfx:us:sea, pfx:de:sea) or a post-filter layer that removes geo-restricted content.
  • Minimum frequency threshold: Only promote queries with count > 10 to avoid surfacing typos and one-off searches.

Architecture

Client Browser
    |
    | (debounced HTTP GET /suggest?q=sea)
    v
CDN Edge Cache (cache popular prefixes, 10s TTL)
    |
    | (cache miss)
    v
Suggestion Service (stateless, horizontally scaled)
    |
    v
Redis Cluster (sorted sets per prefix)
    |
    | (async)
    v
Query Log Kafka Topic
    |
    v
Batch Aggregator (runs every 5-15 min)
    |
    v
MySQL queries table  -->  Redis Cluster (score updates)

Scale Numbers

  • Average query length: 10 chars = 10 prefixes generated per query
  • 1B queries/day = 10B prefix entries to update per day = ~115K prefix updates/sec
  • With batching: aggregate first, then update unique (query, prefix) pairs only – reduces to ~1M unique prefix updates per batch cycle
  • CDN caches the top ~1000 prefixes (2-3 chars) – covers ~80% of traffic
  • Redis memory for 10M unique queries at 10 prefixes each: ~5GB – one mid-size Redis node
  • Read path: CDN hit – <5ms; Redis hit – <10ms; well within 100ms p99

The key insight: the read path (CDN + Redis) is simple and fast. All complexity is pushed to the write path (aggregation pipeline), which runs asynchronously and can tolerate higher latency.

Twitter search requires autocomplete at massive scale. See system design questions for Twitter/X interview: search autocomplete system design.

LinkedIn search uses typeahead and autocomplete systems. See design patterns for LinkedIn interview: typeahead and search autocomplete design.

Snap search and discovery use autocomplete features. See system design patterns for Snap interview: search and autocomplete system design.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering

Scroll to Top