Typeahead Service System Design Overview
A typeahead service extends basic search suggest by adding session context awareness, distributed prefix indexing across multiple data domains, and intelligent debounce batching to reduce server load. Where a simple suggest system returns global completions, a typeahead service learns from the current session and nearby entity context to surface the most relevant completions first.
Requirements
Functional Requirements
- Return ranked completions for user, hashtag, location, and query entity types.
- Boost results from entities the user has recently interacted with in the current session.
- Batch client-side keystrokes with debounce to reduce unnecessary server round trips.
- Support wildcard prefix matching for partial-word queries.
- Allow product teams to register new entity type indexes without service redeployment.
Non-Functional Requirements
- P99 latency under 80ms from server receipt to response.
- Support 100,000 concurrent active typeahead sessions.
- Index update propagation under 30 seconds for new entities.
- Horizontal scalability with no shared mutable state between service instances.
Data Model
- prefix_index: Redis sorted set per (entity_type, locale, prefix) storing top-100 entity_ids by base_score.
- entity_metadata: Redis hash per entity_id: display_name, entity_type, avatar_url, follower_count, base_score.
- session_context: Redis hash per session_id: recently viewed entity_ids with interaction timestamps, TTL 30 minutes.
- index_registry: Postgres table (entity_type, shard_key, index_version, last_rebuilt_at) for operational visibility.
Core Algorithms
Distributed Prefix Index
Each entity type maintains its own prefix index sharded by the first character of the prefix. This spreads load across 26 shards per entity type and keeps each sorted set small enough for sub-millisecond ZREVRANGE lookups. Index writes are funneled through a Kafka topic per entity type. A pool of index writer workers consumes events, computes all prefixes of the entity name (every leading substring, from the first character up to the full name, capped at 20 characters), and issues ZADD commands to the appropriate Redis shard. Writes are idempotent: re-indexing an entity with an updated score simply overwrites the existing ZADD entry.
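As a sketch, the prefix expansion and key derivation an index writer performs might look like the following (the function names and exact key layout are illustrative assumptions, not the service's actual code):

```python
def compute_prefixes(name: str, cap: int = 20) -> list[str]:
    """Every leading substring of the lowercased entity name, capped at `cap` chars."""
    name = name.lower()[:cap]
    return [name[:i] for i in range(1, len(name) + 1)]

def index_key(entity_type: str, locale: str, prefix: str) -> str:
    """Sorted-set key per (entity_type, locale, prefix); including the first
    character makes routing by that character (26 shards per type) trivial."""
    return f"prefix:{entity_type}:{locale}:{prefix[0]}:{prefix}"

# A writer would then issue, for each prefix:
#   ZADD index_key(entity_type, locale, prefix) base_score entity_id
# Re-running the same ZADD with a new score overwrites the member's score,
# which is what makes re-indexing idempotent.
```

Because ZADD on an existing member simply updates its score, replaying an indexing event is harmless, so the Kafka consumers need no deduplication logic.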
Debounce Batching
The client SDK implements a 150ms debounce window. Keystrokes within 150ms of each other are collapsed into a single request carrying the latest prefix. The server also performs server-side coalescing: if two requests from the same session_id arrive within 50ms with overlapping prefixes, the earlier request is dropped and only the longer prefix is processed. This is implemented via a Redis key (session_id:inflight) set to the current prefix with a 50ms TTL; if the key already exists with the same or shorter prefix when a new request arrives, the old response future is cancelled.
Session-Context Boosting
Before returning results, the service reads the session's recently interacted entity_ids from Redis. For each candidate entity in the result set, an affinity boost of up to 0.5 is added to the base_score if the entity appears in session context, scaled by recency (linear decay over 30 minutes). This ensures that if a user has just visited a profile, that profile appears at the top of typeahead when they begin typing its name again.
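The boost and re-ranking step can be sketched as follows (a minimal illustration of the linear decay described above; names and the candidate/session shapes are assumptions):

```python
SESSION_TTL_S = 30 * 60  # session context expires after 30 minutes
MAX_BOOST = 0.5

def affinity_boost(interaction_ts: float, now: float) -> float:
    """Linear decay from MAX_BOOST at interaction time to 0 after 30 minutes."""
    age = now - interaction_ts
    if age < 0 or age >= SESSION_TTL_S:
        return 0.0
    return MAX_BOOST * (1 - age / SESSION_TTL_S)

def rank(candidates: list[tuple[str, float]],
         session: dict[str, float],
         now: float) -> list[tuple[str, float]]:
    """candidates: (entity_id, base_score); session: entity_id -> last interaction ts."""
    boosted = [
        (eid, score + (affinity_boost(session[eid], now) if eid in session else 0.0))
        for eid, score in candidates
    ]
    return sorted(boosted, key=lambda pair: -pair[1])
```

A profile visited seconds ago gets the full 0.5 boost, which is typically enough to lift it past globally popular entities with higher base scores.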
Cache Warming
Cold starts after index rebuilds cause latency spikes on the first request for each prefix. A warming job runs immediately after each index rebuild and issues synthetic requests for the top 5,000 prefixes per entity type, populating the local in-process LRU cache (512MB per service instance) with freshly fetched result sets. The LRU cache uses a 10-second TTL to stay consistent with index updates while absorbing repeated identical prefix lookups within the debounce window.
Scalability
Service instances are stateless; session context is stored exclusively in Redis. Horizontal scaling is triggered by CPU utilization above 70%, with a minimum of 10 instances and a maximum of 200. The Redis prefix index cluster uses 64 shards with replication factor 2. A circuit breaker per entity type trips after 5 consecutive timeouts, returning empty results for that type while continuing to serve other entity types, degrading gracefully rather than failing entirely.
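The per-entity-type breaker described above amounts to a consecutive-failure counter; a minimal sketch (class and method names are illustrative):

```python
class CircuitBreaker:
    """Trips after `threshold` consecutive timeouts; any success resets it."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.consecutive_timeouts = 0

    @property
    def open(self) -> bool:
        return self.consecutive_timeouts >= self.threshold

    def record_timeout(self) -> None:
        self.consecutive_timeouts += 1

    def record_success(self) -> None:
        self.consecutive_timeouts = 0
```

The request path keeps one breaker per entity type: if `breaker.open` is true, the service skips that type's Redis lookup and returns an empty group, so one unhealthy index cannot stall the whole response.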
API Design
GET /v1/typeahead
- Query params: q (prefix), types (comma-separated entity types, default all), session_id, locale, limit (default 8 per type)
- Response: grouped result object: {users: […], hashtags: […], locations: […], queries: […]}
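An illustrative response payload for `GET /v1/typeahead?q=ad&types=users` (the per-entity field names beyond the grouped keys are assumptions drawn from the entity_metadata model, not a documented contract):

```json
{
  "users": [
    {
      "entity_id": "u:482",
      "display_name": "Ada Lovelace",
      "avatar_url": "https://example.com/avatars/u482.png",
      "score": 1.73
    }
  ],
  "hashtags": [],
  "locations": [],
  "queries": []
}
```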
POST /v1/typeahead/context
- Body: session_id, entity_id, entity_type, interaction_type (view, click, select)
- Response: 204 No Content
- Called by client on every entity interaction to update session context.