System Design Interview: Design a Search Engine
Designing a search engine tests your understanding of inverted indexes, relevance ranking, distributed indexing, and query processing pipelines. Companies like Google, Elastic, Algolia, and LinkedIn ask variations of this problem.
Requirements Clarification
Functional Requirements
- Index documents from a corpus (web pages, product listings, articles)
- Support full-text search with keyword queries
- Return results ranked by relevance
- Support phrase search, boolean operators (AND, OR, NOT)
- Near-real-time indexing (new docs searchable within seconds to minutes)
Non-Functional Requirements
- Scale: 10B documents, 50K queries per second
- Latency: p99 query < 100ms
- Availability: 99.99% (search must always be available)
- Index freshness: new documents indexed within 60 seconds
Core Data Structure: Inverted Index
An inverted index maps terms to the list of documents containing them:
"apple" -> [(doc1, pos:[2,15]), (doc5, pos:[1]), (doc12, pos:[7,23])]
"banana" -> [(doc3, pos:[5]), (doc5, pos:[4])]
Each posting stores: document ID, term frequency, positions (for phrase search), field (title vs body weight differently).
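A minimal positional index can be sketched in a few lines of Python (simplified: real postings also carry term frequency, field, and compression):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings dict of doc_id -> token positions."""
    index = defaultdict(dict)  # term -> {doc_id: [positions]}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {1: "apple pie with apple", 2: "banana split"}
idx = build_inverted_index(docs)
# idx["apple"] -> {1: [0, 3]}
```

Positions make phrase search possible: "apple pie" matches only where the position of "pie" is exactly one greater than a position of "apple" in the same document.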
Index Segments
Lucene (the engine under Elasticsearch) writes immutable segments. New documents go to an in-memory buffer that is flushed to a new segment roughly every second (near-real-time refresh). Segments are periodically merged (tiered merge policy); deletes are recorded as tombstones and physically removed during merges.
High-Level Architecture
Write path:
  Crawlers/Ingest -> Document Queue (Kafka)
      -> Document Processing Pipeline (tokenize, normalize, extract fields)
      -> Index Writers (shard by doc_id % N)
      -> Segment files on SSD

Query path:
  Client -> Query Router -> scatter to N Index Shards (each primary + replicas)
      -> gather: Merge & Rank results
      -> Result Cache (Redis) -> Client
Document Processing Pipeline
- Tokenization: split text into tokens on whitespace and punctuation
- Normalization: lowercase, remove diacritics
- Stop word removal: remove “the”, “is”, “at” (configurable)
- Stemming / Lemmatization: “running” to “run” (Porter stemmer, Snowball)
- Synonyms: expand query or index with synonym dictionary
- Field extraction: title (3x weight), body (1x), tags (2x)
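The analysis steps above can be sketched as a small pipeline (stemming is omitted here since it needs a stemmer library; the stop-word list is illustrative):

```python
import re

STOP_WORDS = {"the", "is", "at", "a", "of"}  # configurable, illustrative subset

def process(text):
    """Tokenize on non-alphanumerics, lowercase, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

process("The Quick Brown Fox is at the door.")
# -> ['quick', 'brown', 'fox', 'door']
```

The same pipeline must run on both documents at index time and queries at search time, or terms will not match.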
Relevance Ranking
TF-IDF
Term Frequency x Inverse Document Frequency. TF = occurrences of the term in a document / total terms in that document. IDF = log(N / df), where N is the total document count and df is the number of documents containing the term. The IDF factor penalizes common words and rewards rare, discriminating terms.
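In code, using the definitions above (one common variant; many TF and IDF weightings exist):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of a term for one document.
    corpus: list of token lists, one per document."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf
```

Note that a term appearing in every document gets IDF = log(N/N) = 0, so it contributes nothing to the score.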
BM25 (Okapi BM25)
BM25 improves TF-IDF with document length normalization and saturation. BM25 is the default in Elasticsearch, Lucene, and most modern systems with k1 ~ 1.2 and b ~ 0.75.
Learning to Rank (LTR)
Train a gradient-boosted ranking model (LambdaMART, e.g. via XGBoost or LightGBM) on click-through data. Features: BM25 score, PageRank, click-through rate, freshness, user personalization signals. At query time, rerank the top-K BM25 results with the LTR model.
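The reranking step itself is simple once a model exists; a sketch with a hypothetical linear model standing in for the gradient-boosted one (feature names and weights are made up for illustration):

```python
def rerank(candidates, weights):
    """Rerank BM25 top-K candidates with a learned per-feature weighting.
    candidates: list of (doc_id, feature_dict) pairs."""
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)

weights = {"bm25": 1.0, "click_rate": 2.0}  # hypothetical learned weights
candidates = [("a", {"bm25": 1.0, "click_rate": 0.1}),
              ("b", {"bm25": 0.8, "click_rate": 0.5})]
rerank(candidates, weights)  # "b" now outranks "a" despite a lower BM25 score
```

Reranking only the top-K (a few hundred docs) keeps the expensive model off the critical path for the millions of candidates BM25 already filtered out.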
Distributed Architecture
Sharding
Shard index by document ID (hash-based). Each shard = one Lucene index. With 10B docs and 100 shards, each shard holds 100M docs. Query broadcasts to all shards (scatter), each returns top-K, coordinator merges (gather) and returns global top-K.
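The gather-side merge can be sketched with a heap; this is exact for global top-K precisely because every shard returns its own full top-K (deep pagination breaks this, since page P requires each shard to return P*K hits):

```python
import heapq

def merge_top_k(shard_results, k):
    """Merge per-shard top-k lists of (score, doc_id) into a global top-k."""
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits)

shard_results = [
    [(0.9, "d1"), (0.4, "d7")],   # shard 0's local top-2
    [(0.8, "d3"), (0.7, "d5")],   # shard 1's local top-2
]
merge_top_k(shard_results, 2)
# -> [(0.9, 'd1'), (0.8, 'd3')]
```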
Replication
Each shard has 1 primary + 2 replicas. Reads served from any replica (load balance). Writes go to primary, replicated async to replicas. Replicas also serve queries for linear read scalability.
Query Path
- Parse query (Lucene query parser, DSL)
- Query rewrite (synonym expansion, spell correction)
- Scatter to all shards in parallel
- Each shard returns top-K (doc_id, score) pairs
- Coordinator merges, fetches stored fields for top results
- Apply LTR reranking
- Return to client
Near-Real-Time Indexing
Lucene refresh interval: the in-memory buffer is flushed to a new searchable segment every 1 second. The flush does not fsync the segment (fast), yet the new segment is immediately visible to searches. Durability between full commits comes from a write-ahead translog; a full commit (fsync of segments) happens far less often, on the order of tens of seconds or when the translog grows large. This gives ~1s indexing latency without per-document commit overhead.
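The visibility rule can be made concrete with a toy writer (a hypothetical class, not the Lucene API): documents in the buffer are durable-pending but invisible until refresh() promotes them to an immutable segment.

```python
class NRTWriter:
    """Toy near-real-time writer: docs become searchable only on refresh()."""
    def __init__(self):
        self.buffer = []    # indexed but NOT yet searchable
        self.segments = []  # immutable, searchable segments

    def add(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        """Flush the buffer to a new searchable segment (no fsync here)."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, term):
        return [d for seg in self.segments for d in seg if term in d]
```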
Caching Strategy
- Query cache: cache results for identical queries (LRU, evict on index update)
- Filter cache: cache bitsets for filter clauses (term filters, range filters)
- Field data cache: cache field values for sorting/aggregations
- Result cache: Redis cache for top search results with short TTL (5-30s)
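The query cache's two eviction rules (LRU capacity plus TTL staleness) can be sketched in-process; a production system would use Redis or the engine's built-in caches instead:

```python
import time
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache with TTL for query results (illustrative only)."""
    def __init__(self, max_size=1024, ttl=30.0):
        self.max_size, self.ttl = max_size, ttl
        self.data = OrderedDict()  # query -> (expires_at, results)

    def get(self, query):
        entry = self.data.get(query)
        if entry is None or entry[0] < time.monotonic():
            self.data.pop(query, None)  # miss or expired
            return None
        self.data.move_to_end(query)    # mark as recently used
        return entry[1]

    def put(self, query, results):
        self.data[query] = (time.monotonic() + self.ttl, results)
        self.data.move_to_end(query)
        if len(self.data) > self.max_size:
            self.data.popitem(last=False)  # evict least recently used
```

The short TTL is what bounds staleness against index updates; invalidating precisely on update is usually not worth the bookkeeping.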
Spell Correction and Autocomplete
- Spell correction: Edit distance (Levenshtein) against indexed terms. Build a finite state transducer (FST) for fast fuzzy lookup. Return did-you-mean for queries with 0 or few results.
- Autocomplete: Prefix trie or edge n-gram tokenizer. Store popular queries in Redis sorted set by frequency. Return top-5 completions as user types.
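The did-you-mean core is just edit distance against the term dictionary; a brute-force sketch (an FST makes the same lookup fast enough for millions of terms):

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def suggest(query, vocabulary, max_dist=2):
    """Did-you-mean: indexed terms within max_dist edits of the query."""
    return sorted(t for t in vocabulary if edit_distance(query, t) <= max_dist)
```

In practice candidates are also ranked by term frequency, so a common word beats an obscure one at equal edit distance.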
Scale Numbers
- 10B documents x 1KB avg = 10TB raw data
- Inverted index is approximately 30-50% of raw = 3-5TB
- 100 shards, 3 replicas = 300 shard instances
- 50K QPS with scatter-gather fans every query out to all 100 shards, so each shard sees 50K shard-queries/s; spread across 3 copies per shard, each instance serves ~17K/s
- Segment merge I/O: design for SSDs, tune merge policy
Interview Tips
- Start with the inverted index: it is the core data structure
- Explain the trade-off between index freshness and merge overhead
- Discuss BM25 vs TF-IDF; mention LTR as the production approach
- Know scatter-gather query execution and why top-K merging is tricky
- Mention Elasticsearch, Solr, or Typesense as real implementations