System Design Interview: Design a Search Engine
Designing a search engine tests your understanding of inverted indexes, relevance ranking, distributed indexing, and query processing pipelines. Companies like Google, Elastic, Algolia, and LinkedIn ask variations of this problem.
Requirements Clarification
Functional Requirements
- Index documents from a corpus (web pages, product listings, articles)
- Support full-text search with keyword queries
- Return results ranked by relevance
- Support phrase search, boolean operators (AND, OR, NOT)
- Near-real-time indexing (new docs searchable within seconds to minutes)
Non-Functional Requirements
- Scale: 10B documents, 50K queries per second
- Latency: p99 query < 100ms
- Availability: 99.99% (search must always be available)
- Index freshness: new documents indexed within 60 seconds
Core Data Structure: Inverted Index
An inverted index maps terms to the list of documents containing them:
"apple" -> [(doc1, pos:[2,15]), (doc5, pos:[1]), (doc12, pos:[7,23])]
"banana" -> [(doc3, pos:[5]), (doc5, pos:[4])]
Each posting stores: document ID, term frequency, positions (for phrase search), field (title vs body weight differently).
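A minimal positional index can be sketched in a few lines of Python (simplified: real postings also carry term frequency, field, and compression):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings dict of doc_id -> token positions."""
    index = defaultdict(dict)  # term -> {doc_id: [positions]}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {1: "apple pie with apple", 2: "banana split"}
idx = build_inverted_index(docs)
# idx["apple"] -> {1: [0, 3]}
```

Positions make phrase search possible: "apple pie" matches only where the position of "pie" is exactly one greater than a position of "apple" in the same document.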
Index Segments
Lucene (the engine under Elasticsearch) writes immutable segments. New documents go to an in-memory buffer that is flushed to a new segment roughly every second (near-real-time refresh). Segments are periodically merged (tiered merge policy); deletes are recorded as tombstones and physically removed during merges.
High-Level Architecture
Write path:
  Crawlers/Ingest -> Document Queue (Kafka)
      -> Document Processing Pipeline (tokenize, normalize, extract fields)
      -> Index Writers (shard by doc_id % N)
      -> Segment files on SSD

Query path:
  Client -> Query Router -> scatter to N Index Shards (each primary + replicas)
      -> gather: Merge & Rank results
      -> Result Cache (Redis) -> Client
Document Processing Pipeline
- Tokenization: split text into tokens on whitespace and punctuation
- Normalization: lowercase, remove diacritics
- Stop word removal: remove “the”, “is”, “at” (configurable)
- Stemming / Lemmatization: “running” to “run” (Porter stemmer, Snowball)
- Synonyms: expand query or index with synonym dictionary
- Field extraction: title (3x weight), body (1x), tags (2x)
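The analysis steps above can be sketched as a small pipeline (stemming is omitted here since it needs a stemmer library; the stop-word list is illustrative):

```python
import re

STOP_WORDS = {"the", "is", "at", "a", "of"}  # configurable, illustrative subset

def process(text):
    """Tokenize on non-alphanumerics, lowercase, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

process("The Quick Brown Fox is at the door.")
# -> ['quick', 'brown', 'fox', 'door']
```

The same pipeline must run on both documents at index time and queries at search time, or terms will not match.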
Relevance Ranking
TF-IDF
Term Frequency x Inverse Document Frequency. TF = occurrences of the term in a document / total terms in that document. IDF = log(N / df), where N is the total document count and df is the number of documents containing the term. The IDF factor penalizes common words and rewards rare, discriminating terms.
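In code, using the definitions above (one common variant; many TF and IDF weightings exist):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of a term for one document.
    corpus: list of token lists, one per document."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf
```

Note that a term appearing in every document gets IDF = log(N/N) = 0, so it contributes nothing to the score.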
BM25 (Okapi BM25)
BM25 improves TF-IDF with document length normalization and saturation. BM25 is the default in Elasticsearch, Lucene, and most modern systems with k1 ~ 1.2 and b ~ 0.75.
Learning to Rank (LTR)
Train a gradient-boosted ranking model (LambdaMART, e.g. via XGBoost or LightGBM) on click-through data. Features: BM25 score, PageRank, click-through rate, freshness, user personalization signals. At query time, rerank the top-K BM25 results with the LTR model.
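The reranking step itself is simple once a model exists; a sketch with a hypothetical linear model standing in for the gradient-boosted one (feature names and weights are made up for illustration):

```python
def rerank(candidates, weights):
    """Rerank BM25 top-K candidates with a learned per-feature weighting.
    candidates: list of (doc_id, feature_dict) pairs."""
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)

weights = {"bm25": 1.0, "click_rate": 2.0}  # hypothetical learned weights
candidates = [("a", {"bm25": 1.0, "click_rate": 0.1}),
              ("b", {"bm25": 0.8, "click_rate": 0.5})]
rerank(candidates, weights)  # "b" now outranks "a" despite a lower BM25 score
```

Reranking only the top-K (a few hundred docs) keeps the expensive model off the critical path for the millions of candidates BM25 already filtered out.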
Distributed Architecture
Sharding
Shard index by document ID (hash-based). Each shard = one Lucene index. With 10B docs and 100 shards, each shard holds 100M docs. Query broadcasts to all shards (scatter), each returns top-K, coordinator merges (gather) and returns global top-K.
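The gather-side merge can be sketched with a heap; this is exact for global top-K precisely because every shard returns its own full top-K (deep pagination breaks this, since page P requires each shard to return P*K hits):

```python
import heapq

def merge_top_k(shard_results, k):
    """Merge per-shard top-k lists of (score, doc_id) into a global top-k."""
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits)

shard_results = [
    [(0.9, "d1"), (0.4, "d7")],   # shard 0's local top-2
    [(0.8, "d3"), (0.7, "d5")],   # shard 1's local top-2
]
merge_top_k(shard_results, 2)
# -> [(0.9, 'd1'), (0.8, 'd3')]
```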
Replication
Each shard has 1 primary + 2 replicas. Reads served from any replica (load balance). Writes go to primary, replicated async to replicas. Replicas also serve queries for linear read scalability.
Query Path
- Parse query (Lucene query parser, DSL)
- Query rewrite (synonym expansion, spell correction)
- Scatter to all shards in parallel
- Each shard returns top-K (doc_id, score) pairs
- Coordinator merges, fetches stored fields for top results
- Apply LTR reranking
- Return to client
Near-Real-Time Indexing
Lucene refresh interval: the in-memory buffer is flushed to a new searchable segment every 1 second. The flush does not fsync the segment (fast), yet the new segment is immediately visible to searches. Durability between full commits comes from a write-ahead translog; a full commit (fsync of segments) happens far less often, on the order of tens of seconds or when the translog grows large. This gives ~1s indexing latency without per-document commit overhead.
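The visibility rule can be made concrete with a toy writer (a hypothetical class, not the Lucene API): documents in the buffer are durable-pending but invisible until refresh() promotes them to an immutable segment.

```python
class NRTWriter:
    """Toy near-real-time writer: docs become searchable only on refresh()."""
    def __init__(self):
        self.buffer = []    # indexed but NOT yet searchable
        self.segments = []  # immutable, searchable segments

    def add(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        """Flush the buffer to a new searchable segment (no fsync here)."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, term):
        return [d for seg in self.segments for d in seg if term in d]
```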
Caching Strategy
- Query cache: cache results for identical queries (LRU, evict on index update)
- Filter cache: cache bitsets for filter clauses (term filters, range filters)
- Field data cache: cache field values for sorting/aggregations
- Result cache: Redis cache for top search results with short TTL (5-30s)
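The query cache's two eviction rules (LRU capacity plus TTL staleness) can be sketched in-process; a production system would use Redis or the engine's built-in caches instead:

```python
import time
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache with TTL for query results (illustrative only)."""
    def __init__(self, max_size=1024, ttl=30.0):
        self.max_size, self.ttl = max_size, ttl
        self.data = OrderedDict()  # query -> (expires_at, results)

    def get(self, query):
        entry = self.data.get(query)
        if entry is None or entry[0] < time.monotonic():
            self.data.pop(query, None)  # miss or expired
            return None
        self.data.move_to_end(query)    # mark as recently used
        return entry[1]

    def put(self, query, results):
        self.data[query] = (time.monotonic() + self.ttl, results)
        self.data.move_to_end(query)
        if len(self.data) > self.max_size:
            self.data.popitem(last=False)  # evict least recently used
```

The short TTL is what bounds staleness against index updates; invalidating precisely on update is usually not worth the bookkeeping.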
Spell Correction and Autocomplete
- Spell correction: Edit distance (Levenshtein) against indexed terms. Build a finite state transducer (FST) for fast fuzzy lookup. Return did-you-mean for queries with 0 or few results.
- Autocomplete: Prefix trie or edge n-gram tokenizer. Store popular queries in Redis sorted set by frequency. Return top-5 completions as user types.
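The did-you-mean core is just edit distance against the term dictionary; a brute-force sketch (an FST makes the same lookup fast enough for millions of terms):

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def suggest(query, vocabulary, max_dist=2):
    """Did-you-mean: indexed terms within max_dist edits of the query."""
    return sorted(t for t in vocabulary if edit_distance(query, t) <= max_dist)
```

In practice candidates are also ranked by term frequency, so a common word beats an obscure one at equal edit distance.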
Scale Numbers
- 10B documents x 1KB avg = 10TB raw data
- Inverted index is approximately 30-50% of raw = 3-5TB
- 100 shards, 3 replicas = 300 shard instances
- 50K QPS with scatter-gather fans every query out to all 100 shards, so each shard sees 50K shard-queries/s; spread across 3 copies per shard, each instance serves ~17K/s
- Segment merge I/O: design for SSDs, tune merge policy
Interview Tips
- Start with the inverted index: it is the core data structure
- Explain the trade-off between index freshness and merge overhead
- Discuss BM25 vs TF-IDF; mention LTR as the production approach
- Know scatter-gather query execution and why top-K merging is tricky
- Mention Elasticsearch, Solr, or Typesense as real implementations