System Design Interview: Design a Distributed Search Engine

A distributed search engine indexes documents and serves queries with sub-100ms latency at massive scale. Elasticsearch powers search at Wikipedia, GitHub, Uber, and Airbnb. This guide covers how to build such a system from scratch: the inverted index, distributed query execution, relevance scoring, and operational concerns.

Requirements

Functional: Index documents (text, structured fields). Full-text search with ranking by relevance. Boolean queries (AND, OR, NOT). Faceted search (filter by category, price range). Near-real-time indexing (<1 second from write to searchable).

Non-functional: 1 billion documents, 10,000 QPS search, p99 < 100ms, horizontal scalability, fault tolerance.

Core Data Structure: Inverted Index

An inverted index maps each term (word) to a list of document IDs that contain it — the inverse of the forward mapping from a document to the list of words it contains.

Term        → Posting List (doc_id, position, frequency)
"python"    → [(doc:1, pos:5, freq:3), (doc:4, pos:1, freq:1), (doc:9, pos:12, freq:2)]
"interview" → [(doc:1, pos:8, freq:1), (doc:3, pos:2, freq:4)]
"design"    → [(doc:2, pos:1, freq:2), (doc:3, pos:7, freq:1), (doc:9, pos:3, freq:5)]

Query “python AND interview” = intersection of posting lists for “python” and “interview”. Posting lists are stored sorted by doc_id, enabling efficient merge with two pointers in O(|posting1| + |posting2|). Position data enables phrase queries (“system design” as adjacent words).
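
The two-pointer merge can be sketched in a few lines. This is a minimal illustration: the posting tuples mirror the example table above (doc_id plus a position list), not Lucene's compressed on-disk format.

```python
# Intersect two doc_id-sorted posting lists with two pointers: O(|p1| + |p2|).
# Each posting is (doc_id, positions) — an illustrative layout, not Lucene's.

def intersect(p1, p2):
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i][0] == p2[j][0]:
            out.append(p1[i][0])     # doc contains both terms
            i += 1
            j += 1
        elif p1[i][0] < p2[j][0]:
            i += 1                   # advance the pointer behind
        else:
            j += 1
    return out

python_postings    = [(1, [5]), (4, [1]), (9, [12])]
interview_postings = [(1, [8]), (3, [2])]
print(intersect(python_postings, interview_postings))  # → [1]
```

Because both lists are sorted by doc_id, neither pointer ever moves backward, which is what keeps the merge linear rather than quadratic.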

Indexing Pipeline

Document → Tokenizer → Analyzer (lowercase, stemming, stopwords) → Token Stream
                                                                         ↓
                                                               Inverted Index Update
                                                                         ↓
                                                             Segment File (immutable)

Text analysis steps:

  1. Tokenization: Split text into tokens (“System Design” → [“System”, “Design”])
  2. Lowercasing: “Python” → “python”
  3. Stop word removal: Remove common words (“the”, “is”, “at”) that add noise
  4. Stemming/Lemmatization: “running” → “run”, “designs” → “design”. Enables matching “design” queries to “designs” in documents.
  5. Synonym expansion: “car” → also index “automobile”
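
The steps above can be sketched as a toy analyzer. The suffix-stripping "stemmer" here is a crude stand-in for a real Porter/Snowball stemmer, and the stop-word list is abbreviated; synonym expansion is omitted for brevity.

```python
# Toy analysis chain: tokenize → lowercase → stop-word removal → crude stemming.
STOP_WORDS = {"the", "is", "at", "a", "an"}

def analyze(text):
    tokens = text.split()                                 # 1. tokenization
    tokens = [t.lower() for t in tokens]                  # 2. lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. stop words
    stemmed = []
    for t in tokens:                                      # 4. naive stemming
        for suffix in ("ning", "ing", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(analyze("The System Designs"))  # → ['system', 'design']
print(analyze("running"))            # → ['run']
```

The same chain must run on both documents at index time and queries at search time; otherwise "Running" in a query would never match "run" in the index.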

Lucene Segment Architecture

Lucene (the engine under Elasticsearch) uses an immutable segment design. New documents go into an in-memory buffer. Periodically (every 1 second in Elasticsearch), the buffer is flushed to a new immutable segment file on disk. This makes each write O(1) — just append to the buffer.

Segment merging: Many small segments accumulate over time. Background threads merge small segments into larger ones. Merging improves query performance (fewer segments to search) and reclaims space from deleted documents. Deleted documents are tracked with a tombstone bitset — physically removed during merges.

Query execution: Each query searches all segments in parallel and merges results. 5 segments × 200ms each → parallelized → effectively 200ms total (not 1000ms serial).
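
The per-segment scatter-gather can be sketched as follows. The per-segment hit lists are fabricated, and the segments are scanned sequentially here standing in for the parallel threads a real engine would use.

```python
# Each immutable segment is searched independently; the coordinatorish step
# merges per-segment hits into a single global top-K with a heap.
import heapq

# Hypothetical per-segment results: (score, doc_id), best first.
segment_hits = [
    [(0.9, 12), (0.4, 7)],
    [(0.8, 31)],
    [(0.7, 3), (0.6, 19), (0.2, 44)],
]

def merge_top_k(per_segment, k):
    # nlargest over the concatenation of all segment hit lists; with sorted
    # inputs, heapq.merge + islice would avoid materializing everything.
    return heapq.nlargest(k, (hit for seg in per_segment for hit in seg))

print(merge_top_k(segment_hits, 3))  # → [(0.9, 12), (0.8, 31), (0.7, 3)]
```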

Distributed Architecture

Client → Load Balancer → Coordinator Node
                              ↓
              [shard_0 on Node1]  [shard_1 on Node2]  [shard_2 on Node3]
                              ↓
              Coordinator merges results and returns top-K to client

Sharding: The document corpus is split into shards. Shard assignment: shard = hash(document_id) % num_shards. Each shard is a complete Lucene index. A query fans out to all shards in parallel, each returns its top-K results, the coordinator merges and re-ranks.
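
A minimal sketch of routing plus the coordinator-side merge, assuming a toy Shard class with a top_k method. Scores are fabricated, and crc32 stands in for a production routing hash (Python's built-in hash() is salted per process for strings, so it cannot be used for stable routing).

```python
# Hash routing + fan-out/merge, the scatter-gather pattern in miniature.
import heapq
import zlib

NUM_SHARDS = 3

def route(doc_id):
    # Stable hash so the same document always lands on the same shard.
    return zlib.crc32(doc_id.encode()) % NUM_SHARDS

class Shard:
    """One shard = an independent mini-index: term → scored postings."""
    def __init__(self):
        self.postings = {}
    def add(self, term, doc_id, score):
        self.postings.setdefault(term, []).append((score, doc_id))
    def top_k(self, term, k):
        return heapq.nlargest(k, self.postings.get(term, []))

shards = [Shard() for _ in range(NUM_SHARDS)]
for doc_id, score in [("doc1", 0.9), ("doc2", 0.4), ("doc3", 0.7)]:
    shards[route(doc_id)].add("python", doc_id, score)

# Coordinator: fan out to every shard, merge local top-k into global top-k.
local = [s.top_k("python", 2) for s in shards]  # parallel in practice
print(heapq.nlargest(2, (h for hits in local for h in hits)))
# → [(0.9, 'doc1'), (0.7, 'doc3')]
```

Note that each shard returns its own top-K, not its full result set; the coordinator only ever merges num_shards × K candidates.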

Replication: Each shard has a primary + N replicas. Writes go to the primary and replicate. Reads can be served by any replica (load balancing). Replication factor = 2 (3 copies total) is standard for fault tolerance — survives loss of any one node.

Relevance Scoring: BM25

BM25 (Best Match 25) is the standard relevance ranking algorithm, used by Elasticsearch by default. It extends TF-IDF with document length normalization:

BM25(q, d) = Σ IDF(t) * (TF(t,d) * (k1+1)) / (TF(t,d) + k1*(1-b+b*|d|/avgdl))

Where:
  IDF(t)  = log((N - df(t) + 0.5) / (df(t) + 0.5) + 1)   # rare terms score higher
  TF(t,d) = frequency of term t in document d
  |d|     = length of document d in tokens
  avgdl   = average document length in corpus
  k1=1.2, b=0.75  # tuning parameters

Key properties: rare terms (low document frequency) score higher (IDF). High term frequency helps but with diminishing returns (the denominator grows). Short documents score higher for the same term frequency (length normalization with b).
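
The formula transcribes directly into code. The corpus statistics below (1,000 documents, invented document frequencies) are fabricated for illustration; a real engine precomputes them from the index.

```python
# Direct transcription of the BM25 formula above.
import math

def bm25(query_terms, doc_tf, doc_len, avgdl, N, df, k1=1.2, b=0.75):
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue  # term absent from this document
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        norm = k1 * (1 - b + b * doc_len / avgdl)   # length normalization
        score += idf * (tf * (k1 + 1)) / (tf + norm)
    return score

# Toy stats: 1000 docs; "python" appears in 50 of them, "the" in 900.
df = {"python": 50, "the": 900}
doc = {"python": 3, "the": 10}
print(round(bm25(["python", "the"], doc, doc_len=120, avgdl=100, N=1000, df=df), 3))
```

Despite appearing 10 times in the document, "the" contributes almost nothing: its near-ubiquity drives IDF toward zero, while the rare "python" dominates the score.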

Near-Real-Time Indexing

Elasticsearch’s default refresh_interval=1s: every second, the in-memory buffer is written out as a new segment and becomes searchable. The write path appends each document to the translog (an on-disk write-ahead log, for durability) and to the in-memory indexing buffer. The read path searches all refreshed segments; documents still in the buffer are invisible until the next refresh. On crash, the translog is replayed to recover documents that never reached a committed segment. A full Lucene commit (fsync of segment files) happens periodically or on explicit flush; that is the expensive operation, not the 1-second refresh.
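
A toy model of this write path, assuming the simplest possible structures: the translog is just a Python list (in Elasticsearch it is an fsync'd on-disk file) and "segments" are tuples of raw documents.

```python
# Minimal NRT sketch: writes go to a WAL + buffer; refresh() turns the
# buffer into an immutable "segment" that searches can finally see.
class NRTIndex:
    def __init__(self):
        self.translog = []   # WAL: replayed after a crash
        self.buffer = []     # not yet searchable
        self.segments = []   # immutable, searchable

    def index(self, doc):
        self.translog.append(doc)   # fsync'd to disk in a real engine
        self.buffer.append(doc)

    def refresh(self):
        # Elasticsearch runs this every refresh_interval (default 1s).
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, term):
        # Toy substring search over segments only — buffered docs are invisible.
        return [d for seg in self.segments for d in seg if term in d]

idx = NRTIndex()
idx.index("python interview")
print(idx.search("python"))  # → []  (written, durable, but not yet refreshed)
idx.refresh()
print(idx.search("python"))  # → ['python interview']
```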

Query Types

  • Term query: Exact match on a field. “status:active” → looks up posting list for “active” in the status field.
  • Match query: Full-text. “python interview” → analyzes input, generates term queries, OR/AND of posting lists.
  • Phrase query: “system design” as adjacent words. Uses position data in posting lists.
  • Range query: “price: [10 TO 100]”. Uses a numeric index (doc values) not the inverted index.
  • Fuzzy query: “phyton” matches “python” with edit distance ≤ 2. Uses BK-tree or automaton.
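
Phrase queries show why posting lists carry positions. A minimal adjacency check, with invented position lists for two terms inside one candidate document:

```python
# "system design" matches a document only if some occurrence of "design"
# sits at position(pos of "system") + 1.
def phrase_match(positions_a, positions_b):
    """True if some occurrence of term B directly follows term A."""
    b = set(positions_b)
    return any(p + 1 in b for p in positions_a)

# doc contains "system" at positions 2 and 7, "design" at 8 and 20.
print(phrase_match([2, 7], [8, 20]))  # → True  (7 followed by 8)
print(phrase_match([2, 7], [20]))     # → False
```

Longer phrases generalize the same idea: intersect the posting lists first, then verify positions term by term with increasing offsets.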

Interview Tips

  • Explain the inverted index clearly — it’s the core data structure.
  • Mention the sharding fan-out pattern: query → all shards in parallel → coordinator merges.
  • BM25 vs TF-IDF: BM25 consistently outperforms TF-IDF in practice. Know the length normalization intuition.
  • Near-real-time vs real-time: explain the 1-second refresh window and when to use force refresh.
  • Faceted search: use aggregations (bucket by field value, compute counts) run in parallel with the search query on the same shards.
