What Is a Full-Text Search Engine?
A full-text search engine indexes large collections of text documents and retrieves the most relevant ones for a given query in milliseconds. The core challenge is that naive linear scan (grep every document) doesn’t scale past thousands of documents. The solution is to pre-build an inverted index at write time so reads are fast lookups rather than full scans. Systems like Lucene, Elasticsearch, Solr, and Typesense are built on this foundation.
Inverted Index Structure
An inverted index maps each unique term to a posting list: the list of documents containing that term. Each entry in the posting list typically stores the document ID, the term frequency within that document, and optionally the positions (token or character offsets) where the term appears. For example:
"database" → [(doc3, freq=4, positions=[12,87,204,391]), (doc7, freq=1, positions=[55]), ...]
At query time, looking up a term is a direct map lookup (O(1) or O(log n)), returning a posting list that can be intersected, unioned, or ranked. The inverted index is the reason search engines return results in milliseconds over billions of documents.
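As a sketch, this structure can be built in a few lines of Python (the whitespace tokenizer and all names here are illustrative, not any engine's API):

```python
from collections import defaultdict

def build_index(docs):
    # term -> posting list of (doc_id, term frequency, token positions)
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, plist in positions.items():
            index[term].append((doc_id, len(plist), plist))
    return index

docs = [
    "the database stores rows",
    "a graph database stores edges",
    "cache invalidation is hard",
]
index = build_index(docs)
print(index["database"])  # [(0, 1, [1]), (1, 1, [2])]
```

A query for "database" is now a single dictionary lookup rather than a scan over every document.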
Tokenization and Normalization
Tokenization splits raw text into terms (tokens). The simplest tokenizer splits on whitespace and punctuation. More sophisticated analyzers handle contractions (don’t → do, n’t), hyphenated words, URLs, and language-specific rules. The output is a stream of tokens that will be stored in the inverted index.
Normalization ensures that Database, DATABASE, and database all map to the same term. This includes lowercasing and Unicode normalization (NFC/NFD). Without normalization, queries would be case-sensitive and miss obvious matches. Normalization is applied identically at index time and query time — both the stored terms and the query terms go through the same analysis pipeline.
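A minimal analysis pipeline, applied identically at index and query time, might look like this (the regex tokenizer is a toy stand-in for a real analyzer):

```python
import re
import unicodedata

def analyze(text):
    # Same steps at index time and query time:
    # Unicode NFC normalization, lowercasing, split on non-word runs.
    text = unicodedata.normalize("NFC", text)
    return [t for t in re.split(r"[^\w]+", text.lower()) if t]

print(analyze("Database, DATABASE, and database!"))
# ['database', 'database', 'and', 'database']
```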
Stemming and Stop Words
Stemming reduces inflected or derived words to a common base form. The Porter stemmer (1980) and its successor Snowball handle English: running and runs both reduce to run. This increases recall — a query for "run" matches documents containing "running." The tradeoff is occasional over-stemming (university and universe both stem to univers) and under-stemming (irregular forms like ran are left untouched). Lemmatization is more linguistically accurate (it uses dictionary lookups, so ran maps to run) but slower.
Stop words are extremely common words (a, the, is, of) that appear in nearly every document and add little discriminative signal. Removing them at index time reduces index size and speeds up posting list intersections. However, some use cases require stop words — phrase queries like "to be or not to be" break without them — so modern engines often keep stop words but downweight them rather than remove them entirely.
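A toy illustration of both steps (the suffix stripper below is NOT the Porter algorithm, just a sketch of the idea, and the stop list is abbreviated):

```python
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "or", "not"}

def strip_suffix(term):
    # Crude suffix stripping: real stemmers like Porter also handle
    # doubled consonants ("runn" -> "run") and measure conditions.
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def filter_terms(tokens, keep_stop_words=False):
    out = []
    for t in tokens:
        if not keep_stop_words and t in STOP_WORDS:
            continue  # dropped at index time to shrink posting lists
        out.append(strip_suffix(t))
    return out

print(filter_terms(["the", "runner", "is", "running"]))  # ['runner', 'runn']
```

The keep_stop_words flag mirrors the design choice in the paragraph above: engines that support phrase queries keep stop words in the index and downweight them at ranking time instead.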
TF-IDF Scoring
TF-IDF (Term Frequency–Inverse Document Frequency) is the classic relevance scoring formula. TF measures how often the query term appears in a document (more occurrences = more relevant). IDF measures how rare the term is across the entire corpus — rare terms are more discriminative. A term appearing in every document gets IDF ≈ 0 (it’s useless for ranking); a term appearing in only one document gets high IDF.
The combined TF-IDF score for a term in a document: score = TF(t,d) × log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t. The final score for a multi-term query is the sum of TF-IDF scores across all query terms. Documents are ranked by descending score.
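The formula translates directly to code. A self-contained sketch over a toy corpus (all names illustrative):

```python
import math
from collections import Counter

docs = [
    "the database stores rows".split(),
    "a graph database stores edges".split(),
    "cache invalidation is hard".split(),
]
N = len(docs)
# df(t): number of documents containing term t.
df = Counter(t for d in docs for t in set(d))

def tf_idf(query, doc):
    # score = sum over query terms of TF(t,d) * log(N / df(t))
    tf = Counter(doc)
    return sum(tf[t] * math.log(N / df[t]) for t in query if df.get(t))

for i, d in enumerate(docs):
    print(i, round(tf_idf(["database", "edges"], d), 3))
```

Document 1 wins: it contains both terms, and "edges" appears in only one document, so its IDF weight dominates.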
BM25: The Current Standard
Okapi BM25 (Best Match 25) improved on TF-IDF in two important ways. First, it adds TF saturation: under raw TF-IDF, a document containing a term 100 times scores 10x higher than one containing it 10 times. BM25 instead applies a saturation curve controlled by the parameter k1 (typically 1.2–2.0), so additional occurrences matter less and less. Second, it adds document length normalization via the parameter b (typically 0.75): a term appearing 5 times in a short document is a stronger signal than the same 5 occurrences in a long document.
BM25 formula: score(t,d) = IDF(t) × (TF(t,d) × (k1+1)) / (TF(t,d) + k1 × (1 - b + b × |d|/avgdl)). BM25 is now the default ranking function in Elasticsearch (since version 5), Lucene, and most modern search engines. It consistently outperforms plain TF-IDF on standard IR benchmarks.
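A direct implementation of the formula over a toy corpus (the smoothed IDF variant below is one common choice, used by Lucene; the classic Robertson IDF can go negative for very common terms):

```python
import math
from collections import Counter

docs = [
    "the database stores rows".split(),
    "a graph database stores edges".split(),
    "cache invalidation is hard".split(),
]

def bm25(query, doc, k1=1.2, b=0.75):
    # score(t,d) = IDF(t) * TF(t,d)*(k1+1) / (TF(t,d) + k1*(1 - b + b*|d|/avgdl))
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for t in query:
        dft = sum(1 for d in docs if t in d)
        if dft == 0:
            continue
        idf = math.log(1 + (N - dft + 0.5) / (dft + 0.5))  # smoothed, non-negative
        tf_part = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * tf_part
    return score

print(round(bm25(["database"], docs[1]), 3))
```

Raising k1 makes repeated occurrences count for more before saturating; setting b to 0 disables length normalization entirely.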
Phrase Queries and Positional Index
A phrase query — "machine learning" — requires that both terms appear adjacent in the correct order, not just anywhere in the document. This requires storing positions in the posting list: the character or token offset of each term occurrence. At query time, after finding documents containing both "machine" and "learning," the engine checks whether any occurrence of "machine" is immediately followed by "learning."
Proximity queries generalize this: find documents where two terms appear within N tokens of each other. Positional data significantly increases index size (often 3–5x vs. no positions) but is required for phrase search. Lucene stores positions as delta-encoded integers within the posting list, keeping the overhead manageable.
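Given positional posting lists, the adjacency check reduces to a set lookup. A sketch:

```python
def phrase_match(positions_a, positions_b):
    # True if some occurrence of term A is immediately followed by term B.
    # Inputs are token offsets of each term within one document.
    b_set = set(positions_b)
    return any(p + 1 in b_set for p in positions_a)

# "machine" at token offsets [4, 19], "learning" at [5, 40]: adjacent at 4/5.
print(phrase_match([4, 19], [5, 40]))  # True
print(phrase_match([4], [6]))          # False
```

The proximity generalization replaces `p + 1 in b_set` with a window check (any occurrence of B within N tokens of an occurrence of A).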
Inverted Index Compression
Posting lists for common terms can contain millions of document IDs. Without compression, the index would be enormous. Two key techniques keep it manageable:
Delta encoding of doc IDs: posting lists are sorted by doc ID. Instead of storing raw doc IDs, store differences between consecutive IDs. Since IDs increase monotonically, deltas are small positive integers. A list like [3, 17, 32, 33, 50] becomes [3, 14, 15, 1, 17]. Small integers compress far better than large ones.
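Both directions are a few lines each:

```python
def delta_encode(doc_ids):
    # Store the first ID, then the gap between each consecutive pair.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(deltas):
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

print(delta_encode([3, 17, 32, 33, 50]))  # [3, 14, 15, 1, 17]
```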
Variable-byte encoding: use fewer bytes for small integers and more for large ones. Numbers < 128 fit in 1 byte; numbers < 16384 fit in 2 bytes. This achieves good compression without the decoding overhead of bit-packing schemes. Lucene also uses PForDelta (PFOR) and Simple9/Simple16 for block-level integer compression on large posting lists.
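A sketch of the byte-level scheme (this variant packs 7 payload bits per byte and marks the final byte with the high bit; real implementations differ in which byte carries the flag):

```python
def vbyte_encode(n):
    # Emit 7 bits per byte, low-order first; high bit marks the last byte.
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    n, shift = 0, 0
    for byte in data:
        if byte & 0x80:                       # terminator byte
            return n | ((byte & 0x7F) << shift)
        n |= byte << shift
        shift += 7

print(len(vbyte_encode(100)), len(vbyte_encode(20000)))  # 1 3
```

Applied after delta encoding, most gaps in a dense posting list fit in a single byte.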
Segment-Based Architecture (Lucene)
Lucene organizes its index as a collection of immutable segments. When documents are indexed, they first go into a small in-memory buffer. Periodically (or when the buffer is full), the buffer is flushed to disk as a new segment — a complete, self-contained mini-index. Each segment has its own inverted index, stored fields, doc values, and term dictionary.
Segments are immutable: once written, they are never modified. Deletes are handled with a deletion bitset — a bitmap marking deleted doc IDs — applied at query time to filter out results. The trade-off: many small segments slow down queries (must search all segments and merge results). A background merge process continuously merges smaller segments into larger ones, following a tiered merge policy. Merging also physically removes deleted documents and rewrites the index compactly.
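The buffer/flush/delete lifecycle can be sketched in a few lines (a hypothetical toy, not Lucene's actual classes):

```python
class SegmentedIndex:
    def __init__(self, buffer_limit=2):
        self.buffer, self.segments, self.deleted = [], [], set()
        self.buffer_limit = buffer_limit
        self.next_id = 0

    def add(self, doc):
        self.buffer.append((self.next_id, doc))
        self.next_id += 1
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        # A flushed segment is immutable: stored as a tuple, never rewritten.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def delete(self, doc_id):
        # Deletes only mark the doc in a set; segment data is untouched.
        self.deleted.add(doc_id)

    def search(self, term):
        # Query every segment plus the live buffer; filter deletes at read time.
        return [doc_id
                for seg in self.segments + [tuple(self.buffer)]
                for doc_id, doc in seg
                if term in doc.split() and doc_id not in self.deleted]

idx = SegmentedIndex()
idx.add("rust is fast")
idx.add("go is fast")   # buffer full: flushed as an immutable segment
idx.delete(0)
print(idx.search("fast"))  # [1]
```

A merge step would concatenate segments while skipping IDs in the deleted set, which is exactly how merging reclaims the space of deleted documents.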
Distributed Search: Elasticsearch
Elasticsearch distributes a Lucene index across multiple nodes using shards. An index is divided into N primary shards (chosen at index creation and not changeable without reindexing). Each primary shard is a full Lucene index. Replicas are copies of primary shards kept on different nodes; writes go through the primary and are replicated, while replicas serve reads, providing both fault tolerance and read scaling.
A query uses scatter-gather: the coordinating node fans out the query to one copy of each shard, each shard returns its top-K results with scores, the coordinator merges all results globally, re-ranks, and returns the final top-K. Shard routing for document writes uses hash(doc_id) % num_shards by default, distributing documents evenly. Custom routing (e.g., by tenant ID) can co-locate related documents on the same shard for query efficiency. Relevance tuning options include field-level boosts (title matches outweigh body matches), function scores (boost by recency or popularity), and query-time synonym expansion.
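Both the routing rule and the coordinator's merge step are easy to sketch (the hash choice and names are illustrative; Elasticsearch actually hashes a routing value, which defaults to the document ID):

```python
import hashlib
import heapq

def route(doc_id, num_shards):
    # Deterministic hash routing: the same ID always lands on the same shard.
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_shards

def scatter_gather(shard_results, k):
    # Each shard returns its local top hits as (score, doc_id) pairs;
    # the coordinator merges the sorted lists and keeps the global top-k.
    streams = [sorted(r, reverse=True) for r in shard_results]
    return list(heapq.merge(*streams, reverse=True))[:k]

shards = [[(0.9, "d1"), (0.4, "d5")], [(0.8, "d2")], [(0.95, "d7")]]
print(scatter_gather(shards, 2))  # [(0.95, 'd7'), (0.9, 'd1')]
```

Note the routing function is why the shard count is fixed at creation: changing num_shards changes where every existing document hashes to.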