A search engine indexes documents and answers queries of the form “find documents containing these terms” in milliseconds, even across billions of documents. The core data structure is the inverted index: a mapping from each term to the list of documents containing it (the posting list). Lucene (the search library powering Elasticsearch, OpenSearch, and Solr) implements a highly optimized inverted index. Understanding search engine internals — indexing pipeline, posting list compression, ranking with TF-IDF and BM25, query parsing — is valuable for system design interviews involving text search at scale.
Inverted Index Construction
Building an inverted index:
- Tokenization: split text into tokens (words). “The quick brown fox” → [“the”, “quick”, “brown”, “fox”].
- Normalization: lowercase, remove punctuation, expand abbreviations.
- Stop word removal: drop high-frequency, low-information words (“the”, “a”, “is”) — they would appear in nearly every posting list, wasting space and query time.
- Stemming/lemmatization: reduce words to a root form (stemming: “running” → “run”; lemmatization maps to dictionary forms, e.g. “better” → “good”).
- Posting list construction: for each term, record a list of (document_id, term_frequency, positions) tuples, sorted by document_id for efficient set operations (intersection, union). Term frequency and position data support ranking and phrase queries.
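The analysis steps above can be sketched as a small pipeline. This is a toy stand-in: the stop list is tiny and the suffix stripper is not a real stemmer (production systems use Porter/Snowball stemmers or dictionary lemmatizers).

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "and"}  # tiny illustrative stop list

def stem(token: str) -> str:
    # Toy suffix stripper, not a real stemmer (Porter would map "running" -> "run")
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())          # tokenize + normalize
    return [stem(t) for t in tokens if t not in STOP_WORDS]  # stop words, then stem

print(analyze("The quick brown fox is running"))  # ['quick', 'brown', 'fox', 'runn']
```

Note the crude stem `runn`: this is exactly why real analyzers use rule-based or dictionary-backed stemmers rather than bare suffix stripping.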
# Inverted index structure example: term -> posting list of
# (doc_id, term_frequency, positions) tuples, sorted by doc_id
inverted_index = {
    "quick": [(1, 2, [3, 17]), (5, 1, [8])],
    "brown": [(1, 1, [4]), (3, 2, [1, 22])],
    "fox":   [(1, 1, [5]), (3, 1, [23])],
}
# Boolean AND query: "quick AND fox"
# Merge posting lists: docs containing both "quick" AND "fox"
# quick: [1, 5] fox: [1, 3]
# Merge (linear scan since both sorted): [1] → doc 1 matches
# Phrase query: "quick fox" (must be adjacent)
# From quick: doc 1, positions [3, 17]
# From fox: doc 1, positions [5]
# Check if any position of fox = position of quick + 1: 5 != 3+1=4 and 5 != 17+1=18
# No phrase match for "quick fox"
# Lucene segment structure:
# Each segment is an immutable inverted index shard
# New documents go to an in-memory buffer (IndexWriter)
# Periodically flushed as a new segment (segment_N)
# Segments are merged in the background to control count
# Deleted documents marked in .del file, filtered at query time
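The Boolean AND and phrase-query walkthroughs above can be made executable. A minimal sketch over (doc_id, tf, positions) posting lists (function names are illustrative, not Lucene's API):

```python
def intersect(p1, p2):
    """Linear merge of two doc-id-sorted posting lists (Boolean AND)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i][0] == p2[j][0]:
            out.append((p1[i], p2[j]))  # same doc: keep both entries
            i += 1
            j += 1
        elif p1[i][0] < p2[j][0]:
            i += 1
        else:
            j += 1
    return out

def phrase_match(p1, p2):
    """Docs where some term-2 position immediately follows a term-1 position."""
    return [a[0] for a, b in intersect(p1, p2)
            if any(pos + 1 in set(b[2]) for pos in a[2])]

quick = [(1, 2, [3, 17]), (5, 1, [8])]
fox = [(1, 1, [5]), (3, 1, [23])]
print([a[0] for a, _ in intersect(quick, fox)])  # AND "quick fox": [1]
print(phrase_match(quick, fox))                  # phrase "quick fox": []
```

The merge is O(len(p1) + len(p2)) precisely because both lists are kept sorted by doc_id.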
Ranking: TF-IDF and BM25
Once matching documents are found, they must be ranked by relevance. TF-IDF (Term Frequency-Inverse Document Frequency): TF measures how often the query term appears in the document (more occurrences suggest more relevance); IDF = log(total_docs / docs_containing_term) weights rare terms higher, since they are more informative. A document matching a rare term therefore scores higher than one matching a common term. Score = sum(TF * IDF) over the query terms. BM25 (Best Match 25) improves on TF-IDF with document length normalization (a long document matching a term many times is less relevant than a short document matching it the same number of times) and tunable TF saturation (the relevance gain from each additional term occurrence diminishes). BM25 is the standard baseline ranking function in information retrieval and the default similarity in Lucene and Elasticsearch.
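The per-term BM25 contribution can be written out directly. A minimal sketch: the defaults k1=1.2 and b=0.75 and the smoothed IDF form match Lucene's BM25Similarity, but the function shape here is simplified for illustration.

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document.

    tf: term frequency in the document; doc_freq: number of docs containing
    the term. k1 controls TF saturation, b controls length normalization.
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# TF saturation: each extra occurrence adds less than the one before
for tf in (1, 2, 10, 100):
    print(tf, round(bm25_score(tf, 100, 100, 1_000_000, 1_000), 3))
```

Raising k1 delays saturation (more credit for repeated occurrences); b=0 disables length normalization entirely.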
Elasticsearch Architecture
Elasticsearch distributes an index across multiple shards (primary shards for writes, replica shards for reads and failover). Each shard is a Lucene index. A query fans out to all shards, each shard scores and returns its top-k results, and the coordinating node merges all shard results to find the global top-k. Indexing: documents are parsed, analyzed, and written to a translog (for crash recovery) and an in-memory buffer. Periodic refreshes (default: every 1 second) write the buffer out as a new Lucene segment, making documents searchable; a separate flush operation commits segments durably and trims the translog. Periodic merges reduce segment count to maintain query performance. Near-real-time search: documents become searchable within about 1 second of indexing (configurable). This makes Elasticsearch suitable for log analytics, e-commerce search, and full-text search over structured data.
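The coordinator-side merge of the scatter-gather query can be sketched in a few lines. The shard result shape here is simplified to (score, doc_id) pairs; real responses carry full hit metadata.

```python
import heapq

def merge_shard_results(shard_results, k):
    """Global top-k from per-shard top-k lists (scatter-gather merge).

    shard_results: list of [(score, doc_id), ...] lists, one per shard.
    Each shard only needs to return its local top-k for the global
    top-k to be correct.
    """
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits)  # sorted by score, descending

shard_a = [(9.1, "a3"), (7.4, "a1")]
shard_b = [(8.6, "b2"), (5.0, "b9")]
print(merge_shard_results([shard_a, shard_b], 3))
# → [(9.1, 'a3'), (8.6, 'b2'), (7.4, 'a1')]
```

This is why deep pagination is expensive: to serve global results 9990-10000, every shard must return its local top 10000.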
Key Interview Discussion Points
- Posting list compression: posting lists can be large (millions of doc IDs for common terms); delta encoding (store differences between consecutive doc IDs, which are small) + variable-length integer encoding (VByte, PForDelta) reduces posting list size by 5-10x
- Skip lists in posting lists: for large posting lists, skip pointers allow jumping ahead by hundreds of entries during intersection — when merging “the” (10M docs) AND “zebra” (100 docs), skip over large chunks of “the” quickly
- Faceted search: count of results per category (brand=Nike: 1500, brand=Adidas: 800) requires counting across all matching documents — implemented with doc values (columnar per-field storage) and aggregations in Elasticsearch
- Tokenizer choice: language-specific tokenizers (Japanese uses MeCab morphological analysis; Chinese uses character bigrams or jieba); n-gram tokenizers for partial matching (edge n-gram for autocomplete); phonetic tokenizers for fuzzy name matching
- Vector search integration: modern search combines lexical search (BM25 inverted index) with dense vector search (approximate nearest neighbor) for semantic understanding — Elasticsearch HNSW vector index, Vespa, Weaviate implement hybrid BM25 + vector search
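The delta + VByte scheme from the first bullet can be sketched as follows. This uses the classic terminator-bit VByte variant (high bit set marks the last byte of a number); Lucene's VInt uses the high bit as a continuation flag instead.

```python
def delta_vbyte_encode(doc_ids):
    """Delta-encode sorted doc IDs, then VByte-encode each gap.

    VByte: 7 payload bits per byte, least-significant group first;
    a set high bit marks the final byte of a number.
    """
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev          # gaps are small for dense posting lists
        prev = doc_id
        while gap >= 128:
            out.append(gap & 0x7F)   # low 7 bits, more bytes follow
            gap >>= 7
        out.append(gap | 0x80)       # final byte: high bit set
    return bytes(out)

def delta_vbyte_decode(data):
    doc_ids, cur, shift, prev = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:              # terminator byte: finish this gap
            prev += cur | ((byte & 0x7F) << shift)
            doc_ids.append(prev)
            cur, shift = 0, 0
        else:
            cur |= byte << shift
            shift += 7
    return doc_ids

ids = [824, 829, 215406]
encoded = delta_vbyte_encode(ids)
print(len(encoded))                        # 6 bytes vs 12 for three 32-bit ints
print(delta_vbyte_decode(encoded) == ids)  # True
```

The gap 829 - 824 = 5 fits in one byte; without delta encoding, 829 alone would need two.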
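The skip-pointer idea in the second bullet can be approximated with binary search when the long posting list is an in-memory array; real implementations embed forward pointers in the compressed list, so `bisect` here is only a stand-in for following skips.

```python
import bisect

def intersect_with_skips(small, large):
    """AND of a short and a very long doc-id-sorted posting list.

    For each doc id in the short list, jump ahead in the long list via
    binary search instead of scanning entry by entry: roughly
    O(len(small) * log(len(large))) instead of O(len(large)).
    """
    out, lo = [], 0
    for doc_id in small:
        lo = bisect.bisect_left(large, doc_id, lo)  # resume from last position
        if lo < len(large) and large[lo] == doc_id:
            out.append(doc_id)
    return out

zebra = [7, 300, 9001]               # "zebra": rare term
the = list(range(10_000))            # "the": pretend every doc contains it
print(intersect_with_skips(zebra, the))  # [7, 300, 9001]
```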
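Faceted search over doc values, as in the third bullet, amounts to a column lookup per matching document plus a counter. The field name and values here are illustrative, not a real Elasticsearch API.

```python
from collections import Counter

# Columnar doc-values-style storage: one field value per doc id
brand_doc_values = {1: "Nike", 2: "Adidas", 3: "Nike", 4: "Nike", 5: "Puma"}

# Doc ids that matched the query; facet counts are computed over these
matching_docs = [1, 3, 4, 5]
facets = Counter(brand_doc_values[d] for d in matching_docs)
print(facets)  # Counter({'Nike': 3, 'Puma': 1})
```

The columnar layout matters: counting touches only the `brand` column for matching docs, never the inverted index or stored documents.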