Question 1

How does an inverted index make full-text search fast?

Accepted Answer

An inverted index maps each unique term to the list of documents that contain it (the term's posting list). Without one, a search for "running shoes" has to scan every document, which is O(N) where N is the total number of documents. With an inverted index, the query is first analyzed into the terms run and shoe, and then:

(1) Look up run in the term dictionary (O(log T), where T is the number of unique terms) and fetch its posting list, the sorted list of IDs of documents containing run.
(2) Look up shoe and fetch its posting list the same way.
(3) Intersect the two sorted lists with a two-pointer merge (O(L1 + L2), where L1 and L2 are the list lengths).

For a corpus of 10 million documents where run appears in 100K documents and shoe appears in 50K, the intersection touches at most 150K posting entries instead of 10M documents. That is about 67 times fewer items, and each step is a cheap integer comparison rather than a scan of a document's text.

Construction: text is analyzed (tokenized, lowercased, stemmed), then each (term, document ID) pair is added to the index. The posting list also stores term frequency and positions, which are used for relevance scoring and phrase queries. Lucene (the library underlying Elasticsearch) uses a Finite State Transducer (FST) in its term dictionary, which supports prefix and fuzzy matching in addition to exact lookups.
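The lookup-and-intersect steps in this answer can be sketched in a few lines of Python. This is a toy illustration, not how Lucene is implemented: the analyzer is a crude suffix stripper standing in for a real stemmer, the term dictionary is a plain dict rather than an FST, and the names analyze, build_index, intersect, and search are made up for the example.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer (assumption): lowercase, split on whitespace, strip a few suffixes."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        for suffix in ("ning", "ing", "s"):
            # Only strip when at least three characters remain.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        if word:
            tokens.append(word)
    return tokens


def build_index(docs):
    """Map each term to a sorted posting list of document IDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # ascending IDs keep every posting list sorted
        for term in set(analyze(docs[doc_id])):
            index[term].append(doc_id)
    return index


def intersect(a, b):
    """Two-pointer merge of two sorted posting lists, O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index, query):
    """AND-query: documents that contain every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return []
    result = index.get(terms[0], [])
    for term in terms[1:]:
        result = intersect(result, index.get(term, []))
    return result


# Hypothetical sample corpus.
DOCS = {
    1: "Running shoes for trail runs",
    2: "Leather shoes",
    3: "She runs in new running shoes",
    4: "Running tips",
}
INDEX = build_index(DOCS)
print(search(INDEX, "running shoes"))  # documents containing both run and shoe
```

Because the same analyzer runs at index time and at query time, "running" in the query matches "runs" in a document: both reduce to the term run.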

Question 2

How does Elasticsearch score document relevance with BM25?

Accepted Answer

BM25 (Best Matching 25) scores each document for each query term based on three factors:

(1) Term Frequency (TF) -- how often the query term appears in the document. More occurrences indicate higher relevance, but with diminishing returns (the 10th occurrence adds less score than the 1st). The k1 parameter (default 1.2) controls the saturation curve.
(2) Inverse Document Frequency (IDF) -- how rare the term is across all documents. Common terms (the, is) are less informative; rare terms (kubernetes, elasticsearch) are more distinctive. IDF = log(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is the number of documents containing the term.
(3) Field length normalization -- for the same TF, a field shorter than the average field length scores higher than a longer one. A 5-word title containing elasticsearch is more relevant than a 10000-word article mentioning it once. The b parameter (default 0.75) controls normalization strength; b = 0 disables it.

The score for a multi-term query is the sum of the per-term scores. In practice, you boost important fields (title weight 3x, body weight 1x) and combine text scores with business signals (popularity, recency) using function_score queries to tune search quality.
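To make the three factors concrete, here is a minimal sketch of the textbook BM25 per-term score, using the IDF formula and the default k1 and b values quoted in this answer. The function and parameter names are chosen for illustration, and Lucene's actual implementation may differ in details, so treat the numbers as illustrative rather than as what Elasticsearch will return.

```python
import math


def idf(n_docs, doc_freq):
    """IDF as given in the answer: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term in one document field.

    tf          -- occurrences of the term in the field
    doc_len     -- length of the field in tokens
    avg_doc_len -- average field length across the corpus
    n_docs      -- number of documents (N)
    doc_freq    -- number of documents containing the term (n)
    """
    # Length normalization: greater than 1 for longer-than-average fields.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    # Saturating TF component: grows with tf but flattens out.
    tf_part = (tf * (k1 + 1)) / (tf + k1 * norm)
    return idf(n_docs, doc_freq) * tf_part


def bm25_score(term_stats, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Sum per-term scores; term_stats is a list of (tf, doc_freq) pairs."""
    return sum(
        bm25_term_score(tf, doc_len, avg_doc_len, n_docs, df, k1, b)
        for tf, df in term_stats
    )


# Hypothetical corpus statistics for a quick comparison.
short_title = bm25_term_score(tf=1, doc_len=5, avg_doc_len=500, n_docs=1_000_000, doc_freq=1_000)
long_article = bm25_term_score(tf=1, doc_len=10_000, avg_doc_len=500, n_docs=1_000_000, doc_freq=1_000)
print(short_title > long_article)
```

Setting b=0 in this sketch makes the two scores equal, which is a quick way to see that length normalization alone accounts for the difference.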

Question 3

How do you size Elasticsearch shards for optimal performance?

Accepted Answer

Each Elasticsearch shard is a complete Lucene index with its own file descriptors, memory overhead, and thread pool usage. Shard sizing guidelines:

(1) Target 10-50 GB per shard. Smaller shards create excessive overhead (thousands of tiny shards waste cluster resources). Larger shards make rebalancing slow and lengthen recovery after a node failure.
(2) Number of primary shards = estimated_index_size / target_shard_size, rounded up. For a 200GB index with a 25GB target: 8 primary shards.
(3) Replicas: 1 replica per primary shard is standard. It doubles the shard count but provides fault tolerance and can roughly double read throughput, since searches can be served by either copy.
(4) For time-series data (logs), use index lifecycle management: create daily indexes, each with a shard count appropriate for the daily volume, and move indexes to cheaper storage as they age.
(5) Plan the primary shard count up front. It is fixed when the index is created; changing it later means reindexing into a new index, or using the split or shrink APIs, which require the index to be blocked for writes. If you expect 5x data growth, size shards for the future volume or plan to reindex.

Keep the number of shards per data node well under 1,000 (the default cluster.max_shards_per_node limit) to avoid memory pressure.
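The arithmetic behind these guidelines is simple enough to capture in a small helper. The sketch below is a back-of-the-envelope planner, not an Elasticsearch API: the function names are invented for the example, and the 25 GB default is just the target used in the worked example in this answer.

```python
import math


def primary_shard_count(index_size_gb, target_shard_gb=25):
    """Primaries needed so that no shard exceeds the target size (rounds up)."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def total_shard_count(primaries, replicas=1):
    """Every primary has `replicas` extra copies."""
    return primaries * (1 + replicas)


def plan(index_size_gb, growth_factor=1.0, target_shard_gb=25, replicas=1):
    """Size for the expected future volume, since primaries are fixed at creation."""
    future_gb = index_size_gb * growth_factor
    primaries = primary_shard_count(future_gb, target_shard_gb)
    return {
        "primaries": primaries,
        "total_shards": total_shard_count(primaries, replicas),
        # Shard size on day one, before any growth has happened.
        "initial_shard_gb": index_size_gb / primaries,
    }


print(plan(200))                   # the 200 GB / 25 GB example from the answer
print(plan(200, growth_factor=5))  # sizing for 5x growth up front
```

Note the trade-off the second call exposes: sizing for 5x growth gives 40 primaries that each start at only 5 GB, below the 10-50 GB target, which is why reindexing later is sometimes the better plan.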

Question 4

When should you use Elasticsearch versus a relational database for search?

Accepted Answer

Use Elasticsearch when:

(1) Full-text search is a core feature -- product search, article search, log search. Elasticsearch analyzers (stemming, synonyms, fuzzy matching) usually provide much better search quality than SQL LIKE, and more tuning options than PostgreSQL full-text search.
(2) You need faceted search (filter by category, price range, brand while showing counts per facet). Elasticsearch aggregations handle this efficiently.
(3) You need to search across multiple fields with different weights (title matches are more important than description matches).
(4) High search throughput is required (thousands of search queries per second).

Use a relational database when:

(1) Search is simple (exact match or prefix match on indexed columns). PostgreSQL with a B-tree or GIN index handles this well.
(2) You need transactional consistency. Elasticsearch search is near real-time rather than transactional: by default a newly indexed document becomes searchable only after the next index refresh, approximately 1 second later.
(3) The dataset is small (as a rough rule of thumb, under 1 million records). The overhead of running Elasticsearch is not justified.

Common architecture: use PostgreSQL as the source of truth (writes go here) and Elasticsearch as a read-optimized search index. Sync data from PostgreSQL to Elasticsearch via CDC (Debezium) or application-level dual writes. Dual writes are simpler to set up, but the two stores can drift apart if one of the writes fails.
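As an illustration of the weighted multi-field search and faceting mentioned above, the sketch below builds a search request body as a plain Python dict. The Query DSL clauses (bool, multi_match, term, range, terms aggregations) are standard, but the index fields title, description, brand, price, and category are hypothetical, and the helper name build_search_body is made up for the example. It assumes brand and category are mapped as keyword fields.

```python
import json


def build_search_body(text, brand=None, min_price=None, max_price=None):
    """Build an Elasticsearch search body with field boosts, filters, and facets."""
    filters = []
    if brand is not None:
        # Exact-match filter; assumes `brand` is a keyword field.
        filters.append({"term": {"brand": brand}})

    price_range = {}
    if min_price is not None:
        price_range["gte"] = min_price
    if max_price is not None:
        price_range["lte"] = max_price
    if price_range:
        filters.append({"range": {"price": price_range}})

    return {
        "query": {
            "bool": {
                # Scored clause: title matches count 3x as much as description matches.
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            "fields": ["title^3", "description"],
                        }
                    }
                ],
                # Filter clauses restrict results without affecting the score.
                "filter": filters,
            }
        },
        # Facet counts over the documents that match the query above.
        "aggs": {
            "by_category": {"terms": {"field": "category"}},
            "by_brand": {"terms": {"field": "brand"}},
        },
    }


body = build_search_body("running shoes", brand="acme", max_price=150)
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the JSON body of a search request. The equivalent in SQL would need a separate GROUP BY query per facet plus hand-written ranking logic, which is the gap this answer is pointing at.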

System Design: Elasticsearch and Full-Text Search — Inverted Index, Analyzers, Relevance Scoring, Sharding

Elasticsearch Architecture

The Inverted Index

Analyzers and Text Processing

Relevance Scoring with BM25

Scaling Elasticsearch