Full-text search is a core feature of most applications — from e-commerce product search to log analysis to knowledge base search. Elasticsearch is the dominant open-source search engine, powering search at Wikipedia, GitHub, Stack Overflow, and Netflix. This guide covers Elasticsearch architecture, inverted index internals, relevance scoring, and scaling strategies — essential knowledge for system design interviews involving search functionality.
Elasticsearch Architecture
Elasticsearch is a distributed search and analytics engine built on Apache Lucene. Core concepts: (1) Index — a collection of documents with similar characteristics (like a database table). An e-commerce app might have a products index and a reviews index. (2) Document — a JSON object stored in an index (like a database row). Each document has an _id and fields. (3) Shard — an index is divided into shards for horizontal scaling. Each shard is a complete Lucene index. A products index with 5 primary shards distributes data across 5 Lucene instances. (4) Replica — each primary shard has one or more replica shards for fault tolerance and read scaling. A setup of 5 primary shards with 1 replica = 10 total shards. Queries can be served by either primary or replica shards. Cluster topology: nodes can be configured as master-eligible (manage cluster state), data nodes (store shards and execute queries), coordinating nodes (route requests and merge results), or ingest nodes (pre-process documents). A production cluster typically has 3 master-eligible nodes and multiple data nodes.
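The shard arithmetic above (5 primaries with 1 replica = 10 total shards) can be sketched as a small helper; the function name is illustrative, not an Elasticsearch API:

```python
def total_shards(primaries: int, replicas_per_primary: int) -> int:
    """Total shard copies the cluster must host: each primary
    is duplicated once per configured replica."""
    return primaries * (1 + replicas_per_primary)

# The products index from the text: 5 primaries, 1 replica each.
print(total_shards(5, 1))  # -> 10
```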
The Inverted Index
The inverted index is the core data structure that makes full-text search fast. It maps each unique term to the list of documents containing that term. Construction: (1) Analysis — the document text is processed through an analyzer: tokenizer (split text into tokens: “running shoes” -> [“running”, “shoes”]), token filters (lowercase: “Running” -> “running”, stemming: “running” -> “run”, stop words removal: remove “the”, “is”, “a”). (2) Indexing — for each token, add the document ID to the token posting list. The posting list for “run” might be: [doc1 (tf:3, positions:[5,23,89]), doc4 (tf:1, positions:[12])]. (3) Storage — the inverted index is stored as sorted term -> posting list entries on disk, with a term dictionary for fast lookup (FST – Finite State Transducer in Lucene). Query execution: for a query “running shoes”, analyze it with the same analyzer (-> “run”, “shoe”), look up each term in the inverted index, intersect the posting lists (documents containing both “run” AND “shoe”), score each document by relevance, and return the top-K results. This is O(L1 + L2) for the intersection, where L1 and L2 are posting list lengths — much faster than scanning all documents.
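The construction and intersection steps above can be sketched in a toy form — a minimal inverted index over whitespace tokens plus the two-pointer posting-list merge, not Lucene's actual on-disk format:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted posting list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Two-pointer merge of two sorted posting lists: O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Toy corpus, already analyzed down to the stems "run" and "shoe".
docs = {1: "run fast", 2: "run shoe", 3: "shoe store", 4: "run shoe run"}
idx = build_inverted_index(docs)
print(intersect(idx["run"], idx["shoe"]))  # -> [2, 4]
```

Real posting lists also carry term frequencies and positions (as in the text), but the intersection logic is the same.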
Analyzers and Text Processing
Analyzers determine how text is broken into searchable terms. An analyzer consists of: character filters (strip HTML tags, convert special characters), a tokenizer (split text into tokens), and token filters (transform tokens). Built-in analyzers: (1) Standard analyzer — lowercase, Unicode tokenization, remove punctuation. Good for most Western languages. (2) English analyzer — standard + English stop-word removal + English stemming (running -> run, foxes -> fox). (3) Keyword analyzer — the entire field value is a single token (no splitting). Use for: email addresses, product SKUs, zip codes. Custom analyzers: for e-commerce search, create an analyzer that handles synonyms (laptop = notebook), edge n-grams (“type” is indexed as t, ty, typ, type so prefix input matches, enabling autocomplete), and phonetic matching (Smith sounds like Smyth). Analyzer choice dramatically affects search quality. Mapping configuration: each field in the index mapping specifies its analyzer. The title field might use the English analyzer while the sku field uses the keyword analyzer. The same analyzer must be used at both index time and query time for consistent results.
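The tokenizer -> token filter pipeline can be illustrated with a toy analyzer; the crude suffix-stripping `naive_stem` is an assumption standing in for a real stemmer like Porter, not how the English analyzer actually works:

```python
STOP_WORDS = {"the", "is", "a"}

def naive_stem(token: str) -> str:
    """Crude suffix stripping; a stand-in for Porter stemming."""
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text: str, stop_words=STOP_WORDS) -> list[str]:
    """Toy analyzer: whitespace tokenize -> lowercase -> drop stop words -> stem."""
    tokens = [t.lower().strip(".,!?") for t in text.split()]
    tokens = [t for t in tokens if t not in stop_words]
    return [naive_stem(t) for t in tokens]

print(analyze("The Running Shoes"))  # -> ['run', 'shoe']
```

Applying the same `analyze` function to both documents and queries mirrors the index-time/query-time consistency requirement above.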
Relevance Scoring with BM25
Elasticsearch uses BM25 (Best Matching 25) to score document relevance. BM25 considers: (1) Term Frequency (TF) — how often the query term appears in the document. More occurrences = higher relevance, with diminishing returns (the 10th occurrence adds less than the 1st). (2) Inverse Document Frequency (IDF) — how rare the term is across all documents. Rare terms (elasticsearch) are more informative than common terms (the). IDF = log(1 + (N - n + 0.5) / (n + 0.5)), where N is the total number of documents and n is the number of documents containing the term. (3) Field length normalization — shorter documents score higher than longer documents for the same term frequency. A title containing “elasticsearch” is more relevant than a 10,000-word article mentioning it once. BM25 parameters: k1 (default 1.2, controls TF saturation) and b (default 0.75, controls length normalization). Boosting: multiply a field's score by a weight so that, for example, title matches count 3x more than body matches. Function score queries: combine text relevance with business signals (product popularity, recency, user preferences). A product search might combine the BM25 text score with a sales_rank boost and freshness decay.
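The three factors combine into a per-term score. A minimal sketch of the standard BM25 formula (using the IDF form quoted above; Lucene's implementation differs in details such as per-segment statistics):

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 score for one term/document pair.

    tf:       term frequency in the document
    doc_len:  document length in tokens; avg_doc_len: corpus average
    n_docs:   total documents; doc_freq: documents containing the term
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# TF saturation: 10 occurrences score higher than 1, but far less than 10x.
s1 = bm25_score(tf=1, doc_len=100, avg_doc_len=100, n_docs=1_000_000, doc_freq=100)
s10 = bm25_score(tf=10, doc_len=100, avg_doc_len=100, n_docs=1_000_000, doc_freq=100)
print(s1 < s10 < 10 * s1)  # -> True
```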
Scaling Elasticsearch
Shard sizing: each shard should be 10-50 GB for optimal performance. Too small (many tiny shards) creates overhead from per-shard memory and file descriptors. Too large (few huge shards) makes rebalancing slow and recovery after failure time-consuming. For a 500 GB index: 10-50 primary shards. Index lifecycle management (ILM): for time-series data (logs, events), create daily or weekly indexes (logs-2026-04-20). Older indexes are moved to cheaper storage (warm nodes with HDDs), then eventually deleted. ILM automates this: hot phase (SSD, recent data), warm phase (HDD, older data), cold phase (compressed, rarely accessed), delete phase. Query routing: a search query is sent to a coordinating node, which forwards it to one copy of each shard (primary or replica). Each shard returns its top-K results; the coordinating node merges and re-ranks them. With 10 primary shards and 2 replicas each, 30 shard copies exist, but only one copy of each of the 10 shards is queried, in parallel. Near-real-time search: documents become searchable within about 1 second of indexing (the default refresh interval). For lower search latency, reduce the refresh interval (at the cost of higher indexing overhead).
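The shard-sizing rule of thumb reduces to simple arithmetic; a hedged sketch (the helper name and the 25 GB default target are illustrative choices within the 10-50 GB band from the text):

```python
import math

def primary_shard_count(index_size_gb: float, target_shard_gb: float = 25.0) -> int:
    """Primaries = ceil(index size / target shard size)."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))

print(primary_shard_count(500, 25))  # -> 20 (within the 10-50 range for 500 GB)
```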
Frequently Asked Questions

How does an inverted index make full-text search fast?

An inverted index maps each unique term to the list of documents containing that term. Without it, searching for “running shoes” requires scanning every document — O(N) where N is the total number of documents. With an inverted index: look up “run” in the term dictionary (O(log T) where T is the number of unique terms), get its posting list (the document IDs containing “run”), do the same for “shoe”, and intersect the two lists using a two-pointer merge — O(L1 + L2) where L1 and L2 are the list lengths. For a corpus of 10 million documents where “run” appears in 100K documents and “shoe” in 50K, the intersection is O(150K) instead of O(10M) — orders of magnitude faster. Construction: text is analyzed (tokenized, lowercased, stemmed), then each token-to-document mapping is added to the index. The posting list also stores term frequencies and positions for relevance scoring and phrase queries. Lucene (the library underlying Elasticsearch) uses a Finite State Transducer (FST) for the term dictionary, enabling prefix and fuzzy matching in addition to exact lookups.

How does Elasticsearch score document relevance with BM25?

BM25 (Best Matching 25) scores each document on three factors: (1) Term Frequency (TF) — how often the query term appears in the document. More occurrences indicate higher relevance, but with diminishing returns (the 10th occurrence adds less score than the 1st). The k1 parameter (default 1.2) controls the saturation curve. (2) Inverse Document Frequency (IDF) — how rare the term is across all documents. Common terms (“the”, “is”) are less informative; rare terms (“kubernetes”, “elasticsearch”) are more distinctive. IDF = log(1 + (N - n + 0.5) / (n + 0.5)). (3) Field length normalization — shorter fields score higher than longer ones for the same TF. A 5-word title containing “elasticsearch” is more relevant than a 10,000-word article mentioning it once. The b parameter (default 0.75) controls normalization strength. In practice, you boost important fields (title weight 3x, body weight 1x) and combine text scores with business signals (popularity, recency) using function_score queries for optimal search quality.

How do you size Elasticsearch shards for optimal performance?

Each Elasticsearch shard is a complete Lucene index with its own file descriptors, memory overhead, and thread pool usage. Guidelines: (1) Target 10-50 GB per shard. Smaller shards create excessive overhead (thousands of tiny shards waste cluster resources); larger shards make rebalancing slow and lengthen recovery after node failure. (2) Number of shards = estimated_index_size / target_shard_size. For a 200 GB index with a 25 GB target: 8 primary shards. (3) Replicas: 1 replica per primary shard is standard (it doubles the shard count but provides fault tolerance and doubles read throughput). (4) For time-series data (logs), use index lifecycle management: create daily indexes, each with a shard count appropriate for daily volume, and roll over to cheaper storage as indexes age. (5) Avoid changing the shard count after creation — the primary shard count is fixed when the index is created, and changing it requires reindexing. Plan for growth: if you expect 5x data growth, size shards for the future volume or plan to reindex. Keep the total shard count under roughly 1,000 per data node to avoid memory pressure.

When should you use Elasticsearch versus a relational database for search?

Use Elasticsearch when: (1) Full-text search is a core feature — product search, article search, log search. Elasticsearch analyzers (stemming, synonyms, fuzzy matching) provide much better search quality than SQL LIKE or even PostgreSQL full-text search. (2) You need faceted search (filter by category, price range, brand while showing counts per facet); Elasticsearch aggregations handle this efficiently. (3) You need to search across multiple fields with different weights (title matches matter more than description matches). (4) High search throughput is required (thousands of search queries per second). Use a relational database when: (1) Search is simple (exact match or prefix match on indexed columns); PostgreSQL with a B-tree or GIN index handles this well. (2) You need transactional consistency (Elasticsearch is near-real-time — documents become searchable roughly 1 second after indexing). (3) The dataset is small (under 1 million records), where the operational overhead of running Elasticsearch is not justified. Common architecture: use PostgreSQL as the source of truth (writes go there) and Elasticsearch as a read-optimized search index, synced via CDC (e.g., Debezium) or application-level dual writes.