Full-text search is a core feature of many applications, from e-commerce product search to log analysis to knowledge-base search. Elasticsearch is one of the most widely used open-source search engines and has powered search at Wikipedia, GitHub, Stack Overflow, and Netflix. This guide covers Elasticsearch architecture, inverted index internals, relevance scoring, and scaling strategies, all of which come up in system design interviews involving search functionality.
Elasticsearch Architecture
Elasticsearch is a distributed search and analytics engine built on Apache Lucene. Core concepts:
(1) Index: a collection of documents with similar characteristics (like a database table). An e-commerce app might have a products index and a reviews index.
(2) Document: a JSON object stored in an index (like a database row). Each document has an _id and fields.
(3) Shard: an index is divided into shards for horizontal scaling. Each shard is a complete Lucene index, so a products index with 5 primary shards spreads its data across 5 Lucene indexes.
(4) Replica: each primary shard can have one or more replica shards for fault tolerance and read scaling. 5 primary shards with 1 replica each gives 10 shard copies in total (5 primaries and 5 replicas). Queries can be served by either the primary or a replica.

Cluster topology: nodes can be configured as master-eligible nodes (manage cluster state), data nodes (store shards and execute queries), coordinating nodes (route requests and merge results), or ingest nodes (pre-process documents). A production cluster typically has 3 master-eligible nodes, so that a majority can still elect a master if one of them fails, plus multiple data nodes.
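As a rough illustration of the shard arithmetic above, the sketch below builds an index-settings body and counts shard copies. `number_of_shards` and `number_of_replicas` are standard Elasticsearch index settings; the helper `total_shard_copies` and the variable `products_index` are hypothetical names chosen for this example.

```python
# Settings body of the kind sent when creating an index.
# The values mirror the example in the text: 5 primaries, 1 replica each.
products_index = {
    "settings": {
        "number_of_shards": 5,    # primary shards, chosen at index creation
        "number_of_replicas": 1,  # replica copies per primary shard
    }
}


def total_shard_copies(primaries: int, replicas_per_primary: int) -> int:
    """Return primaries plus all replica copies (hypothetical helper)."""
    if primaries < 1 or replicas_per_primary < 0:
        raise ValueError("need at least 1 primary and 0 or more replicas")
    return primaries * (1 + replicas_per_primary)


settings = products_index["settings"]
print(total_shard_copies(settings["number_of_shards"],
                         settings["number_of_replicas"]))  # 10
```

The replica count multiplies the storage footprint, which is why it is usually tuned per index rather than set high everywhere.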
The Inverted Index
The inverted index is the core data structure that makes full-text search fast. It maps each unique term to the list of documents containing that term. Construction:
(1) Analysis: the document text is processed by an analyzer, which applies a tokenizer (splits text into tokens: “running shoes” -> [“running”, “shoes”]) and then token filters (lowercasing: “Running” -> “running”; stemming: “running” -> “run”; stop-word removal: drop “the”, “is”, “a”).
(2) Indexing: for each token, add the document ID to that token's posting list. The posting list for “run” might be: [doc1 (tf:3, positions:[5,23,89]), doc4 (tf:1, positions:[12])].
(3) Storage: the inverted index is stored on disk as sorted term -> posting list entries, with a term dictionary for fast lookup (Lucene uses an FST, a finite state transducer).

Query execution: for the query “running shoes”, analyze it with the same analyzer (-> “run”, “shoe”), look up each term in the inverted index, combine the posting lists, score each matching document by relevance, and return the top-K results. With AND semantics the posting lists are intersected (documents containing both “run” and “shoe”); Elasticsearch's match query defaults to OR, which takes the union instead. Because posting lists are sorted by document ID, intersecting two of them is a linear merge costing O(L1 + L2), where L1 and L2 are the posting list lengths. That is much faster than scanning all documents.
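The construction and query steps above can be sketched in a few lines of Python. This is a toy model, not how Lucene stores data: the stop-word list, the deliberately naive `stem` function, and the sample documents are all assumptions made for illustration.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "is", "a"}


def stem(token: str) -> str:
    """Naive suffix stripper standing in for a real stemmer."""
    if token.endswith("ing") and len(token) > 5:
        base = token[:-3]
        # Undo a doubled final consonant: "runn" -> "run".
        if len(base) >= 2 and base[-1] == base[-2]:
            base = base[:-1]
        return base
    if token.endswith("s") and len(token) > 3:
        return token[:-1]  # "shoes" -> "shoe"
    return token


def analyze(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map term -> {doc_id: [positions]}; tf is len(positions)."""
    index: dict[str, dict[int, list[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def intersect(a: list[int], b: list[int]) -> list[int]:
    """Two-pointer merge of sorted doc-ID lists, O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index, query: str) -> list[int]:
    """Return IDs of documents containing every query term (AND)."""
    terms = analyze(query)
    if not terms:
        return []
    result = sorted(index.get(terms[0], {}))
    for term in terms[1:]:
        result = intersect(result, sorted(index.get(term, {})))
    return result


docs = {
    1: "Running shoes for the trail",
    2: "The shoe store is open",
    3: "She runs in new running shoes",
}
index = build_index(docs)
print(search(index, "running shoes"))  # [1, 3]
```

Document 2 contains “shoe” but not “run”, so the AND intersection drops it. Scoring is left out here; the sketch only shows term lookup and posting-list merging.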
Analyzers and Text Processing
Analyzers determine how text is broken into searchable terms. An analyzer consists of character filters (strip HTML tags, convert special characters), a tokenizer (split text into tokens), and token filters (transform tokens). Built-in analyzers:
(1) Standard analyzer: Unicode-aware tokenization, lowercasing, and punctuation removal. Good for most Western languages.
(2) English analyzer: standard + English stop-word removal + English stemming. Stemming strips suffixes (running -> run) but does not map irregular forms, so "better" stays "better" rather than becoming "good".
(3) Keyword analyzer: the entire field value is a single token (no splitting). Use for email addresses, product SKUs, and zip codes.

Custom analyzers: for e-commerce search, create an analyzer that handles synonyms (laptop = notebook), edge n-grams (indexing the prefixes "typ", "type", "types" of "types" so that partial input matches for autocomplete), and phonetic matching (Smith sounds like Smyth). Analyzer choice strongly affects search quality.

Mapping configuration: each field in the index mapping specifies its analyzer. The title field might use the English analyzer while the sku field uses the keyword analyzer. The same analyzer should normally be used at both index time and query time so that query terms match indexed terms. The main exception is a deliberate asymmetry such as edge n-grams, which are typically applied at index time only.
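To make the edge n-gram idea and the per-field mapping concrete, here is a small sketch. The `edge_ngrams` function is a hypothetical stand-in for what an edge n-gram token filter emits (its `min_gram` and `max_gram` parameters borrow the names Elasticsearch uses), and `products_mapping` is an assumed mapping body for the title and sku fields described above.

```python
def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 5) -> list[str]:
    """Return the prefixes of token with lengths min_gram..max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


# Assumed mapping body: one analyzer per field, as described in the text.
# Many deployments use the "keyword" field type for sku instead of a
# text field with the keyword analyzer; both keep the value as one token.
products_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "english"},
            "sku": {"type": "text", "analyzer": "keyword"},
        }
    }
}

print(edge_ngrams("types", min_gram=3, max_gram=5))  # ['typ', 'type', 'types']
```

Indexing these prefixes means a user who has typed only "typ" already matches the document, at the cost of a larger index.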
Relevance Scoring with BM25
Elasticsearch uses BM25 (Best Matching 25) by default to score document relevance. BM25 considers:
(1) Term frequency (TF): how often the query term appears in the document. More occurrences mean higher relevance, with diminishing returns (the 10th occurrence adds less than the 1st).
(2) Inverse document frequency (IDF): how rare the term is across all documents. Rare terms (elasticsearch) are more informative than common terms (the). IDF = log(1 + (N - n + 0.5) / (n + 0.5)), where N is the total number of documents and n is the number of documents containing the term.
(3) Field length normalization: a match in a shorter field scores higher than a match with the same term frequency in a longer field. A title containing “elasticsearch” is more relevant than a 10,000-word article mentioning it once.

BM25 parameters: k1 (default 1.2) controls how quickly TF saturates, and b (default 0.75) controls how strongly field length is normalized.

Boosting: multiply a field's score by a weight, for example to make title matches count 3x as much as body matches. Function score queries: combine text relevance with business signals (product popularity, recency, user preferences). A product search might combine the BM25 text score with a sales_rank boost and a freshness decay.
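The three factors above can be combined into a single-term score as follows. This sketch uses the textbook BM25 form with the IDF formula from the text; real implementations may differ in constant factors, and the function and parameter names here are assumptions for illustration.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """IDF = log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (total_docs - docs_with_term + 0.5)
                    / (docs_with_term + 0.5))


def bm25_term_score(tf: int, total_docs: int, docs_with_term: int,
                    field_len: int, avg_field_len: float,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Score one query term against one document field."""
    # Length normalization: fields longer than average are penalized.
    norm = 1 - b + b * (field_len / avg_field_len)
    # TF saturation: the ratio approaches (k1 + 1) as tf grows.
    tf_part = (tf * (k1 + 1)) / (tf + k1 * norm)
    return idf(total_docs, docs_with_term) * tf_part


# Diminishing returns: each extra occurrence adds less than the last.
scores = [bm25_term_score(tf, 1000, 10, 100, 100) for tf in (1, 2, 3)]
print([round(later - earlier, 3) for earlier, later in zip(scores, scores[1:])])

# Same term frequency, short title vs. long article body.
short = bm25_term_score(1, 1000, 10, field_len=5, avg_field_len=100)
long_ = bm25_term_score(1, 1000, 10, field_len=10_000, avg_field_len=100)
print(short > long_)  # True
```

A multi-term query score is the sum of these per-term scores, and a field boost simply multiplies the result by the chosen weight.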
Scaling Elasticsearch
Shard sizing: a common rule of thumb is to keep each shard between 10 and 50 GB. Too small (many tiny shards) creates overhead from per-shard memory and file descriptors. Too large (few huge shards) makes rebalancing slow and recovery after a failure time-consuming. For a 500 GB index, that means 10-50 primary shards.

Index lifecycle management (ILM): for time-series data (logs, events), create daily or weekly indexes (logs-2026-04-20). Older indexes are moved to cheaper storage (warm nodes with HDDs), then eventually deleted. ILM automates this with phases: hot (SSD, recent data), warm (HDD, older data), cold (compressed, rarely accessed), and delete.

Query routing: a search query is sent to a coordinating node, which forwards it to one copy of each shard (primary or replica). Each shard returns its top-K results, and the coordinating node merges them into a global top-K. With 10 primary shards and 2 replicas each there are 30 shard copies, but a single search queries only 10 of them in parallel, one copy per shard.

Near-real-time search: documents typically become searchable within about 1 second of indexing, because the default refresh interval is 1 second. If fresher results are required, lower the refresh interval (the index.refresh_interval setting), at the cost of higher indexing overhead.
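Two pieces of the section above reduce to simple arithmetic and a merge, sketched below. `primary_shard_range` applies the 10-50 GB rule of thumb, and `merge_top_k` mimics the coordinating node combining per-shard results. Both function names, the tuple layout `(score, doc_id)`, and the sample hits are assumptions for illustration, not Elasticsearch APIs.

```python
import heapq
import math


def primary_shard_range(index_gb: float, min_shard_gb: float = 10,
                        max_shard_gb: float = 50) -> tuple[int, int]:
    """Return (fewest, most) primary shards that keep shards in range."""
    fewest = math.ceil(index_gb / max_shard_gb)
    most = max(fewest, int(index_gb // min_shard_gb))
    return fewest, most


def merge_top_k(shard_hits: list[list[tuple[float, str]]],
                k: int) -> list[tuple[float, str]]:
    """Merge each shard's (score, doc_id) hits into a global top-K."""
    all_hits = (hit for hits in shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


print(primary_shard_range(500))  # (10, 50)

# One list per queried shard copy, each already limited to that shard's top hits.
shard_hits = [
    [(9.1, "doc-a"), (4.2, "doc-b")],
    [(7.5, "doc-c"), (7.0, "doc-d")],
    [(8.3, "doc-e")],
]
print(merge_top_k(shard_hits, 3))
# [(9.1, 'doc-a'), (8.3, 'doc-e'), (7.5, 'doc-c')]
```

Because every shard returns its own top-K, no document that belongs in the global top-K can be missed by the merge. The work done by the coordinating node grows with the number of shards, which is one more reason to avoid many tiny shards.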