Low Level Design: Elasticsearch Internals

Elasticsearch as Distributed Lucene

Elasticsearch is a distributed search and analytics engine built on top of Apache Lucene. Each shard is an independent Lucene index — it can be searched and indexed in isolation. An Elasticsearch index is split into a configurable number of primary shards (fixed at index creation time) and zero or more replica shards per primary. Replicas serve read traffic (search, get) and provide redundancy — if a primary fails, a replica is promoted.

The shard count bounds horizontal scalability: an index's primary shards can be spread across at most as many data nodes as there are primaries, and replicas extend that to primaries × (replicas + 1) nodes each holding a copy. Replicas add read throughput roughly linearly. A common production pattern is 1 primary + 1 replica per shard, letting an index with N/2 primaries fully occupy N data nodes.
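The capacity arithmetic above can be written out as a small sketch (the function name is ours, not an Elasticsearch API):

```python
def max_data_nodes(primaries: int, replicas_per_primary: int) -> int:
    """Upper bound on data nodes that can each hold at least one
    shard copy of a single index: one node per shard copy."""
    return primaries * (1 + replicas_per_primary)

# 5 primaries with 1 replica each -> 10 shard copies -> up to 10 nodes
print(max_data_nodes(5, 1))  # -> 10
```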

Lucene Segment Architecture

Within each shard, Lucene stores data in segments — immutable units of index data. Newly indexed documents accumulate in an in-memory buffer. When the buffer is flushed (on refresh or when it fills), a new segment is written to disk. Segments are never modified after creation — updates are implemented as a delete (mark old doc deleted in a .del file) plus an insert of the new version.

Each segment contains:

  • Inverted index: per-field term dictionaries and posting lists (doc IDs, positions, frequencies).
  • Stored fields: the original field values stored for retrieval (used when fetching _source).
  • Doc values: column-oriented storage of field values, used for sorting, aggregations, and scripting. Stored in a compressed, columnar format enabling efficient scan.
  • Norms: per-document, per-field length normalization factors used in relevance scoring.
  • Term vectors: optional per-document term frequency/position data for highlighting and more-like-this.

Lucene periodically merges small segments into larger ones via a configurable merge policy (TieredMergePolicy by default). Merging reduces the number of segments, improving query performance (fewer segments to search) and cleaning up deleted documents.

Document Indexing Pipeline

The full path of a document from client to durable storage:

  • Client → coordinating node: any node can act as coordinating node for a request — it routes the request to the correct primary shard.
  • Routing to primary shard: hash(_routing or _id) mod num_primary_shards determines the shard. The coordinating node forwards the document to the node holding that primary.
  • Write to primary: the primary shard writes the document to its translog (write-ahead log) and to the in-memory Lucene buffer. The operation is not yet searchable.
  • Replicate to replicas: the primary forwards the operation to all replica shards in parallel. Each replica writes to its own translog and Lucene buffer.
  • Acknowledge to client: once replicas acknowledge (based on wait_for_active_shards setting), the coordinating node responds to the client.
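The routing step above can be sketched as follows. Note the hash here is only a deterministic stand-in: real Elasticsearch uses Murmur3 over the routing value (which defaults to the _id).

```python
import hashlib

def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick the primary shard for a document: hash(routing) mod
    num_primary_shards. md5 is a stand-in for ES's Murmur3."""
    h = int.from_bytes(hashlib.md5(routing.encode()).digest()[:4], "big")
    return h % num_primary_shards

# The same _id always routes to the same shard -- which is also why
# the primary shard count cannot change without reindexing.
print(shard_for("doc-42", 5))
```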

The document becomes searchable only after a refresh, not immediately after indexing. This is the source of Elasticsearch’s near-real-time (NRT) behavior.

Translog: Write-Ahead Log

Each shard maintains a translog — a write-ahead log that records every indexing operation before it is committed to a Lucene segment. If the Elasticsearch process crashes after acknowledging a write but before the Lucene index is flushed (committed), the translog allows recovery of those operations on restart.

The translog is fsynced to disk at the end of every index, delete, update, or bulk request by default (index.translog.durability: request). This can be relaxed to periodic fsync (async mode) for higher throughput at the cost of potential data loss on crash. The translog is truncated (a new generation started) when a Lucene flush (commit) occurs — at that point the data is safely in a Lucene segment on disk.
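A minimal write-ahead-log sketch of this idea — append each operation and fsync before acknowledging, replay on recovery, truncate after a commit. This is illustrative, not the real Elasticsearch implementation:

```python
import json
import os

class Translog:
    """Toy translog: durable append log of operations, replayable
    on restart, truncated once a Lucene commit has the data."""

    def __init__(self, path: str):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def append(self, op: dict) -> None:
        self.f.write(json.dumps(op) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # durability: request -- fsync before ack

    def replay(self) -> list[dict]:
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]

    def truncate(self) -> None:
        # Called after a Lucene flush (commit): ops are now in segments.
        self.f.close()
        self.f = open(self.path, "w", encoding="utf-8")
```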

Refresh: Making Documents Searchable

A refresh flushes the in-memory Lucene buffer to a new segment on disk and opens a new searcher over the updated segment list. After a refresh, newly indexed documents become visible to search. Refresh is distinct from flush (Lucene commit) — refresh creates a new segment but does not fsync it; the data is in the OS page cache and not yet durable.

By default, Elasticsearch refreshes every 1 second (index.refresh_interval: 1s) — this is why it is described as near-real-time rather than real-time. During bulk indexing, refresh can be set to -1 (disabled) and index.number_of_replicas set to 0 to maximize throughput, then re-enabled and replicas added afterward. The refresh=true parameter on individual index requests forces an immediate refresh at the cost of throughput.
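The visibility rule — buffered documents are invisible until a refresh turns the buffer into a searchable segment — can be modeled in a few lines (a toy model, not Lucene's actual data structures):

```python
class Shard:
    """Toy NRT model: index() buffers, refresh() makes a new
    immutable segment, search() only sees segments."""

    def __init__(self):
        self.buffer = []    # in-memory, not yet searchable
        self.segments = []  # immutable, searchable

    def index(self, doc: str) -> None:
        self.buffer.append(doc)

    def refresh(self) -> None:
        if self.buffer:
            self.segments.append(tuple(self.buffer))  # new segment
            self.buffer = []

    def search(self, term: str) -> list[str]:
        return [d for seg in self.segments for d in seg if term in d]

s = Shard()
s.index("quick brown fox")
print(s.search("fox"))  # [] -- indexed but not yet refreshed
s.refresh()
print(s.search("fox"))  # ['quick brown fox']
```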

Segment Merging

Segment merging runs in the background, combining smaller segments into larger ones. The TieredMergePolicy groups segments into tiers by size and merges within a tier. Benefits of merging: fewer segments to search (each search must visit every segment), removal of deleted document entries (freed space), and better compression of doc values and term dictionaries.

Merging is I/O intensive — it reads all source segments and writes a new segment. Elasticsearch throttles merge I/O (automatically, via the merge scheduler in recent versions; older versions exposed indices.store.throttle settings) to avoid impacting search performance. Force merge (POST /{index}/_forcemerge) can reduce a shard to 1 segment — beneficial for read-only indices (e.g., historical time-based indices) but should never be run on indices still receiving writes, as it can produce segments so large that the normal merge policy will not touch them again.
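The essence of a merge is simple to state in code: several small segments become one larger segment, and entries for deleted documents are dropped along the way (a sketch, not the TieredMergePolicy logic):

```python
def merge(segments: list[list[int]], deleted: set[int]) -> list[list[int]]:
    """Combine segments (lists of doc IDs) into one new segment,
    skipping documents marked deleted in the .del bookkeeping."""
    merged = [doc for seg in segments for doc in seg if doc not in deleted]
    return [merged]  # one new segment replaces all the inputs

segs = [[1, 2], [3, 4], [5]]
print(merge(segs, deleted={2, 4}))  # [[1, 3, 5]]
```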

Inverted Index and Field Data Structures

The inverted index structure varies by field type:

  • text field: analyzed at index time by an analyzer (character filters → tokenizer → token filters). Produces a stream of tokens. The inverted index maps each token → posting list (sorted list of doc IDs, with optional positions and frequencies). Enables full-text search with relevance scoring.
  • keyword field: not analyzed — exact string match. Stored in the inverted index as-is. Used for filtering, aggregations, and sorting.
  • numeric fields (integer, float, date): indexed using BKD trees (block kd-trees; one-dimensional for scalar numeric fields), which are block-based spatial data structures. BKD trees enable efficient range queries and are much more compact than an inverted index for numeric data.
  • geo_point: uses BKD trees for geo-distance and bounding box queries.
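The text-field case above can be made concrete with a toy inverted index — a trivial whitespace "analyzer" and a map from term to a sorted posting list of doc IDs (real Lucene postings also carry frequencies and positions):

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """docs: doc_id -> text. Returns term -> sorted posting list."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # stand-in for the analyzer chain
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

idx = build_inverted_index({1: "quick brown fox", 2: "brown dog"})
print(idx["brown"])  # [1, 2] -- both documents contain "brown"
```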

Mapping and Schema Management

Elasticsearch is schema-on-write: field types are defined in a mapping and cannot be changed for existing data without reindexing. Dynamic mapping auto-detects field types on first document insertion — convenient for development but dangerous in production. Common pitfalls:

  • Mapping explosion: dynamic mapping on arbitrary JSON keys (e.g., user-defined metadata) can create thousands of fields, overloading the cluster state and causing memory pressure. Use dynamic: strict or dynamic: false in production, or use flattened type for arbitrary key-value data.
  • text vs keyword: a field cannot be both analyzed (for full-text search) and exact (for aggregations) unless mapped as multi-field (fields: { raw: { type: keyword } }).
  • Reindexing: to change a field’s mapping, use the Reindex API to copy data to a new index with the corrected mapping, then switch the alias.
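Putting the first two points together, a mapping body combining dynamic: strict with a text/keyword multi-field might look like this (field names are illustrative, but the structure follows the Elasticsearch mapping format):

```python
# Index-creation body: reject unmapped fields, and index "title"
# both analyzed (full-text) and as a raw keyword (aggregations/sorting).
mapping = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "title": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            },
            "created_at": {"type": "date"},
        },
    }
}
```

Queries would then hit `title` for full-text search and `title.raw` for exact-match aggregations.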

Query DSL Execution

A search request follows the scatter-gather pattern:

  • Query phase: coordinating node broadcasts the query to one copy (primary or replica) of every shard. Each shard executes the query locally, scoring documents and collecting the top-K doc IDs + scores into a priority queue. Shards return their top-K results to the coordinating node.
  • Fetch phase: coordinating node merges results from all shards, determines the global top-K, and sends a fetch request to the relevant shards to retrieve the full _source for those documents.
  • Return: coordinating node assembles the final response and returns it to the client.

Deep pagination is expensive: requesting page 1,000 with size 10 (from=9990) forces each shard to return from + size = 10,000 candidates to the coordinating node, which merges 10,000 entries per shard and discards everything but the final 10. Use search_after (keyset pagination) for deep pagination, or the scroll API for full dataset export.
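The coordinating node's merge step in the query phase can be sketched with a heap — each shard contributes its local top-K of (score, doc_id) pairs, and the coordinator selects the global top-K:

```python
import heapq

def global_top_k(shard_results: list[list[tuple]], k: int) -> list[tuple]:
    """shard_results: per-shard lists of (score, doc_id), each the
    shard's local top-K. Returns the global top-K by score."""
    all_hits = [hit for shard in shard_results for hit in shard]
    return heapq.nlargest(k, all_hits)

shards = [[(0.9, "a"), (0.5, "b")], [(0.8, "c"), (0.1, "d")]]
print(global_top_k(shards, 2))  # [(0.9, 'a'), (0.8, 'c')]
```

Only after this merge does the fetch phase retrieve _source for the winning doc IDs.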

Relevance Scoring

Elasticsearch uses BM25 (Best Match 25) as its default similarity algorithm, replacing TF-IDF in ES 5.0+. BM25 score for a term in a document:

score(t, d) = IDF(t) * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl/avgdl))

Where tf is term frequency in the document, dl is document length, avgdl is average document length across the index, k1 (default 1.2) controls term frequency saturation, and b (default 0.75) controls length normalization. IDF is the inverse document frequency — terms appearing in fewer documents have higher IDF.
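The formula above transcribes directly into code. For IDF we use Lucene's BM25 variant ln(1 + (N − n + 0.5) / (n + 0.5)), where N is the total document count and n the number of documents containing the term:

```python
import math

def bm25_term_score(tf, dl, avgdl, N, n, k1=1.2, b=0.75):
    """BM25 score contribution of one term in one document."""
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))

# A rare term (n=10 of 1000 docs) outscores a common term (n=500)
# at the same term frequency and document length:
rare = bm25_term_score(tf=3, dl=100, avgdl=100, N=1000, n=10)
common = bm25_term_score(tf=3, dl=100, avgdl=100, N=1000, n=500)
print(rare > common)  # True
```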

Scores from multiple query clauses are combined (sum for bool should, with optional boost per clause). Function score and script score queries allow injecting custom signals (recency, popularity, geo-proximity) into the final score.

Cluster Coordination

Elasticsearch 7.x+ uses a quorum-based cluster coordination subsystem, similar in spirit to Raft (replacing the previous Zen discovery with its split-brain vulnerabilities), for master election and cluster state management. A subset of nodes are master-eligible; one is elected master. The master is responsible for:

  • Maintaining and publishing the cluster state: index metadata, mappings, settings, shard allocation table.
  • Responding to index creation/deletion requests.
  • Detecting failed nodes and triggering shard reallocation.

Shard allocation awareness allows tagging nodes with attributes (rack, availability zone, region) so Elasticsearch places primaries and replicas in different failure domains. The allocation explain API (GET /_cluster/allocation/explain) diagnoses why a shard is unassigned. In large clusters, dedicated master nodes (not holding data) are recommended to insulate cluster state management from the I/O pressure of data nodes.
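The awareness rule reduces to: never place a replica in the same failure domain as its primary. A sketch, with hypothetical node names and zone attributes:

```python
def pick_replica_node(nodes: dict[str, str], primary_node: str):
    """nodes: node name -> zone attribute (e.g. availability zone).
    Returns a node in a different zone than the primary, or None."""
    primary_zone = nodes[primary_node]
    for name, zone in nodes.items():
        if name != primary_node and zone != primary_zone:
            return name
    return None  # no safe placement: the replica stays unassigned

nodes = {"node-1": "zone-a", "node-2": "zone-a", "node-3": "zone-b"}
print(pick_replica_node(nodes, "node-1"))  # node-3
```

The None branch mirrors what the allocation explain API would report: a replica left unassigned because no node satisfies the awareness constraint.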
