Elasticsearch powers search at Wikipedia, GitHub, Stack Overflow, and Netflix, handling billions of documents with sub-second query latency. While our earlier Elasticsearch guide covered usage, this guide dives into designing the cluster itself — how data is distributed, replicated, indexed, and queried across hundreds of nodes. Essential for infrastructure-focused system design interviews.
Cluster Topology and Node Roles
An Elasticsearch cluster has specialized node roles: (1) Master-eligible nodes (3 or 5) — manage cluster state: index metadata, shard allocation, node membership. Uses a Raft-like consensus protocol. Lightweight workload — small machines with reliable storage. (2) Data nodes — store shards and execute search/indexing operations. CPU, memory, and disk intensive. The workhorse nodes — scale horizontally by adding more data nodes. (3) Coordinating nodes — receive client requests, forward to relevant data nodes, merge results, and return. No data storage. Useful for separating query routing from data processing. (4) Ingest nodes — pre-process documents before indexing (parse JSON, extract fields, run enrichment pipelines). Can be combined with data nodes for small clusters. Production cluster: 3 dedicated master nodes + N data nodes (sized for data volume and query load) + optional coordinating/ingest nodes. For a 10 TB index with 50K queries/sec: approximately 20-30 data nodes (each with 64 GB RAM, NVMe SSD, 16 cores).
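The data-node estimate above can be sanity-checked with a back-of-the-envelope storage calculation. This is a rough sketch, not a capacity-planning tool: the function name, the 2 TB-per-node disk figure, and the 60% fill target are illustrative assumptions (real sizing must also account for query throughput, which is why the text lands at 20-30 nodes rather than the storage-only minimum).

```python
import math

def data_node_count(primary_tb: float, replicas: int = 1,
                    disk_per_node_tb: float = 2.0,
                    max_disk_utilization: float = 0.6) -> int:
    """Rough data-node sizing: total stored data (primaries plus
    replicas) divided by usable disk per node, leaving headroom
    for segment merges, shard recovery, and the disk watermarks."""
    total_tb = primary_tb * (1 + replicas)
    usable_tb = disk_per_node_tb * max_disk_utilization
    return math.ceil(total_tb / usable_tb)

# 10 TB of primary data, 1 replica, assumed 2 TB NVMe per node:
# storage alone demands 17 nodes; 50K queries/sec of CPU load
# is what pushes the fleet to the 20-30 range quoted above.
print(data_node_count(10))
```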
Index, Shard, and Replica Design
An index is divided into primary shards, each replicated. Shard sizing: target 10-50 GB per shard. Too small (many tiny shards): excessive overhead per shard (memory for segment metadata, file descriptors, thread pools). A cluster with 100K tiny shards wastes resources. Too large (few huge shards): slow recovery on node failure (the entire shard must be copied), uneven load distribution, and slow force-merge operations. Formula: shard_count = expected_index_size / target_shard_size. For 500 GB index at 25 GB target: 20 primary shards. Replicas: 1 replica per primary shard is standard. Each primary + replica pair provides: fault tolerance (if the node with the primary fails, the replica promotes) and read scaling (queries can be served by either primary or replica). Total shards: 20 primaries * 2 (primary + 1 replica) = 40 shards distributed across data nodes. Shard allocation: Elasticsearch balances shards across nodes automatically. Primary and its replica are always on different nodes. Rack-awareness: configure allocation.awareness.attributes to spread replicas across availability zones. Index lifecycle: for time-series data (logs), create daily or weekly indices (logs-2026-04-20). Use ILM (Index Lifecycle Management) to transition: hot (SSD, recent) -> warm (HDD, older) -> cold (compressed, rarely accessed) -> delete.
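The sizing formula above is simple enough to encode directly. A minimal sketch, assuming a helper function name of our own choosing; the `number_of_shards` and `number_of_replicas` keys are the real index settings (note that `number_of_shards` is fixed at index creation time):

```python
import math

def shard_count(expected_index_size_gb: float,
                target_shard_gb: float = 25) -> int:
    """shard_count = expected_index_size / target_shard_size,
    rounded up so no shard exceeds the target."""
    return max(1, math.ceil(expected_index_size_gb / target_shard_gb))

primaries = shard_count(500)  # 500 GB at a 25 GB target -> 20 primaries

# Body for index creation (PUT /my-index):
settings = {
    "settings": {
        "number_of_shards": primaries,   # cannot be changed later
        "number_of_replicas": 1,         # one replica per primary
    }
}
```

With 1 replica this yields the 40 total shards (20 primaries + 20 replicas) described above.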
Write Path: Near-Real-Time Indexing
Elasticsearch uses Lucene internally. Each shard is a Lucene index. Write path: (1) The coordinating node receives the index request and routes it to the correct primary shard (hash(doc_id) % num_primary_shards). (2) The primary shard writes the document to an in-memory buffer and the translog (write-ahead log for durability). (3) Every 1 second (default refresh_interval), a refresh writes the buffer to a new Lucene segment. The new segment is immediately searchable. This is “near-real-time” — documents are searchable within about 1 second of indexing. (4) The primary shard replicates the operation to all replica shards. The write is acknowledged to the client only after all in-sync replicas confirm. (5) Periodically, Lucene merges small segments into larger ones (background merge). Merging reclaims space from deleted documents and improves query performance (fewer segments to search). Translog: a refresh does not fsync segments to disk; that happens during a flush (a Lucene commit, which also truncates the translog). If the node crashes before a flush, the translog is replayed to recover documents not yet committed in segments. The translog itself is fsynced on every request (by default) for durability. Bulk indexing: for high-throughput ingestion, use the _bulk API to send hundreds of documents per request. This amortizes network and indexing overhead. Typical bulk size: 5-15 MB or 1000-5000 documents per request.
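The _bulk API takes a newline-delimited JSON (NDJSON) body: one action line followed by one source line per document. A minimal sketch of building such a body (the `bulk_body` helper is our own illustration, not a client-library function):

```python
import json

def bulk_body(index: str, docs: list[dict]) -> str:
    """Build an NDJSON body for POST /_bulk: for each document,
    an action line ({"index": ...}) then the document source."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index}}
        if "_id" in doc:
            action["index"]["_id"] = doc.pop("_id")  # explicit ID if given
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the trailing newline is required

body = bulk_body("logs-2026-04-20", [
    {"_id": "1", "message": "GET /search 200"},
    {"message": "POST /index 201"},  # no _id: Elasticsearch generates one
])
```

Batching a few thousand documents per call, as the text suggests, amortizes the per-request routing, replication, and network overhead.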
Query Path: Distributed Search
A search query is executed across all shards of the target index. Two-phase execution: (1) Query phase — the coordinating node forwards the query to one copy (primary or replica) of each shard. Each shard executes the query against its local Lucene index and returns the top-K document IDs with scores (not full documents). The coordinating node merges the results from all shards, sorting by score, and selects the global top-K. (2) Fetch phase — the coordinating node requests the full documents for the global top-K from the relevant shards. Only the documents needed for the response page are fetched. This two-phase approach avoids transferring full documents from all shards for queries with many matches but small result pages. Performance factors: (1) Number of shards queried — each shard is a separate Lucene search. More shards = more fan-out = higher latency and CPU. (2) Query complexity — full-text queries with many terms, nested queries, and aggregations are more expensive. (3) Segment count per shard — fewer, larger segments are faster to query (fewer segments to merge results from). Force-merge reduces segments but is expensive. (4) Caching — the node-level query cache stores filter results per segment, so its entries survive refreshes that leave their segment untouched; the shard request cache stores whole-request results (hit counts, aggregations) and is invalidated on each refresh (every 1 second for active indices).
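The query-phase merge can be sketched in a few lines: each shard contributes only its local top-K of (score, doc_id) pairs, and the coordinator keeps the global top-K. The function and variable names here are illustrative, not Elasticsearch APIs:

```python
import heapq
from itertools import chain

def global_top_k(shard_results, k):
    """Query phase: merge per-shard (score, doc_id) lists and keep
    only the global top-K by score. Full documents are not moved;
    the fetch phase later retrieves only these K documents."""
    return heapq.nlargest(k, chain.from_iterable(shard_results))

# Each shard returned its local top hits, sorted by score:
shard_a = [(9.1, "a3"), (7.4, "a1")]
shard_b = [(8.8, "b2"), (6.0, "b7")]
top = global_top_k([shard_a, shard_b], 3)
# The fetch phase would now request full documents only for these IDs.
```

The key saving: for a page of 10 results over 20 shards, the coordinator handles 200 small (score, ID) pairs in the query phase but fetches only 10 full documents.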
Aggregations and Analytics
Elasticsearch aggregations compute analytics over the data: (1) Metric aggregations: avg, sum, min, max, cardinality (approximate, via HyperLogLog++), percentiles (approximate, via t-digest). (2) Bucket aggregations: terms (group by field value — top 10 categories), date_histogram (group by time interval — orders per hour), range (group by value range — price brackets), and filters. (3) Pipeline aggregations: derivative (rate of change), moving_fn (moving averages; it replaces the deprecated moving_avg), cumulative_sum. Applied to the output of other aggregations. Aggregations are computed during the query phase: each shard computes its local aggregation result. The coordinating node merges shard-level results into the global aggregation. For terms aggregations: each shard returns its top-N terms. The coordinating node merges and re-ranks. This can be inaccurate for long-tail terms (a term that is #11 on every shard may actually be #1 globally but was not returned by any shard). Increase shard_size to improve accuracy at the cost of more data transfer. doc_values: aggregations and sorting operate on doc_values — a columnar representation of field values stored on disk. doc_values are built at index time and memory-mapped for fast access. Fields that are only searched (not aggregated or sorted) can disable doc_values to save disk space.
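The long-tail inaccuracy of distributed terms aggregations can be demonstrated with a small simulation (the helper names are ours; `n` here plays the role of shard_size):

```python
from collections import Counter

def shard_top_terms(shard_docs, n):
    """Each shard returns only its local top-n terms with counts."""
    return Counter(shard_docs).most_common(n)

def merged_terms(shards, n):
    """The coordinator sums the partial counts it received and re-ranks.
    Terms a shard did not report contribute nothing, even if frequent."""
    total = Counter()
    for shard in shards:
        for term, count in shard_top_terms(shard, n):
            total[term] += count
    return total.most_common(n)

# "x" is only #3 on each shard, so with n=2 no shard reports it,
# even though it is #1 globally (6 documents vs 5 for "a" or "c").
shard1 = ["a"] * 5 + ["b"] * 4 + ["x"] * 3
shard2 = ["c"] * 5 + ["d"] * 4 + ["x"] * 3

print(merged_terms([shard1, shard2], 2))  # "x" missing entirely
print(merged_terms([shard1, shard2], 3))  # larger shard_size: "x" surfaces
```

Raising shard_size widens what each shard reports, trading network transfer and memory for accuracy, exactly as described above.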