Search Indexer Low-Level Design: Document Ingestion, Index Building, and Incremental Updates

Search Indexer Overview

A search indexer transforms raw documents into a data structure that supports fast full-text queries. The pipeline ingests source changes, analyzes text, writes to an inverted index, and manages segment lifecycle for query performance. Near-real-time (NRT) indexing makes new content searchable within seconds without full index rebuilds.

Indexing Pipeline

  1. Source: Database change events via Change Data Capture (CDC) or direct API submissions trigger the pipeline.
  2. Document extractor: Fetches full document fields from the source or CDC payload.
  3. Field analyzer: Applies the text analysis chain to each indexed field.
  4. Index writer: Appends analyzed tokens to the inverted index.
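The four stages can be sketched as a chain of plain functions. This is a minimal illustration, not a production design: `ChangeEvent`, `IndexWriter`, `extract`, `analyze`, and `run_pipeline` are hypothetical names, the analysis step is a placeholder, and the "index" is a simple term-to-doc-id map.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChangeEvent:
    """Hypothetical CDC/API event: carries a full payload or only an id."""
    doc_id: str
    payload: Optional[dict] = None  # None means "fetch the document from the source"


@dataclass
class IndexWriter:
    """Toy writer: term -> list of doc ids, standing in for the real index."""
    postings: dict = field(default_factory=dict)

    def add(self, doc_id: str, tokens: list) -> None:
        for tok in dict.fromkeys(tokens):  # de-duplicate, keep order
            self.postings.setdefault(tok, []).append(doc_id)


def extract(event: ChangeEvent, source: dict) -> dict:
    # Stage 2: prefer the CDC payload, otherwise read from the source.
    return event.payload if event.payload is not None else source[event.doc_id]


def analyze(text: str) -> list:
    # Stage 3: placeholder analysis (lowercase and whitespace split only).
    return text.lower().split()


def run_pipeline(events, source: dict, writer: IndexWriter) -> None:
    for event in events:                  # Stage 1: change events arrive
        doc = extract(event, source)      # Stage 2: document extractor
        tokens = analyze(doc["body"])     # Stage 3: field analyzer
        writer.add(event.doc_id, tokens)  # Stage 4: index writer
```

Keeping each stage a separate function makes it easy to swap the placeholder analyzer for a real analysis chain without touching the other stages.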

Document Schema and Field Configuration

Each document has a defined schema specifying how each field is handled:

  • Fields: id, title, body, tags[], author, created_at, updated_at.
  • Index config per field: analyzed (run through text analysis chain?), stored (raw value retrievable in results?), indexed (included in inverted index for search?).
  • Example: body is analyzed and indexed but not stored (too large). title is analyzed, indexed, and stored. id is indexed but not analyzed (exact match only).
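One way to express the per-field flags is a small config object per field. The sketch below is illustrative: `FieldConfig`, `SCHEMA`, and `split_document` are hypothetical names, and the `stored` flag for `id` is an assumption because the text does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldConfig:
    analyzed: bool  # run through the text analysis chain?
    stored: bool    # raw value retrievable in results?
    indexed: bool   # included in the inverted index?


# Only the three fields from the example are configured here.
SCHEMA = {
    "id":    FieldConfig(analyzed=False, stored=True,  indexed=True),  # stored=True is assumed
    "title": FieldConfig(analyzed=True,  stored=True,  indexed=True),
    "body":  FieldConfig(analyzed=True,  stored=False, indexed=True),
}


def split_document(doc: dict, schema: dict) -> tuple:
    """Return (fields to index, fields to store) for one document."""
    to_index, to_store = {}, {}
    for name, cfg in schema.items():
        if name not in doc:
            continue
        if cfg.indexed:
            # Keep the analyzed flag so the writer knows whether to tokenize.
            to_index[name] = (doc[name], cfg.analyzed)
        if cfg.stored:
            to_store[name] = doc[name]
    return to_index, to_store
```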

Text Analysis Chain

The analysis chain converts raw text into index tokens:

  1. Tokenizer: Split on whitespace and punctuation boundaries.
  2. Lowercase filter: Normalize case so “Python” and “python” match.
  3. Stop word removal: Remove high-frequency words (“the”, “and”, “is”) that contribute no discrimination.
  4. Stemmer: Reduce words to their root form (Porter or Snowball stemmer: “running” → “run”, “indexed” → “index”).
  5. Output: Token stream written to inverted index.
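The chain above can be sketched in a few lines. Note the assumptions: the stop word set contains only the three examples from the text, and `toy_stem` is a deliberately crude suffix stripper written for this illustration. It is not the Porter or Snowball algorithm and will mis-stem many words.

```python
import re

STOP_WORDS = {"the", "and", "is"}  # examples from the text; real lists are longer


def toy_stem(token: str) -> str:
    """Crude suffix stripper for illustration only; NOT Porter or Snowball."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo consonant doubling ("runn" -> "run") but keep "ll", "ss", "zz".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return token


def analyze(text: str) -> list:
    tokens = re.findall(r"\w+", text)                    # 1. tokenize
    tokens = [t.lower() for t in tokens]                 # 2. lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. remove stop words
    return [toy_stem(t) for t in tokens]                 # 4. stem
```

The same `analyze` function must be applied to query text at search time; otherwise a query for "Running" would never match the indexed token "run".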

Inverted Index Structure

The inverted index maps each unique token to its posting list:

term → [{doc_id, term_freq, positions: [3, 17, 42]}, ...]

Positions enable phrase queries (“exact phrase match”). Term frequency feeds into BM25 scoring. The posting list is stored sorted by doc_id to enable efficient boolean AND/OR operations via merge.
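A minimal sketch of this structure follows, assuming documents arrive as already-analyzed token lists. `build_index`, `intersect`, and `phrase_match` are hypothetical helper names; the phrase check handles only two-term phrases, and BM25 scoring is left out.

```python
from collections import defaultdict


def build_index(docs: dict) -> dict:
    """docs maps doc_id -> token list; returns term -> posting list sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # ascending doc_id keeps every posting list sorted
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id]):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append(
                {"doc_id": doc_id, "term_freq": len(pos_list), "positions": pos_list}
            )
    return dict(index)


def intersect(a: list, b: list) -> list:
    """Boolean AND: two-pointer merge over two doc_id-sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i]["doc_id"] == b[j]["doc_id"]:
            out.append(a[i]["doc_id"])
            i += 1
            j += 1
        elif a[i]["doc_id"] < b[j]["doc_id"]:
            i += 1
        else:
            j += 1
    return out


def phrase_match(index: dict, first: str, second: str) -> list:
    """Docs in which `second` occurs at the position right after `first`."""
    a = {p["doc_id"]: p["positions"] for p in index.get(first, [])}
    b = {p["doc_id"]: set(p["positions"]) for p in index.get(second, [])}
    return [d for d in sorted(a.keys() & b.keys())
            if any(pos + 1 in b[d] for pos in a[d])]
```

Because both lists are sorted by doc_id, `intersect` runs in time linear in the combined list length and never needs to hold a hash set of one side in memory.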

Near-Real-Time Indexing with Segments

Lucene's segment model enables NRT indexing without blocking searches while writes are in progress:

  • New documents are written to an in-memory buffer.
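  • Every ~1 second, a refresh writes the buffer out as a new immutable segment and reopens the index reader.
  • The segment is searchable immediately after the refresh. Making it durable is a separate, less frequent commit that fsyncs the segment files to disk.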
  • Segments are immutable: updates are handled as delete + re-insert.
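The buffer-then-segment cycle can be modeled in memory as below. This is a toy under stated assumptions: `NrtIndex`, `Segment`, and the method name `refresh` are hypothetical, nothing is written to disk, and the periodic timer that would trigger the flush roughly once per second is left to the caller.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """Never modified after creation: term -> tuple of doc ids."""
    postings: dict


@dataclass
class NrtIndex:
    buffer: dict = field(default_factory=dict)    # in memory, not yet searchable
    segments: list = field(default_factory=list)  # searchable, append-only

    def add(self, doc_id: int, tokens: list) -> None:
        for tok in dict.fromkeys(tokens):  # one posting per term per doc
            self.buffer.setdefault(tok, []).append(doc_id)

    def refresh(self) -> None:
        """Freeze the buffer into a new segment so its docs become searchable."""
        if not self.buffer:
            return
        frozen = {t: tuple(ids) for t, ids in self.buffer.items()}
        # Build a new list so a reader holding the old list is unaffected.
        self.segments = self.segments + [Segment(frozen)]
        self.buffer = {}

    def search(self, term: str) -> list:
        hits = []
        for seg in self.segments:  # the buffer is deliberately not consulted
            hits.extend(seg.postings.get(term, ()))
        return hits
```

A document added with `add` is invisible to `search` until the next `refresh`, which is exactly the "within seconds" visibility delay of NRT indexing.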

Segment Merge Policy

Many small segments degrade query performance because each query must consult the posting lists of every segment. A background merge process combines small segments into larger ones:

  • Tiered merge policy: merge segments of similar size together.
  • Merges run in the background while the old segments keep serving queries; the new merged segment atomically replaces them once it is complete.
  • Deleted documents are purged during merge (soft deletes become hard deletes).
  • Merge I/O is throttled to avoid impacting query latency.
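The selection and purge steps might look like the sketch below. It is a simplification, not Lucene's actual tiered policy: tiers are powers of ten of the segment's document count, a tier is merged once it holds `per_tier` segments, and I/O throttling is omitted. All function names are hypothetical.

```python
from collections import defaultdict


def tier_of(size: int, base: int = 10) -> int:
    """Integer log: sizes 0-9 -> tier 0, 10-99 -> tier 1, and so on."""
    tier = 0
    while size >= base:
        size //= base
        tier += 1
    return tier


def pick_merges(sizes: dict, per_tier: int = 3) -> list:
    """sizes maps segment name -> doc count; returns groups of similar-size segments."""
    tiers = defaultdict(list)
    for name, size in sorted(sizes.items()):
        tiers[tier_of(size)].append(name)
    return [names[:per_tier] for _, names in sorted(tiers.items())
            if len(names) >= per_tier]


def merge_segments(segments: list, deleted: set) -> dict:
    """Combine term -> doc_id lists, dropping deleted docs (the purge step)."""
    merged = defaultdict(set)
    for seg in segments:
        for term, ids in seg.items():
            merged[term].update(i for i in ids if i not in deleted)
    # Terms whose every posting was deleted disappear from the merged segment.
    return {t: sorted(ids) for t, ids in merged.items() if ids}
```

Merging only segments of similar size avoids repeatedly rewriting one large segment just to absorb a handful of new documents.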

Delete Handling and Partial Updates

Deletes in a segment-based index are lazy:

  • Deleted documents are marked in a per-segment deleted bitset, not removed immediately.
  • Query results exclude deleted docs via bitset filtering.
  • Actual removal happens during the next merge.

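Partial field updates (changing one field without reindexing the whole document) require fetching the stored values of the unchanged fields, merging them with the updated fields, and writing a new document. The old document is soft-deleted. This works only for fields whose raw value is stored; a non-stored field such as body must be re-fetched from the source system.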
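Lazy deletes and partial updates can be sketched together. The assumptions here: the bitset is a Python integer with one bit per doc id, a segment is reduced to its stored fields, and `Segment` and `partial_update` are hypothetical names.

```python
class Segment:
    def __init__(self, stored: dict):
        self.stored = stored  # doc_id -> stored fields
        self.deleted = 0      # bitset as an int: bit i set means doc i is deleted

    def delete(self, doc_id: int) -> None:
        self.deleted |= 1 << doc_id  # mark only; the data stays until a merge

    def is_live(self, doc_id: int) -> bool:
        return ((self.deleted >> doc_id) & 1) == 0

    def live_docs(self) -> list:
        """What a query is allowed to return from this segment."""
        return [d for d in sorted(self.stored) if self.is_live(d)]


def partial_update(old: Segment, new: Segment, doc_id: int, new_id: int,
                   changes: dict) -> None:
    """Read stored fields, overlay the changes, write a new doc, soft-delete the old."""
    merged = {**old.stored[doc_id], **changes}
    new.stored[new_id] = merged
    old.delete(doc_id)
```

The old segment is never rewritten: only its bitset changes, which is what keeps segments immutable while still letting documents disappear from results.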

Index Replication

Index replication follows a primary-replica model (Elasticsearch shard model):

  • Primary shard builds the index and receives all writes.
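  • Replicas copy new segment files from the primary (file-based, or segment, replication). Elasticsearch itself defaults to document replication, where each replica re-indexes the forwarded write; it copies segment files mainly during replica recovery.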
  • Replicas serve read traffic, distributing query load.
  • On primary failure, a replica is promoted to primary.
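For the file-based variant, a replica only needs to work out which segment files it is missing. The sketch below assumes each node can list its files as a filename-to-checksum map; `files_to_copy` is a hypothetical helper and the actual network transfer is not shown.

```python
def files_to_copy(primary: dict, replica: dict) -> tuple:
    """Each map is filename -> checksum. Returns (files to fetch, files to drop)."""
    # Fetch anything the replica lacks or holds with a different checksum.
    fetch = sorted(f for f, c in primary.items() if replica.get(f) != c)
    # Drop files the primary no longer references (for example, merged-away segments).
    drop = sorted(f for f in replica if f not in primary)
    return fetch, drop
```

Because segments are immutable, a file that already exists on the replica with a matching checksum never has to be transferred again, so each sync copies only segments created since the last one.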

Index Aliases and Zero-Downtime Reindex

When a schema change requires a full reindex, aliases prevent downtime:

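  1. Build the new index in the background while the old index serves queries. Writes that arrive during the rebuild must also reach the new index (by dual-writing or by replaying the change stream) so that it is not stale at cutover.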
  2. Point the search alias at the new index atomically when ready.
  3. Delete the old index.
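The alias switch can be modeled with a small in-memory catalog. `AliasTable` and the index names `docs_v1` and `docs_v2` are hypothetical; in this toy the "atomic" switch is a single dictionary assignment, whereas a real cluster would apply the change through its own metadata update mechanism.

```python
class AliasTable:
    """Toy catalog: index name -> docs, alias -> index name."""

    def __init__(self):
        self.indexes = {}
        self.aliases = {}

    def search(self, alias: str, doc_id: str):
        # Clients only ever name the alias, never a concrete index.
        return self.indexes[self.aliases[alias]].get(doc_id)

    def swap(self, alias: str, new_index: str):
        if new_index not in self.indexes:
            raise KeyError(new_index)
        old = self.aliases.get(alias)
        self.aliases[alias] = new_index  # one assignment: readers see old or new
        return old

    def drop(self, name: str) -> None:
        if name in self.aliases.values():
            raise ValueError(f"{name} is still behind an alias")
        del self.indexes[name]
```

A possible usage: build `docs_v2` in the background, call `swap("docs", "docs_v2")`, then `drop` the name that `swap` returned. The guard in `drop` prevents deleting an index that is still serving queries.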

Index Warming

A cold replica has an empty OS page cache, so its first queries are slow because they read from disk. Index warming runs common queries against a new replica before traffic is routed to it, so that hot data is already cached when real users arrive.
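The ordering constraint (warm first, serve second) is the important part and can be shown with a stand-in replica. `FakeReplica` and `warm_then_serve` are hypothetical; the `cache` set merely models pages being pulled into the OS page cache, which real code cannot observe this directly.

```python
class FakeReplica:
    """Stand-in for a replica; `cache` models data already read from disk."""

    def __init__(self):
        self.cache = set()
        self.serving = False
        self.cold_reads = 0

    def search(self, query: str) -> str:
        if query not in self.cache:
            self.cold_reads += 1  # slow path: would read from disk
            self.cache.add(query)
        return f"results for {query}"


def warm_then_serve(replica: FakeReplica, common_queries: list) -> None:
    for q in common_queries:
        replica.search(q)     # results are discarded; filling the cache is the point
    replica.serving = True    # only now is real traffic routed to the replica
```

The list of common queries would typically come from recent query logs, so that the warmed data matches what users actually request.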
