What are the core components of a search index builder system design?

A search index builder typically consists of a document ingestion pipeline, a tokenizer and text normalizer, an inverted index data structure, a postings list storage layer, and an index merge/compaction process. The system must handle document parsing, term frequency computation, and efficient on-disk storage of the final index.

How do you handle index updates and deletions in a search index builder?

Updates and deletions are commonly handled using a tombstone or deletion bitmap approach combined with periodic segment merges. Lucene-style systems write new segments for updates and mark old document IDs as deleted. During a merge, deleted documents are physically removed. This avoids costly in-place rewrites of large index segments.

How do companies like Google and Amazon scale a search index builder to billions of documents?

At Google and Amazon scale, the index is sharded horizontally across thousands of machines. A MapReduce or similar batch pipeline distributes document parsing and term extraction across workers. Partial indexes are built locally and then merged. Consistent hashing or range-based sharding assigns document ranges to index shards, while a global routing layer directs queries to the correct shards.

What data structures are most important when designing a search index builder?

The inverted index is the foundational data structure, mapping each term to a postings list of document IDs and term frequencies. Skip lists or B-trees speed up postings list traversal. Bloom filters can short-circuit lookups for terms that don't exist. For ranked retrieval, additional structures store TF-IDF or BM25 scores. Compressed integer encoding (e.g., variable-byte or delta encoding) is critical for reducing disk I/O.

Low Level Design: Search Index Builder

⏱ 4 min read

What Is a Search Index Builder?

A Search Index Builder is the offline pipeline responsible for transforming raw document corpora into a queryable inverted index. It ingests documents from a distributed store, tokenizes and normalizes text, computes term frequencies, and flushes sorted posting lists to disk. The resulting index is then served by a query engine. Getting this pipeline right determines search latency, recall, and the freshness of results across a large-scale system.

Data Model / Schema

The core tables and structures that support the indexer:

-- Documents awaiting indexing
CREATE TABLE documents (
    doc_id       BIGINT PRIMARY KEY,
    url          TEXT NOT NULL,
    content_hash CHAR(64) NOT NULL,
    raw_body     MEDIUMTEXT,
    lang         VARCHAR(10),
    status       ENUM('pending', 'indexed', 'failed'),
    created_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    indexed_at   TIMESTAMP
);

-- Inverted index posting list
CREATE TABLE postings (
    term         VARCHAR(256) NOT NULL,
    doc_id       BIGINT NOT NULL,
    term_freq    INT NOT NULL,
    positions    BLOB,          -- varint-encoded position list
    PRIMARY KEY (term, doc_id)
);

-- Segment metadata for merge tracking
CREATE TABLE segments (
    segment_id   INT PRIMARY KEY AUTO_INCREMENT,
    path         TEXT NOT NULL,
    doc_count    INT,
    size_bytes   BIGINT,
    merged       BOOLEAN DEFAULT FALSE,
    created_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Core Algorithm and Workflow

The pipeline runs in four stages:

Ingest: A coordinator pulls batches of pending documents from the queue (Kafka topic or database cursor). Each worker claims a shard by setting status = 'indexed' atomically with an optimistic lock.
Parse and Tokenize: Raw HTML is stripped; body text is lowercased, stop-words removed, and stems computed (Porter or Snowball stemmer). Each (term, doc_id, position) triple is emitted to an in-memory buffer.
Sort and Flush: When the buffer reaches a configurable threshold (e.g., 256 MB), it is sorted by term then by doc_id and flushed as an immutable segment file on local SSD. Each segment is a sorted run of posting records.
Merge: A background merger applies an N-way merge sort across segments (similar to LSM-tree compaction). Merged segments are uploaded to distributed object storage (S3 or GCS) and registered in the segments table.

Term frequency normalization uses TF-IDF or BM25 weighting computed incrementally during flush. Document vectors for dense retrieval (ANN search) are generated separately by an embedding model and stored in a vector store.

Failure Handling

Worker crash mid-flush: Segments are written to a temp path and renamed atomically. Incomplete segments are detected by a missing checksum footer and discarded on restart.
Duplicate indexing: The content_hash field allows idempotent re-indexing; if the hash matches the stored value, the worker skips reprocessing.
Merge failure: Merge is idempotent. Input segments are never deleted until the merged output is confirmed; the coordinator retries failed merges from the last registered checkpoint.
Poison documents: Documents that repeatedly cause worker crashes are quarantined by incrementing a retry_count column; after three failures, status is set to failed and an alert is emitted.

Scalability Considerations

Horizontal sharding: Partition the document corpus by doc_id % N across N indexer nodes. Each node owns its shard independently; the query layer fans out and merges ranked results.
Incremental indexing: Maintain a changelog (CDC from the documents table) so only modified documents re-enter the pipeline. Avoid full rebuilds except for major schema changes.
Segment tiering: Keep hot, frequently merged small segments on local NVMe; promote cold large segments to object storage. This minimizes merge I/O amplification.
Backpressure: If the merge queue depth exceeds a watermark, ingest workers slow down to prevent unbounded segment accumulation and memory pressure.

Summary

A well-designed Search Index Builder separates concerns cleanly: ingest, parse, flush, and merge are independent stages with well-defined failure boundaries. Using immutable segments, atomic renames, and content-hash deduplication makes the pipeline robust and restartable. At scale, shard the corpus horizontally and adopt incremental indexing via CDC to keep index freshness high without costly full rebuilds.