Low Level Design: Search Index Builder

What Is a Search Index Builder?

A Search Index Builder is the offline pipeline responsible for transforming raw document corpora into a queryable inverted index. It ingests documents from a distributed store, tokenizes and normalizes text, computes term frequencies, and flushes sorted posting lists to disk. The resulting index is then served by a query engine. Getting this pipeline right determines search latency, recall, and the freshness of results across a large-scale system.

Data Model / Schema

The core tables and structures that support the indexer:

-- Documents awaiting indexing
CREATE TABLE documents (
    doc_id       BIGINT PRIMARY KEY,
    url          TEXT NOT NULL,
    content_hash CHAR(64) NOT NULL,
    raw_body     MEDIUMTEXT,
    lang         VARCHAR(10),
    status       ENUM('pending', 'indexed', 'failed'),
    created_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    indexed_at   TIMESTAMP
);

-- Inverted index posting list
CREATE TABLE postings (
    term         VARCHAR(256) NOT NULL,
    doc_id       BIGINT NOT NULL,
    term_freq    INT NOT NULL,
    positions    BLOB,          -- varint-encoded position list
    PRIMARY KEY (term, doc_id)
);

-- Segment metadata for merge tracking
CREATE TABLE segments (
    segment_id   INT PRIMARY KEY AUTO_INCREMENT,
    path         TEXT NOT NULL,
    doc_count    INT,
    size_bytes   BIGINT,
    merged       BOOLEAN DEFAULT FALSE,
    created_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Core Algorithm and Workflow

The pipeline runs in four stages:

  1. Ingest: A coordinator pulls batches of pending documents from the queue (Kafka topic or database cursor). Each worker claims a shard by setting status = 'indexed' atomically with an optimistic lock.
  2. Parse and Tokenize: Raw HTML is stripped; body text is lowercased, stop-words removed, and stems computed (Porter or Snowball stemmer). Each (term, doc_id, position) triple is emitted to an in-memory buffer.
  3. Sort and Flush: When the buffer reaches a configurable threshold (e.g., 256 MB), it is sorted by term then by doc_id and flushed as an immutable segment file on local SSD. Each segment is a sorted run of posting records.
  4. Merge: A background merger applies an N-way merge sort across segments (similar to LSM-tree compaction). Merged segments are uploaded to distributed object storage (S3 or GCS) and registered in the segments table.

Term frequency normalization uses TF-IDF or BM25 weighting computed incrementally during flush. Document vectors for dense retrieval (ANN search) are generated separately by an embedding model and stored in a vector store.

Failure Handling

  • Worker crash mid-flush: Segments are written to a temp path and renamed atomically. Incomplete segments are detected by a missing checksum footer and discarded on restart.
  • Duplicate indexing: The content_hash field allows idempotent re-indexing; if the hash matches the stored value, the worker skips reprocessing.
  • Merge failure: Merge is idempotent. Input segments are never deleted until the merged output is confirmed; the coordinator retries failed merges from the last registered checkpoint.
  • Poison documents: Documents that repeatedly cause worker crashes are quarantined by incrementing a retry_count column; after three failures, status is set to failed and an alert is emitted.

Scalability Considerations

  • Horizontal sharding: Partition the document corpus by doc_id % N across N indexer nodes. Each node owns its shard independently; the query layer fans out and merges ranked results.
  • Incremental indexing: Maintain a changelog (CDC from the documents table) so only modified documents re-enter the pipeline. Avoid full rebuilds except for major schema changes.
  • Segment tiering: Keep hot, frequently merged small segments on local NVMe; promote cold large segments to object storage. This minimizes merge I/O amplification.
  • Backpressure: If the merge queue depth exceeds a watermark, ingest workers slow down to prevent unbounded segment accumulation and memory pressure.

Summary

A well-designed Search Index Builder separates concerns cleanly: ingest, parse, flush, and merge are independent stages with well-defined failure boundaries. Using immutable segments, atomic renames, and content-hash deduplication makes the pipeline robust and restartable. At scale, shard the corpus horizontally and adopt incremental indexing via CDC to keep index freshness high without costly full rebuilds.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture

Scroll to Top