Search Indexer Overview
A search indexer transforms raw documents into a data structure that supports fast full-text queries. The pipeline ingests source changes, analyzes text, writes to an inverted index, and manages segment lifecycle for query performance. Near-real-time (NRT) indexing makes new content searchable within seconds without full index rebuilds.
Indexing Pipeline
- Source: Database change events via Change Data Capture (CDC) or direct API submissions trigger the pipeline.
- Document extractor: Fetches full document fields from the source or CDC payload.
- Field analyzer: Applies the text analysis chain to each indexed field.
- Index writer: Appends analyzed tokens to the inverted index.
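Wired together, the four stages above can be sketched as follows. This is a toy sketch, not a real indexer API: the dict-shaped CDC payload, the `extract`/`analyze`/`index_document` names, and the plain-dict index are all illustrative, and only the body field is indexed for brevity.

```python
def extract(event):
    """Document extractor: pull fields from a CDC payload (shape assumed)."""
    return {"id": event["id"], "title": event["title"], "body": event["body"]}

def analyze(text):
    """Field analyzer: trivial lowercase + whitespace tokenization stand-in."""
    return text.lower().split()

def index_document(index, doc):
    """Index writer: append (doc_id, position) postings to the inverted index."""
    for position, token in enumerate(analyze(doc["body"])):
        index.setdefault(token, []).append((doc["id"], position))

index = {}
index_document(index, extract({"id": 1, "title": "NRT", "body": "Fast NRT search"}))
# index now maps "fast" -> [(1, 0)], "nrt" -> [(1, 1)], "search" -> [(1, 2)]
```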
Document Schema and Field Configuration
Each document has a defined schema specifying how each field is handled:
- Fields: id, title, body, tags[], author, created_at, updated_at.
- Index config per field: analyzed (run through text analysis chain?), stored (raw value retrievable in results?), indexed (included in inverted index for search?).
- Example:
  - body is analyzed and indexed but not stored (too large).
  - title is analyzed, indexed, and stored.
  - id is indexed but not analyzed (exact match only).
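One way to express this per-field configuration in code (a sketch; the `FieldConfig` and `SCHEMA` names are illustrative, not a real library's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldConfig:
    analyzed: bool  # run through the text analysis chain?
    stored: bool    # raw value retrievable in results?
    indexed: bool   # included in the inverted index for search?

SCHEMA = {
    "id":    FieldConfig(analyzed=False, stored=True,  indexed=True),
    "title": FieldConfig(analyzed=True,  stored=True,  indexed=True),
    "body":  FieldConfig(analyzed=True,  stored=False, indexed=True),
}
```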
Text Analysis Chain
The analysis chain converts raw text into index tokens:
- Tokenizer: Split on whitespace and punctuation boundaries.
- Lowercase filter: Normalize case so “Python” and “python” match.
- Stop word removal: Remove high-frequency words (“the”, “and”, “is”) that contribute no discrimination.
- Stemmer: Reduce words to their root form (Porter or Snowball stemmer: “running” → “run”, “indexed” → “index”).
- Output: Token stream written to inverted index.
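The chain above can be sketched end to end. The `stem` function here is a toy suffix-stripper standing in for a real Porter/Snowball stemmer, and the stop word list is deliberately tiny; both are illustrative assumptions.

```python
import re

STOP_WORDS = {"the", "and", "is"}

def tokenize(text):
    # Split on whitespace and punctuation boundaries.
    return re.findall(r"[A-Za-z0-9]+", text)

def stem(token):
    # Toy suffix stripping; a real system would use Porter or Snowball.
    for suffix in ("ning", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = [t.lower() for t in tokenize(text)]         # lowercase filter
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [stem(t) for t in tokens]                     # stemmer

analyze("Running the indexed Python code")  # -> ["run", "index", "python", "code"]
```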
Inverted Index Structure
The inverted index maps each unique token to its posting list:
term → [{doc_id, term_freq, positions: [3, 17, 42]}, ...]
Positions enable phrase queries (“exact phrase match”). Term frequency feeds into BM25 scoring. The posting list is stored sorted by doc_id to enable efficient boolean AND/OR operations via merge.
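Because posting lists are sorted by doc_id, a boolean AND is a linear merge walk over both lists. A minimal sketch over bare doc_id lists (term frequencies and positions omitted for clarity):

```python
def intersect(postings_a, postings_b):
    """Boolean AND over two posting lists sorted by doc_id (merge walk)."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

intersect([1, 4, 9, 20], [2, 4, 20, 33])  # -> [4, 20]
```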
Near-Real-Time Indexing with Segments
Lucene's segment model enables NRT indexing without locking:
- New documents are written to an in-memory buffer.
- Every ~1 second, the buffer is committed to a new immutable segment on disk.
- The segment is immediately searchable — this is the NRT commit.
- Segments are immutable: updates are handled as delete + re-insert.
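The buffer-then-segment flow can be sketched as below. This is a simplified model, not Lucene's implementation: segments here are plain in-memory tuples rather than on-disk index files, and `refresh` stands in for the periodic NRT commit.

```python
class NRTWriter:
    """Sketch of NRT indexing: buffered docs become visible only on refresh."""

    def __init__(self):
        self.buffer = []    # in-memory buffer of pending documents
        self.segments = []  # immutable, searchable segments

    def add(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        # Called every ~1 second: freeze the buffer into a new immutable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, predicate):
        # Only committed segments are visible; the buffer is not yet searchable.
        return [d for seg in self.segments for d in seg if predicate(d)]
```

Note how a document added but not yet refreshed is invisible to `search`, which is exactly the window the ~1 second refresh interval bounds.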
Segment Merge Policy
Many small segments degrade query performance — each query must scan all segment posting lists. A background merge process combines small segments into larger ones:
- Tiered merge policy: merge segments of similar size together.
- Merge happens offline, new merged segment replaces old segments atomically.
- Deleted documents are purged during merge (soft deletes become hard deletes).
- Merge I/O is throttled to avoid impacting query latency.
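The "merge segments of similar size" idea can be sketched with a toy selection function. This is a simplification: Lucene's actual tiered merge policy also weighs deleted-document ratios, maximum segment size, and merge cost, and the `tier_ratio`/`max_merge` parameters here are invented for illustration.

```python
def pick_merge(segment_sizes, tier_ratio=2.0, max_merge=4):
    """Pick a group of similar-sized segments to merge (toy tiered policy).

    Starting from the smallest segment, greedily group segments whose size
    is within tier_ratio of it, up to max_merge segments.
    Returns the chosen segment indices, or [] if no merge is worthwhile.
    """
    order = sorted(range(len(segment_sizes)), key=lambda i: segment_sizes[i])
    group = [order[0]]
    for idx in order[1:]:
        if (segment_sizes[idx] <= segment_sizes[group[0]] * tier_ratio
                and len(group) < max_merge):
            group.append(idx)
        else:
            break
    return group if len(group) > 1 else []

pick_merge([100, 5, 6, 90, 4])  # groups the three small segments: [4, 1, 2]
```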
Delete Handling and Partial Updates
Deletes in a segment-based index are lazy:
- Deleted documents are marked in a per-segment deleted bitset, not removed immediately.
- Query results exclude deleted docs via bitset filtering.
- Actual removal happens during the next merge.
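The lazy-delete lifecycle can be sketched as follows, with a Python set standing in for the per-segment deleted bitset (a real index would use a compressed bitmap):

```python
class Segment:
    def __init__(self, doc_ids):
        self.doc_ids = tuple(doc_ids)  # immutable segment contents
        self.deleted = set()           # per-segment deleted "bitset"

    def delete(self, doc_id):
        self.deleted.add(doc_id)       # lazy: mark, don't remove

    def live_docs(self):
        # Query path: exclude deleted docs via the bitset filter.
        return [d for d in self.doc_ids if d not in self.deleted]

def merge(*segments):
    # Merge purges soft deletes: the new segment carries no tombstones.
    return Segment(d for s in segments for d in s.live_docs())
```

A document deleted in a segment stops matching queries immediately, but only the next merge actually drops it from the segment data.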
Partial field updates (changing one field without reindexing the whole document) require fetching stored fields for unchanged fields, merging with updated fields, and writing a new document. The old document is soft-deleted.
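That read-merge-rewrite flow can be sketched as below. The `stored_fields` dict plays the role of the document store holding stored field values, and all names are illustrative.

```python
def partial_update(stored_fields, segment_deletes, doc_id, changes, new_doc_id):
    """Sketch of a partial field update in a segment-based index."""
    old = stored_fields[doc_id]
    new_doc = {**old, **changes}         # unchanged fields + updated fields
    segment_deletes.add(doc_id)          # soft-delete the old version
    stored_fields[new_doc_id] = new_doc  # write as a brand-new document
    return new_doc
```

Note that this only works for fields that were stored; an analyzed-but-not-stored field like body cannot be recovered from the index, which is why partial updates need the original field values from somewhere.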
Index Replication
Index replication follows a primary-replica model (Elasticsearch shard model):
- Primary shard builds the index and receives all writes.
- Replicas copy segment files from the primary using file-based replication.
- Replicas serve read traffic, distributing query load.
- On primary failure, a replica is promoted to primary.
Index Aliases and Zero-Downtime Reindex
When a schema change requires a full reindex, aliases prevent downtime:
- Build the new index in the background while old index serves queries.
- Point the search alias at the new index atomically when ready.
- Delete the old index.
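The alias indirection can be sketched as a single pointer update; because queries always resolve through the alias, repointing it moves all traffic in one step. (In Elasticsearch the equivalent is an atomic alias-actions update; the `AliasRegistry` class here is a toy stand-in, and the index names are invented.)

```python
class AliasRegistry:
    """Toy alias table: queries resolve through the alias, so repointing it
    switches all query traffic to the new index in one atomic step."""

    def __init__(self):
        self._aliases = {}

    def point(self, alias, index_name):
        self._aliases[alias] = index_name

    def resolve(self, alias):
        return self._aliases[alias]

registry = AliasRegistry()
registry.point("search", "products_v1")  # old index serves queries
# ... build products_v2 in the background, then cut over:
registry.point("search", "products_v2")  # queries now hit the new index
```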
Index Warming
A cold replica has empty OS page cache — first queries are slow due to disk reads. Index warming pre-executes common queries against a new replica before routing traffic to it, ensuring hot data is in cache when real users hit it.
FAQ
How do you design the document ingestion pipeline for a large-scale search indexer?
Document ingestion begins with a crawler or change-data-capture feed that emits raw documents (HTML, JSON, PDFs) into a message queue (Kafka). A parsing layer extracts structured fields: title, body text, metadata, outbound links, and canonical URL. Text normalization applies tokenization, lowercasing, stopword removal, and stemming or lemmatization. For web-scale systems, deduplication is performed using near-duplicate detection (SimHash or MinHash) before indexing to avoid index bloat. Parsed documents are written to an intermediate store (e.g., an object store like S3 or GCS) and then consumed by index builders. The pipeline is designed for horizontal scalability: each stage is a stateless consumer group that scales independently based on queue lag. Dead-letter queues capture parsing failures for manual inspection, and a schema registry enforces document field contracts across teams.
How does index building work internally, and how do you construct an inverted index at scale?
An inverted index maps each term to a posting list: a sorted list of (document ID, term frequency, position list) tuples. At scale, index building uses a MapReduce or Spark job: the map phase emits (term, doc_id, tf, positions) pairs, and the reduce phase merges and sorts them into posting lists. Posting lists are encoded with delta compression (storing gaps between consecutive doc IDs rather than absolute IDs) and optionally with variable-byte or PForDelta encoding to minimize storage. The final index is partitioned into shards, either by document ID range (for balanced load) or by term hash (for term-centric routing). Each shard is stored as an immutable segment (similar to Lucene's segment model). Segment merging reduces the number of segments over time using a tiered merge policy, keeping query latency low. A document store (forward index) maps doc IDs back to stored fields for result retrieval.
How do you implement incremental index updates without full rebuilds?
Incremental updates use a write-ahead log (WAL) of document changes (inserts, updates, deletes) consumed by an indexing service. For updates, the system marks the old doc ID as deleted (a deletion bitmap) and inserts the new version with a new doc ID, a technique called soft delete. Deletions don't immediately reclaim space; they are reconciled during segment merges. Real-time visibility is achieved by maintaining a small in-memory segment (the "hot" segment) that receives all recent writes and is queried alongside the on-disk segments. Merging the hot segment into disk periodically keeps query performance stable. For distributed systems, each shard independently manages its WAL and segment lifecycle. A global document version store (keyed by canonical URL or doc ID) prevents duplicate indexing of unchanged documents, using a content hash or last-modified timestamp as the version key.
How do you design a search indexer to handle index freshness SLAs, for example new documents visible within 60 seconds?
Meeting a sub-60-second freshness SLA requires a real-time indexing path alongside the batch pipeline. New or updated documents are published to a priority Kafka topic with low-latency consumers that parse, normalize, and write directly to the in-memory hot segment of the relevant shard. The hot segment is flushed to disk every 15-30 seconds, triggering a segment refresh that makes documents queryable. Critical path latency is monitored end-to-end: from document publication to query visibility. Each stage (parsing, feature extraction, shard routing, segment write) has a latency budget. Bottlenecks are addressed by pre-warming parser instances, using async I/O for shard writes, and co-locating the indexing service with shard leaders to avoid cross-rack network hops. A freshness dashboard tracks P50/P99 indexing latency per document type, and an SLA breach triggers an alert and automatic throttling of lower-priority batch indexing jobs to free resources for the real-time path.
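The delta compression mentioned above (storing gaps between consecutive sorted doc IDs instead of absolute IDs) can be sketched as an encode/decode pair; in practice the gaps would then be packed with variable-byte or PForDelta encoding, which this sketch omits.

```python
def delta_encode(doc_ids):
    """Turn a sorted doc-ID list into its first ID plus successive gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps):
    """Rebuild absolute doc IDs by running-summing the gaps."""
    out, total = [], 0
    for gap in gaps:
        total += gap
        out.append(total)
    return out

delta_encode([1000, 1003, 1010, 1011])  # -> [1000, 3, 7, 1]
```

The gaps are small numbers even when the absolute IDs are large, which is what makes the subsequent variable-length encoding effective.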