Search Indexer Overview
A search indexer transforms raw documents into a data structure that supports fast full-text queries. The pipeline ingests source changes, analyzes text, writes to an inverted index, and manages segment lifecycle for query performance. Near-real-time (NRT) indexing makes new content searchable within seconds without full index rebuilds.
Indexing Pipeline
- Source: Database change events via Change Data Capture (CDC) or direct API submissions trigger the pipeline.
- Document extractor: Fetches full document fields from the source or CDC payload.
- Field analyzer: Applies the text analysis chain to each indexed field.
- Index writer: Appends analyzed tokens to the inverted index.
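Wired together, the four stages above can be sketched as follows. This is a toy sketch, not a real indexer API: the dict-shaped CDC payload, the `extract`/`analyze`/`index_document` names, and the plain-dict index are all illustrative, and only the body field is indexed for brevity.

```python
def extract(event):
    """Document extractor: pull fields from a CDC payload (shape assumed)."""
    return {"id": event["id"], "title": event["title"], "body": event["body"]}

def analyze(text):
    """Field analyzer: trivial lowercase + whitespace tokenization stand-in."""
    return text.lower().split()

def index_document(index, doc):
    """Index writer: append (doc_id, position) postings to the inverted index."""
    for position, token in enumerate(analyze(doc["body"])):
        index.setdefault(token, []).append((doc["id"], position))

index = {}
index_document(index, extract({"id": 1, "title": "NRT", "body": "Fast NRT search"}))
# index now maps "fast" -> [(1, 0)], "nrt" -> [(1, 1)], "search" -> [(1, 2)]
```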
Document Schema and Field Configuration
Each document has a defined schema specifying how each field is handled:
- Fields: id, title, body, tags[], author, created_at, updated_at.
- Index config per field: analyzed (run through text analysis chain?), stored (raw value retrievable in results?), indexed (included in inverted index for search?).
- Example:
  - body is analyzed and indexed but not stored (too large).
  - title is analyzed, indexed, and stored.
  - id is indexed but not analyzed (exact match only).
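One way to express this per-field configuration in code (a sketch; the `FieldConfig` and `SCHEMA` names are illustrative, not a real library's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldConfig:
    analyzed: bool  # run through the text analysis chain?
    stored: bool    # raw value retrievable in results?
    indexed: bool   # included in the inverted index for search?

SCHEMA = {
    "id":    FieldConfig(analyzed=False, stored=True,  indexed=True),
    "title": FieldConfig(analyzed=True,  stored=True,  indexed=True),
    "body":  FieldConfig(analyzed=True,  stored=False, indexed=True),
}
```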
Text Analysis Chain
The analysis chain converts raw text into index tokens:
- Tokenizer: Split on whitespace and punctuation boundaries.
- Lowercase filter: Normalize case so “Python” and “python” match.
- Stop word removal: Remove high-frequency words (“the”, “and”, “is”) that contribute no discrimination.
- Stemmer: Reduce words to their root form (Porter or Snowball stemmer: “running” → “run”, “indexed” → “index”).
- Output: Token stream written to inverted index.
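The chain above can be sketched end to end. The `stem` function here is a toy suffix-stripper standing in for a real Porter/Snowball stemmer, and the stop word list is deliberately tiny; both are illustrative assumptions.

```python
import re

STOP_WORDS = {"the", "and", "is"}

def tokenize(text):
    # Split on whitespace and punctuation boundaries.
    return re.findall(r"[A-Za-z0-9]+", text)

def stem(token):
    # Toy suffix stripping; a real system would use Porter or Snowball.
    for suffix in ("ning", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = [t.lower() for t in tokenize(text)]         # lowercase filter
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [stem(t) for t in tokens]                     # stemmer

analyze("Running the indexed Python code")  # -> ["run", "index", "python", "code"]
```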
Inverted Index Structure
The inverted index maps each unique token to its posting list:
term → [{doc_id, term_freq, positions: [3, 17, 42]}, ...]
Positions enable phrase queries (“exact phrase match”). Term frequency feeds into BM25 scoring. The posting list is stored sorted by doc_id to enable efficient boolean AND/OR operations via merge.
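Because posting lists are sorted by doc_id, a boolean AND is a linear merge walk over both lists. A minimal sketch over bare doc_id lists (term frequencies and positions omitted for clarity):

```python
def intersect(postings_a, postings_b):
    """Boolean AND over two posting lists sorted by doc_id (merge walk)."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

intersect([1, 4, 9, 20], [2, 4, 20, 33])  # -> [4, 20]
```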
Near-Real-Time Indexing with Segments
Lucene's segment model enables NRT indexing without locking:
- New documents are written to an in-memory buffer.
- Every ~1 second, the buffer is committed to a new immutable segment on disk.
- The segment is immediately searchable — this is the NRT commit.
- Segments are immutable: updates are handled as delete + re-insert.
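The buffer-then-segment flow can be sketched as below. This is a simplified model, not Lucene's implementation: segments here are plain in-memory tuples rather than on-disk index files, and `refresh` stands in for the periodic NRT commit.

```python
class NRTWriter:
    """Sketch of NRT indexing: buffered docs become visible only on refresh."""

    def __init__(self):
        self.buffer = []    # in-memory buffer of pending documents
        self.segments = []  # immutable, searchable segments

    def add(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        # Called every ~1 second: freeze the buffer into a new immutable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, predicate):
        # Only committed segments are visible; the buffer is not yet searchable.
        return [d for seg in self.segments for d in seg if predicate(d)]
```

Note how a document added but not yet refreshed is invisible to `search`, which is exactly the window the ~1 second refresh interval bounds.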
Segment Merge Policy
Many small segments degrade query performance — each query must scan all segment posting lists. A background merge process combines small segments into larger ones:
- Tiered merge policy: merge segments of similar size together.
- Merge happens offline, new merged segment replaces old segments atomically.
- Deleted documents are purged during merge (soft deletes become hard deletes).
- Merge I/O is throttled to avoid impacting query latency.
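The "merge segments of similar size" idea can be sketched with a toy selection function. This is a simplification: Lucene's actual tiered merge policy also weighs deleted-document ratios, maximum segment size, and merge cost, and the `tier_ratio`/`max_merge` parameters here are invented for illustration.

```python
def pick_merge(segment_sizes, tier_ratio=2.0, max_merge=4):
    """Pick a group of similar-sized segments to merge (toy tiered policy).

    Starting from the smallest segment, greedily group segments whose size
    is within tier_ratio of it, up to max_merge segments.
    Returns the chosen segment indices, or [] if no merge is worthwhile.
    """
    order = sorted(range(len(segment_sizes)), key=lambda i: segment_sizes[i])
    group = [order[0]]
    for idx in order[1:]:
        if (segment_sizes[idx] <= segment_sizes[group[0]] * tier_ratio
                and len(group) < max_merge):
            group.append(idx)
        else:
            break
    return group if len(group) > 1 else []

pick_merge([100, 5, 6, 90, 4])  # groups the three small segments: [4, 1, 2]
```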
Delete Handling and Partial Updates
Deletes in a segment-based index are lazy:
- Deleted documents are marked in a per-segment deleted bitset, not removed immediately.
- Query results exclude deleted docs via bitset filtering.
- Actual removal happens during the next merge.
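The lazy-delete lifecycle can be sketched as follows, with a Python set standing in for the per-segment deleted bitset (a real index would use a compressed bitmap):

```python
class Segment:
    def __init__(self, doc_ids):
        self.doc_ids = tuple(doc_ids)  # immutable segment contents
        self.deleted = set()           # per-segment deleted "bitset"

    def delete(self, doc_id):
        self.deleted.add(doc_id)       # lazy: mark, don't remove

    def live_docs(self):
        # Query path: exclude deleted docs via the bitset filter.
        return [d for d in self.doc_ids if d not in self.deleted]

def merge(*segments):
    # Merge purges soft deletes: the new segment carries no tombstones.
    return Segment(d for s in segments for d in s.live_docs())
```

A document deleted in a segment stops matching queries immediately, but only the next merge actually drops it from the segment data.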
Partial field updates (changing one field without reindexing the whole document) require fetching stored fields for unchanged fields, merging with updated fields, and writing a new document. The old document is soft-deleted.
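That read-merge-rewrite flow can be sketched as below. The `stored_fields` dict plays the role of the document store holding stored field values, and all names are illustrative.

```python
def partial_update(stored_fields, segment_deletes, doc_id, changes, new_doc_id):
    """Sketch of a partial field update in a segment-based index."""
    old = stored_fields[doc_id]
    new_doc = {**old, **changes}         # unchanged fields + updated fields
    segment_deletes.add(doc_id)          # soft-delete the old version
    stored_fields[new_doc_id] = new_doc  # write as a brand-new document
    return new_doc
```

Note that this only works for fields that were stored; an analyzed-but-not-stored field like body cannot be recovered from the index, which is why partial updates need the original field values from somewhere.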
Index Replication
Index replication follows a primary-replica model (Elasticsearch shard model):
- Primary shard builds the index and receives all writes.
- Replicas copy segment files from the primary using file-based replication.
- Replicas serve read traffic, distributing query load.
- On primary failure, a replica is promoted to primary.
Index Aliases and Zero-Downtime Reindex
When a schema change requires a full reindex, aliases prevent downtime:
- Build the new index in the background while old index serves queries.
- Point the search alias at the new index atomically when ready.
- Delete the old index.
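The alias indirection can be sketched as a single pointer update; because queries always resolve through the alias, repointing it moves all traffic in one step. (In Elasticsearch the equivalent is an atomic alias-actions update; the `AliasRegistry` class here is a toy stand-in, and the index names are invented.)

```python
class AliasRegistry:
    """Toy alias table: queries resolve through the alias, so repointing it
    switches all query traffic to the new index in one atomic step."""

    def __init__(self):
        self._aliases = {}

    def point(self, alias, index_name):
        self._aliases[alias] = index_name

    def resolve(self, alias):
        return self._aliases[alias]

registry = AliasRegistry()
registry.point("search", "products_v1")  # old index serves queries
# ... build products_v2 in the background, then cut over:
registry.point("search", "products_v2")  # queries now hit the new index
```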
Index Warming
A cold replica has empty OS page cache — first queries are slow due to disk reads. Index warming pre-executes common queries against a new replica before routing traffic to it, ensuring hot data is in cache when real users hit it.
FAQ
How do you design the document ingestion pipeline for a large-scale search indexer?
Document ingestion begins with a crawler or change-data-capture feed that emits raw documents (HTML, JSON, PDFs) into a message queue (Kafka). A parsing layer extracts structured fields: title, body text, metadata, outbound links, and canonical URL. Text normalization applies tokenization, lowercasing, stopword removal, and stemming or lemmatization. For web-scale systems, deduplication is performed using near-duplicate detection (SimHash or MinHash) before indexing to avoid index bloat. Parsed documents are written to an intermediate store (e.g., an object store like S3 or GCS) and then consumed by index builders. The pipeline is designed for horizontal scalability: each stage is a stateless consumer group that scales independently based on queue lag. Dead-letter queues capture parsing failures for manual inspection, and a schema registry enforces document field contracts across teams.
How does index building work internally, and how do you construct an inverted index at scale?
An inverted index maps each term to a posting list: a sorted list of (document ID, term frequency, position list) tuples. At scale, index building uses a MapReduce or Spark job: the map phase emits (term, doc_id, tf, positions) pairs, and the reduce phase merges and sorts them into posting lists. Posting lists are encoded with delta compression (storing gaps between consecutive doc IDs rather than absolute IDs) and optionally with variable-byte or PForDelta encoding to minimize storage. The final index is partitioned into shards, either by document ID range (for balanced load) or by term hash (for term-centric routing). Each shard is stored as an immutable segment (similar to Lucene's segment model). Segment merging reduces the number of segments over time using a tiered merge policy, keeping query latency low. A document store (forward index) maps doc IDs back to stored fields for result retrieval.
How do you implement incremental index updates without full rebuilds?
Incremental updates use a write-ahead log (WAL) of document changes (inserts, updates, deletes) consumed by an indexing service. For updates, the system marks the old doc ID as deleted (a deletion bitmap) and inserts the new version with a new doc ID, a technique called soft delete. Deletions don't immediately reclaim space; they are reconciled during segment merges. Real-time visibility is achieved by maintaining a small in-memory segment (the "hot" segment) that receives all recent writes and is queried alongside the on-disk segments. Merging the hot segment into disk periodically keeps query performance stable. For distributed systems, each shard independently manages its WAL and segment lifecycle. A global document version store (keyed by canonical URL or doc ID) prevents duplicate indexing of unchanged documents, using a content hash or last-modified timestamp as the version key.
How do you design a search indexer to handle index freshness SLAs, for example new documents visible within 60 seconds?
Meeting a sub-60-second freshness SLA requires a real-time indexing path alongside the batch pipeline. New or updated documents are published to a priority Kafka topic with low-latency consumers that parse, normalize, and write directly to the in-memory hot segment of the relevant shard. The hot segment is flushed to disk every 15-30 seconds, triggering a segment refresh that makes documents queryable. Critical path latency is monitored end-to-end: from document publication to query visibility. Each stage (parsing, feature extraction, shard routing, segment write) has a latency budget. Bottlenecks are addressed by pre-warming parser instances, using async I/O for shard writes, and co-locating the indexing service with shard leaders to avoid cross-rack network hops. A freshness dashboard tracks P50/P99 indexing latency per document type, and an SLA breach triggers an alert and automatic throttling of lower-priority batch indexing jobs to free resources for the real-time path.
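The delta compression mentioned above (storing gaps between consecutive sorted doc IDs instead of absolute IDs) can be sketched as an encode/decode pair; in practice the gaps would then be packed with variable-byte or PForDelta encoding, which this sketch omits.

```python
def delta_encode(doc_ids):
    """Turn a sorted doc-ID list into its first ID plus successive gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps):
    """Rebuild absolute doc IDs by running-summing the gaps."""
    out, total = [], 0
    for gap in gaps:
        total += gap
        out.append(total)
    return out

delta_encode([1000, 1003, 1010, 1011])  # -> [1000, 3, 7, 1]
```

The gaps are small numbers even when the absolute IDs are large, which is what makes the subsequent variable-length encoding effective.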