What is a Search Index?
A search index enables fast full-text queries over large datasets. Without an index, searching 1 billion documents for “buy laptop” requires scanning every document, which is O(n) in the size of the corpus. With an inverted index, the engine reads only the posting lists of the query terms, so the cost scales with the number of documents containing those terms rather than with the whole corpus. Lucene implements the inverted index, and Elasticsearch and Solr are both built on Lucene. Building a search index involves: document ingestion, text analysis (tokenization, normalization), inverted index construction, and ranking (BM25, TF-IDF).
Architecture
Data Sources (DB, Kafka events) → Indexing Pipeline
    → Text Analysis (tokenize, lowercase, stem, stop words)
    → Elasticsearch Index
    → Serving Layer (Search API)
        → Query parsing
        → Elasticsearch query
        → Re-ranking (ML model)
        → Results returned to client
Inverted Index Basics
An inverted index maps each term to the list of documents containing it:
Document 1: "buy laptop deals"
Document 2: "best laptop price"
Document 3: "buy desktop computer"

Inverted Index:
"buy" → [doc1, doc3]
"laptop" → [doc1, doc2]
"deals" → [doc1]
"best" → [doc2]
"price" → [doc2]
"desktop" → [doc3]
"computer" → [doc3]

Query "buy laptop" → intersect([doc1, doc3], [doc1, doc2]) → [doc1]
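The example above can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and in-memory posting lists; the function names `build_inverted_index`, `intersect`, and `search` are illustrative and not part of any library.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each whitespace-separated term to a sorted list of document IDs."""
    postings: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a: list[str], b: list[str]) -> list[str]:
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out: list[str] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index: dict[str, list[str]], query: str) -> list[str]:
    """AND query: return the documents that contain every query term."""
    terms = query.split()
    if not terms:
        return []
    # Start from the shortest posting list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0]
    for posting_list in lists[1:]:
        result = intersect(result, posting_list)
    return result


docs = {
    "doc1": "buy laptop deals",
    "doc2": "best laptop price",
    "doc3": "buy desktop computer",
}
index = build_inverted_index(docs)
print(search(index, "buy laptop"))  # ['doc1']
```

The query touches only the posting lists for "buy" and "laptop", never the documents themselves, which is where the speedup over a full scan comes from.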
Text Analysis Pipeline
Input: "Buy the Best LAPTOP Deals!!!"
1. Tokenize: ["Buy", "the", "Best", "LAPTOP", "Deals", "!!!"]
2. Lowercase: ["buy", "the", "best", "laptop", "deals", "!!!"]
3. Remove stop words: ["buy", "best", "laptop", "deals", "!!!"]
4. Stem (Porter/Snowball): ["buy", "best", "laptop", "deal", "!!!"]
5. Remove punctuation/special chars: ["buy", "best", "laptop", "deal"]
Result tokens: buy, best, laptop, deal
Apply the same analysis to queries as to documents — the query “buying laptops” should match “laptop deals” after stemming.
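The analysis steps can be sketched as a single function that is applied to both documents and queries. This is an illustrative toy, not what Elasticsearch runs: the stop-word list is a tiny made-up sample, and `naive_stem` is a hypothetical suffix stripper standing in for a real Porter/Snowball stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real analyzers ship larger ones).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for"}


def naive_stem(token: str) -> str:
    """Toy suffix stripper. NOT the Porter/Snowball algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list[str]:
    """Tokenize, lowercase, drop stop words, stem, then drop punctuation tokens."""
    # Word runs and punctuation runs become separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [naive_stem(t) for t in tokens]
    return [t for t in tokens if t.isalnum()]


print(analyze("Buy the Best LAPTOP Deals!!!"))  # ['buy', 'best', 'laptop', 'deal']
print(analyze("buying laptops"))                # ['buy', 'laptop']
```

Because the query "buying laptops" and the document "laptop deals" both pass through `analyze`, they share the token `laptop` and therefore match.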
Indexing Pipeline (Near-Real-Time)
New or updated documents must appear in search results within seconds. Pipeline:
- Source change (new product, updated price) → Kafka event
- Indexing worker consumes event → fetches full document from DB
- Applies text analysis, builds field mappings
- Elasticsearch API: PUT /products/_doc/{id} with the document JSON
- Elasticsearch updates the inverted index (near real-time: segment refresh every 1s)
- Document appears in search results within ~1 second
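The worker step in the pipeline above might look roughly like the following. This is a sketch under stated assumptions: the event shape (`{"id": ...}`), the `fetch_document` callback, the `fake_db_lookup` helper, and the `ES_URL` address are all hypothetical, and the Kafka consumer loop is omitted. Only the standard library is used, and the request is built but not sent.

```python
import json
import urllib.request
from typing import Callable

ES_URL = "http://localhost:9200"  # assumed cluster address, adjust as needed


def build_index_request(
    event: dict, fetch_document: Callable[[str], dict]
) -> urllib.request.Request:
    """Turn a change event into a PUT /products/_doc/{id} request."""
    doc_id = event["id"]
    # Fetch the full, current document rather than trusting the event payload.
    document = fetch_document(doc_id)
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/products/_doc/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def fake_db_lookup(doc_id: str) -> dict:
    """Hypothetical stand-in for the database read."""
    return {"title": "Example laptop", "brand": "Acme", "price": 999.0}


request = build_index_request({"id": "42", "type": "price_updated"}, fake_db_lookup)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Indexing by document ID makes the operation idempotent: replaying the same Kafka event overwrites the document with the same content instead of creating a duplicate.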
Elasticsearch Index Design
PUT /products
{
  "mappings": {
    "properties": {
      "title": {"type": "text", "analyzer": "english"},  // full-text search
      "brand": {"type": "keyword"},                      // exact match, faceting
      "price": {"type": "float"},                        // range queries
      "category": {"type": "keyword"},                   // faceting
      "description": {"type": "text", "analyzer": "english"},
      "tags": {"type": "keyword"},
      "created_at": {"type": "date"},
      "title_suggest": {"type": "completion"}            // autocomplete
    }
  },
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "refresh_interval": "1s"
  }
}
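To show how the field types in this mapping are used at query time, here is a hedged sketch that builds a search request body: a `match` query on the `text` field `title`, `term` and `range` filters on the `keyword` and `float` fields, and a `terms` aggregation for category facets. The helper name `build_search_body` and its parameters are illustrative assumptions; the body would be sent to `POST /products/_search`.

```python
import json
from typing import Optional


def build_search_body(
    text: str,
    brand: Optional[str] = None,
    max_price: Optional[float] = None,
    size: int = 10,
) -> dict:
    """Build an Elasticsearch query body for the products mapping."""
    filters = []
    if brand is not None:
        # keyword field: exact match, not analyzed
        filters.append({"term": {"brand": brand}})
    if max_price is not None:
        # numeric field: range filter
        filters.append({"range": {"price": {"lte": max_price}}})
    return {
        "size": size,
        "query": {
            "bool": {
                # text field: analyzed full-text match, contributes to the score
                "must": [{"match": {"title": text}}],
                # filters narrow the result set without affecting the score
                "filter": filters,
            }
        },
        # keyword field: facet counts per category
        "aggs": {"by_category": {"terms": {"field": "category", "size": 10}}},
    }


body = build_search_body("laptop", brand="Acme", max_price=1500)
print(json.dumps(body, indent=2))
```

Putting exact-match conditions in `filter` rather than `must` keeps them out of relevance scoring and lets Elasticsearch cache them.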
Relevance Ranking
Elasticsearch uses BM25 by default. It scores documents by term frequency (TF) in the document, inverse document frequency (IDF) across the corpus, and document length normalization.
Boosting: weight matches in some fields more heavily than others. For example, a title match gets a 3x boost over a description match: {"multi_match": {"query": "laptop", "fields": ["title^3", "description"]}}.
Custom ranking: fetch the top-N (e.g., 1000) candidates from Elasticsearch, then re-rank them with an ML model (LambdaMART, LightGBM) that incorporates user signals: CTR, purchase rate, and personalization features from the feature store.
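To make the BM25 scoring concrete, the following sketch implements the textbook formula over pre-analyzed token lists. It uses k1 = 1.2 and b = 0.75, which are Elasticsearch's documented defaults, but it is not Lucene's implementation, and absolute scores may differ from what a cluster returns. The function name `bm25_scores` is illustrative.

```python
import math
from collections import Counter

K1 = 1.2   # term-frequency saturation
B = 0.75   # strength of document length normalization


def bm25_scores(query_terms: list[str], docs: dict[str, list[str]]) -> dict[str, float]:
    """Score every document against the query with textbook BM25."""
    if not docs:
        return {}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    df: Counter = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores: dict[str, float] = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized via the B term.
            norm = tf[term] + K1 * (1 - B + B * len(tokens) / avg_len)
            score += idf * tf[term] * (K1 + 1) / norm
        scores[doc_id] = score
    return scores


docs = {
    "doc1": ["buy", "laptop", "deal"],
    "doc2": ["best", "laptop", "price"],
    "doc3": ["buy", "desktop", "computer"],
}
scores = bm25_scores(["buy", "laptop"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0])  # doc1
```

doc1 ranks first because it matches both query terms; doc2 and doc3 each match one equally common term at equal length, so they tie.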
Key Design Decisions
- Kafka-driven indexing pipeline — near-real-time updates without DB polling
- Text vs keyword field types — text for full-text search, keyword for exact match and faceting
- BM25 for base ranking, ML re-ranking on top-N candidates
- Shard count = number_of_nodes as a starting point (the primary shard count is fixed at index creation; changing it later requires the split or shrink API, or a reindex)
- Refresh interval=1s — documents searchable within 1s of indexing
- Performance: aggregations run on all matching documents, which is expensive for high-cardinality fields. Optimize: use keyword fields (not analyzed), limit size to top-N values, and cache aggregation results with the query results (Redis, TTL=60s).
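The aggregation caching mentioned above can be sketched with a small TTL cache. This is an in-memory stand-in for Redis, written only to show the expiry logic; the class name `TTLCache`, the cache key format, and the injected `clock` parameter are assumptions made for illustration and testing.

```python
import time
from typing import Any, Callable


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value if still fresh, otherwise recompute and store it."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value


calls = []


def run_aggregation() -> dict:
    """Hypothetical stand-in for an expensive Elasticsearch aggregation."""
    calls.append(1)
    return {"laptops": 120, "desktops": 45}


fake_now = [0.0]
cache = TTLCache(ttl_seconds=60.0, clock=lambda: fake_now[0])

first = cache.get_or_compute("facets:laptop", run_aggregation)   # computed
second = cache.get_or_compute("facets:laptop", run_aggregation)  # served from cache
fake_now[0] = 61.0
third = cache.get_or_compute("facets:laptop", run_aggregation)   # expired, recomputed
print(len(calls))  # 2
```

A 60-second TTL trades slightly stale facet counts for far fewer aggregation requests on popular queries.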