Search Indexing System Low-Level Design

What is a Search Index?

A search index enables fast full-text queries over large datasets. Without an index, searching 1 billion documents for “buy laptop” requires scanning every document, which is O(n) in the corpus size. With an inverted index, the cost is proportional to the lengths of the posting lists for the query terms, which is usually far smaller than n. Lucene is built on inverted indexes, and Elasticsearch and Solr are both built on Lucene. Building a search index involves document ingestion, text analysis (tokenization, normalization), inverted index construction, and ranking (BM25, TF-IDF).

Architecture

Data Sources (DB, Kafka events) → Indexing Pipeline
                                   → Text Analysis (tokenize, lowercase, stem, stop words)
                                   → Elasticsearch Index
                                   → Serving Layer (Search API)
                                     → Query parsing
                                     → Elasticsearch query
                                     → Re-ranking (ML model)
                                     → Results returned to client

Inverted Index Basics

An inverted index maps each term to the list of documents containing it:

Document 1: "buy laptop deals"
Document 2: "best laptop price"
Document 3: "buy desktop computer"

Inverted Index:
  "buy"      → [doc1, doc3]
  "laptop"   → [doc1, doc2]
  "deals"    → [doc1]
  "best"     → [doc2]
  "price"    → [doc2]
  "desktop"  → [doc3]
  "computer" → [doc3]

Query "buy laptop" → intersect([doc1, doc3], [doc1, doc2]) → [doc1]
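
The lookup above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and integer document IDs; the function names (build_inverted_index, intersect, search_and) are made up for this example and are not how Lucene stores postings.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def intersect(a, b):
    """Merge two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search_and(index, query):
    """Return IDs of documents that contain every query term."""
    lists = [index.get(term, []) for term in query.lower().split()]
    if not lists:
        return []
    lists.sort(key=len)  # start from the shortest list to shrink the work
    result = lists[0]
    for posting_list in lists[1:]:
        result = intersect(result, posting_list)
    return result

docs = {1: "buy laptop deals", 2: "best laptop price", 3: "buy desktop computer"}
index = build_inverted_index(docs)
print(search_and(index, "buy laptop"))  # [1]
```

The work done is bounded by the posting lists of the query terms, not by the total number of documents, which is the point of the data structure.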

Text Analysis Pipeline

Input: "Buy the Best LAPTOP Deals!!!"
1. Tokenize: ["Buy", "the", "Best", "LAPTOP", "Deals", "!!!"]
2. Lowercase: ["buy", "the", "best", "laptop", "deals", "!!!"]
3. Remove stop words: ["buy", "best", "laptop", "deals", "!!!"]
4. Stem (Porter/Snowball): ["buy", "best", "laptop", "deal", "!!!"]
5. Remove punctuation/special chars: ["buy", "best", "laptop", "deal"]

Result tokens: buy, best, laptop, deal

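Apply the same analysis to queries as to documents. The query “buying laptops” is analyzed to [buy, laptop], so it matches the indexed term laptop from “laptop deals”; without stemming, “laptops” would not match “laptop” at all.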
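
The five steps can be sketched as a single function that is shared by the indexing and query paths. This is a rough illustration only: the stop-word list is a tiny made-up sample, and naive_stem is a toy suffix stripper standing in for a real Porter or Snowball stemmer, so its output will differ from a production analyzer on many words.

```python
import re

# Tiny illustrative stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for"}

def naive_stem(token):
    """Toy suffix stripper; a stand-in for a real Porter/Snowball stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Run the same analysis chain for documents and for queries."""
    tokens = re.findall(r"\w+|[^\w\s]+", text)            # 1. tokenize
    tokens = [t.lower() for t in tokens]                   # 2. lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]    # 3. remove stop words
    tokens = [naive_stem(t) for t in tokens]               # 4. stem
    tokens = [t for t in tokens if t.isalnum()]            # 5. drop punctuation
    return tokens

print(analyze("Buy the Best LAPTOP Deals!!!"))  # ['buy', 'best', 'laptop', 'deal']
print(analyze("buying laptops"))                # ['buy', 'laptop']
```

Because both calls go through analyze, the query terms and the indexed terms end up in the same normalized form.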

Indexing Pipeline (Near-Real-Time)

New or updated documents must appear in search results within seconds. Pipeline:

  1. Source change (new product, updated price) → Kafka event
  2. Indexing worker consumes event → fetches full document from DB
  3. Applies text analysis, builds field mappings
  4. Elasticsearch API: PUT /products/_doc/{id} with the document JSON
  5. Elasticsearch updates the inverted index (near real-time: segment refresh every 1s)
  6. Document appears in search results within ~1 second
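
Steps 2 to 4 can be sketched as a small event handler. This is a sketch under stated assumptions: the event field product_id, the FAKE_DB dictionary, and the send callback are hypothetical stand-ins for a Kafka consumer, a database client, and an HTTP client. The delete branch is also an assumption about how a removed row might be handled; the pipeline above only describes new and updated documents.

```python
import json

def build_index_request(index, doc):
    """Build the HTTP method, path, and JSON body for indexing one document."""
    return "PUT", f"/{index}/_doc/{doc['id']}", json.dumps(doc, sort_keys=True)

def handle_event(event, fetch_document, send):
    """Process one change event: reload the row, then upsert it into the index."""
    doc = fetch_document(event["product_id"])
    if doc is None:
        # Assumption: the row no longer exists, so remove it from the index.
        send("DELETE", f"/products/_doc/{event['product_id']}", None)
        return
    send(*build_index_request("products", doc))

# Stand-ins for the database and the HTTP client (hypothetical).
FAKE_DB = {42: {"id": 42, "title": "Gaming laptop", "price": 999.0}}
sent = []

def record(method, path, body):
    sent.append((method, path, body))

handle_event({"product_id": 42}, FAKE_DB.get, record)
handle_event({"product_id": 7}, FAKE_DB.get, record)
print(sent[0][:2])  # ('PUT', '/products/_doc/42')
```

Fetching the full document from the DB, rather than trusting the event payload, means that replaying or reordering events still writes the latest state, since PUT with an explicit ID overwrites the previous version.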

Elasticsearch Index Design

PUT /products
{
  "mappings": {
    "properties": {
      "title": {"type": "text", "analyzer": "english"},  // full-text search
      "brand": {"type": "keyword"},                       // exact match, faceting
      "price": {"type": "float"},                         // range queries
      "category": {"type": "keyword"},                    // faceting
      "description": {"type": "text", "analyzer": "english"},
      "tags": {"type": "keyword"},
      "created_at": {"type": "date"},
      "title_suggest": {"type": "completion"}             // autocomplete
    }
  },
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "refresh_interval": "1s"
  }
}
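
To show how the field types are used at query time, here is a hedged sketch of a request body for POST /products/_search built against this mapping. The helper name build_search_body and its parameters are invented for illustration; the query clauses (bool, match, term, range, terms aggregation) are standard query DSL.

```python
import json

def build_search_body(text, brand=None, max_price=None, size=10):
    """Full-text match on title, exact filters on keyword/numeric fields,
    plus a facet on category."""
    filters = []
    if brand is not None:
        filters.append({"term": {"brand": brand}})                # keyword: exact match
    if max_price is not None:
        filters.append({"range": {"price": {"lte": max_price}}})  # float: range query
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],             # text: analyzed match
                "filter": filters,                                # filters do not affect score
            }
        },
        "aggs": {"by_category": {"terms": {"field": "category", "size": 10}}},
    }

body = build_search_body("laptop", brand="Acme", max_price=1000)
print(json.dumps(body, indent=2))
```

Putting brand and price in the filter clause keeps them out of relevance scoring, so only the title match contributes to the BM25 score.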

Relevance Ranking

Elasticsearch uses BM25 by default. It scores documents by term frequency (TF) in the document, inverse document frequency (IDF) across the corpus, and document length normalization.

Boosting: weight matches in the title above matches in the body. For example, {"multi_match": {"query": "laptop", "fields": ["title^3", "description"]}} multiplies the title field's score by 3.

Custom ranking: fetch the top-N candidates (e.g., 1000) from Elasticsearch, then re-rank them with an ML model (e.g., LambdaMART, LightGBM) that incorporates user signals such as CTR, purchase rate, and personalization features from the feature store.
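
The three BM25 ingredients (TF, IDF, length normalization) can be made concrete with a small scoring function. This is a sketch of the classic Okapi BM25 formula with k1=1.2 and b=0.75, the default parameter values in Elasticsearch; Lucene's actual implementation may differ in constant factors, so treat the absolute scores as illustrative and only the ordering as meaningful.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Score one document (a list of analyzed terms) against a query."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)       # term frequency in this document
        n = doc_freq.get(term, 0)        # number of documents containing the term
        if tf == 0 or n == 0:
            continue
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        # Length normalization: long documents need more occurrences to score as high.
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        score += idf * norm
    return score

docs = {
    1: ["buy", "laptop", "deals"],
    2: ["best", "laptop", "price"],
    3: ["buy", "desktop", "computer"],
}
doc_freq = {}
for terms in docs.values():
    for term in set(terms):
        doc_freq[term] = doc_freq.get(term, 0) + 1
avg_len = sum(len(terms) for terms in docs.values()) / len(docs)

scores = {
    doc_id: bm25_score(["buy", "laptop"], terms, doc_freq, len(docs), avg_len)
    for doc_id, terms in docs.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # 1
```

Document 1 contains both query terms and ranks first; documents 2 and 3 each contain one term with the same document frequency and length, so they tie.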

Key Design Decisions

  • Kafka-driven indexing pipeline — near-real-time updates without DB polling
  • Text vs keyword field types — text for full-text search, keyword for exact match and faceting
  • BM25 for base ranking, ML re-ranking on top-N candidates
  • Shard count = number_of_nodes as a starting point (the primary shard count is fixed at index creation; changing it later requires the split or shrink API, or a reindex)
  • Refresh interval=1s — documents searchable within 1s of indexing
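
The two-stage ranking decision (BM25 for recall, ML re-ranking on the top-N candidates) can be sketched as follows. The linear_model function and its weights are hypothetical placeholders for a trained model such as LambdaMART; the feature names ctr and purchase_rate are assumed to come from a feature store.

```python
def rerank(candidates, model_score, top_k=10):
    """Re-order first-stage candidates by a second-stage model score."""
    scored = [(model_score(c), c["id"]) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

def linear_model(candidate):
    # Hypothetical weights; a real system would load a trained ranking model.
    return (0.5 * candidate["bm25"]
            + 3.0 * candidate["ctr"]
            + 5.0 * candidate["purchase_rate"])

# Candidates as they might come back from the first-stage BM25 query.
candidates = [
    {"id": "a", "bm25": 4.0, "ctr": 0.02, "purchase_rate": 0.01},
    {"id": "b", "bm25": 3.0, "ctr": 0.30, "purchase_rate": 0.10},
    {"id": "c", "bm25": 1.0, "ctr": 0.05, "purchase_rate": 0.00},
]
print(rerank(candidates, linear_model))  # ['b', 'a', 'c']
```

Document b overtakes a despite a lower BM25 score because its engagement signals are stronger. Running the model only on the top-N keeps its cost independent of index size.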

Performance: aggregations run on all matching documents, which is expensive for high-cardinality fields. To optimize, use keyword fields (not analyzed), limit size to the top-N values, and cache aggregation results alongside the query results (Redis, TTL=60s).
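
The caching idea can be sketched with an in-memory TTL cache. This is a stand-in only: TTLCache, cache_key, and cached_search are hypothetical names, and a production system would use Redis with a 60-second expiry rather than a process-local dictionary.

```python
import json
import time

class TTLCache:
    """Minimal in-memory cache with per-entry expiry (stand-in for Redis)."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

def cache_key(body):
    """Canonical JSON so that equivalent request bodies share one cache entry."""
    return json.dumps(body, sort_keys=True)

def cached_search(body, run_query, cache):
    """Return cached results if fresh, otherwise run the query and store it."""
    key = cache_key(body)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = run_query(body)
    cache.set(key, result)
    return result
```

A short TTL bounds staleness: facet counts can lag the index by at most the TTL, which is usually acceptable for counts shown next to search results.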
