Search Indexing System Low-Level Design

What is a Search Index?

A search index enables fast full-text queries over large datasets. Without an index, searching 1 billion documents for “buy laptop” requires scanning all documents — O(n) in corpus size. With an inverted index, the query only touches the posting lists of its terms, so query cost is proportional to the lengths of those lists rather than to the corpus. Elasticsearch and Solr (both built on Lucene) rely on inverted indexes. Building a search index involves: document ingestion, text analysis (tokenization, normalization), inverted index construction, and ranking (BM25, TF-IDF).

Architecture

Data Sources (DB, Kafka events) → Indexing Pipeline
                                   → Text Analysis (tokenize, lowercase, stem, stop words)
                                   → Elasticsearch Index
                                   → Serving Layer (Search API)
                                     → Query parsing
                                     → Elasticsearch query
                                     → Re-ranking (ML model)
                                     → Results returned to client

Inverted Index Basics

An inverted index maps each term to the list of documents containing it:

Document 1: "buy laptop deals"
Document 2: "best laptop price"
Document 3: "buy desktop computer"

Inverted Index:
  "buy"     → [doc1, doc3]
  "laptop"  → [doc1, doc2]
  "deals"   → [doc1]
  "best"    → [doc2]
  "price"   → [doc2]
  "desktop" → [doc3]

Query "buy laptop" → intersect([doc1, doc3], [doc1, doc2]) → [doc1]
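The example above can be sketched in Python. This is a toy index (real engines compress posting lists and skip-merge them), but it shows the core idea: posting lists are kept sorted by document ID, so intersection is a linear merge.

```python
from collections import defaultdict

def tokenize(text):
    return text.lower().split()

def build_index(docs):
    # term -> sorted posting list of doc IDs
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for term in set(tokenize(text)):   # set(): one entry per doc
            index[term].append(doc_id)
    return index

def intersect(a, b):
    # merge-style intersection of two sorted posting lists: O(m + n)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

docs = {
    1: "buy laptop deals",
    2: "best laptop price",
    3: "buy desktop computer",
}
index = build_index(docs)
# query "buy laptop" -> intersect the two posting lists
result = intersect(index["buy"], index["laptop"])   # [1]
```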

Text Analysis Pipeline

Input: "Buy the Best LAPTOP Deals!!!"
1. Tokenize (splitting also strips punctuation like "!!!"): ["Buy", "the", "Best", "LAPTOP", "Deals"]
2. Lowercase: ["buy", "the", "best", "laptop", "deals"]
3. Remove stop words: ["buy", "best", "laptop", "deals"]
4. Stem (Porter/Snowball): ["buy", "best", "laptop", "deal"]

Result tokens: buy, best, laptop, deal

Apply the same analysis to queries as to documents — the query “buying laptops” should match “laptop deals” after stemming.
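A minimal sketch of this analysis chain, with each stage as one step. The suffix-stripping `stem` below is a crude stand-in for a real Porter/Snowball stemmer, and the stop-word list is illustrative, not exhaustive:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to"}

def tokenize(text):
    # split on non-alphanumeric runs, which also drops punctuation like "!!!"
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def stem(token):
    # crude suffix stripping -- NOT Porter/Snowball, for illustration only
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = tokenize(text)                          # 1. tokenize
    tokens = [t.lower() for t in tokens]             # 2. lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop words
    return [stem(t) for t in tokens]                 # 4. stem
```

Because the same `analyze` runs on both documents and queries, “buying laptops” and “laptop deals” both reduce to tokens containing "laptop".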

Indexing Pipeline (Near-Real-Time)

New or updated documents must appear in search results within seconds. Pipeline:

  1. Source change (new product, updated price) → Kafka event
  2. Indexing worker consumes event → fetches full document from DB
  3. Applies text analysis, builds field mappings
  4. Elasticsearch API: PUT /products/_doc/{id} with the document JSON
  5. Elasticsearch updates the inverted index (near real-time: segment refresh every 1s)
  6. Document appears in search results within ~1 second
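Steps 2-4 above can be sketched as an indexing worker. The Kafka consumer and Elasticsearch client are replaced here by in-memory stand-ins (`db`, `es_index`); in production these would be a real consumer loop and the Elasticsearch index API. All names are hypothetical.

```python
# Stand-ins for the real primary store and the Elasticsearch index.
db = {42: {"id": 42, "title": "Gaming Laptop", "price": 999.0}}
es_index = {}

def fetch_document(doc_id):
    # Step 2: the worker fetches the full document from the DB,
    # since change events usually carry only the ID.
    return db[doc_id]

def handle_event(event):
    """Consume one change event and upsert the document into the index."""
    doc = fetch_document(event["doc_id"])
    # Step 3: text analysis / field mapping happens server-side in
    # Elasticsearch given the index mapping, so the worker ships raw JSON.
    # Step 4: equivalent of PUT /products/_doc/{id}.
    es_index[doc["id"]] = doc

handle_event({"type": "product_updated", "doc_id": 42})
```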

Elasticsearch Index Design

PUT /products
{
  "mappings": {
    "properties": {
      "title": {"type": "text", "analyzer": "english"},  // full-text search
      "brand": {"type": "keyword"},                       // exact match, faceting
      "price": {"type": "float"},                         // range queries
      "category": {"type": "keyword"},                    // faceting
      "description": {"type": "text", "analyzer": "english"},
      "tags": {"type": "keyword"},
      "created_at": {"type": "date"},
      "title_suggest": {"type": "completion"}             // autocomplete
    }
  },
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "refresh_interval": "1s"
  }
}
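A representative query against this mapping, built as a plain Python dict ready to POST to `/products/_search`. The boost, brand value, and price buckets are illustrative choices, not requirements of the design:

```python
# Full-text search on title/description, filtered by brand,
# with a price-range aggregation for faceting.
query = {
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": "laptop",
                    # ^3 gives title matches a 3x score boost
                    "fields": ["title^3", "description"],
                }
            },
            # filters match exactly on keyword fields and are cacheable,
            # and they do not affect the relevance score
            "filter": [{"term": {"brand": "Apple"}}],
        }
    },
    "aggs": {
        "price_ranges": {
            "range": {
                "field": "price",
                "ranges": [{"to": 500}, {"from": 500, "to": 1500}, {"from": 1500}],
            }
        }
    },
    "size": 20,
}
```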

Relevance Ranking

Elasticsearch uses BM25 by default: it scores documents by term frequency (TF) within the document, inverse document frequency (IDF) across the corpus, and document-length normalization. Boosting multiplies the score for matches in important fields, e.g. a 3x boost for title over body: {"multi_match": {"query": "laptop", "fields": ["title^3", "description"]}}. Custom ranking: fetch the top-N (e.g., 1,000) candidates from Elasticsearch, then re-rank them with an ML model (LambdaMART, LightGBM) that incorporates user signals: CTR, purchase rate, and personalization features from the feature store.
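The per-term BM25 contribution can be written out directly. This sketch uses Lucene's non-negative IDF variant and the usual defaults k1 = 1.2, b = 0.75; summing it over the query terms gives the document score.

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, n_docs, df, k1=1.2, b=0.75):
    """BM25 contribution of a single query term to one document's score.

    tf: term frequency in the document; df: number of docs containing the
    term; n_docs: corpus size; doc_len/avg_doc_len: length normalization.
    """
    # Lucene's IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)), never negative
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # length-normalized saturation denominator: k1 * (1 - b + b * dl/avgdl)
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)
```

Two properties fall out of the formula: raising tf from 10 to 100 does not multiply the score by 10 (saturation), and for equal tf a shorter document scores higher than a longer one (length normalization).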

Key Design Decisions

  • Kafka-driven indexing pipeline — near-real-time updates without DB polling
  • Text vs keyword field types — text for full-text search, keyword for exact match and faceting
  • BM25 for base ranking, ML re-ranking on top-N candidates
  • Shard count fixed at index creation — size for growth (a common rule of thumb is 20-40 GB per shard); increasing primary shards later requires reindexing or the split API
  • Refresh interval=1s — documents searchable within 1s of indexing


