Search Indexing System Low-Level Design

What is a Search Index?

A search index enables fast full-text queries over large datasets. Without an index, searching 1 billion documents for “buy laptop” requires scanning all documents — O(n) in corpus size. With an inverted index, the query only touches the posting lists of its terms, so query cost is proportional to the lengths of those lists rather than to the corpus. Elasticsearch and Solr (both built on Lucene) rely on inverted indexes. Building a search index involves: document ingestion, text analysis (tokenization, normalization), inverted index construction, and ranking (BM25, TF-IDF).

Architecture

Data Sources (DB, Kafka events) → Indexing Pipeline
                                   → Text Analysis (tokenize, lowercase, stem, stop words)
                                   → Elasticsearch Index
                                   → Serving Layer (Search API)
                                     → Query parsing
                                     → Elasticsearch query
                                     → Re-ranking (ML model)
                                     → Results returned to client

Inverted Index Basics

An inverted index maps each term to the list of documents containing it:

Document 1: "buy laptop deals"
Document 2: "best laptop price"
Document 3: "buy desktop computer"

Inverted Index:
  "buy"     → [doc1, doc3]
  "laptop"  → [doc1, doc2]
  "deals"   → [doc1]
  "best"    → [doc2]
  "price"   → [doc2]
  "desktop" → [doc3]

Query "buy laptop" → intersect([doc1, doc3], [doc1, doc2]) → [doc1]
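The example above can be sketched in Python. This is a toy index (real engines compress posting lists and skip-merge them), but it shows the core idea: posting lists are kept sorted by document ID, so intersection is a linear merge.

```python
from collections import defaultdict

def tokenize(text):
    return text.lower().split()

def build_index(docs):
    # term -> sorted posting list of doc IDs
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for term in set(tokenize(text)):   # set(): one entry per doc
            index[term].append(doc_id)
    return index

def intersect(a, b):
    # merge-style intersection of two sorted posting lists: O(m + n)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

docs = {
    1: "buy laptop deals",
    2: "best laptop price",
    3: "buy desktop computer",
}
index = build_index(docs)
# query "buy laptop" -> intersect the two posting lists
result = intersect(index["buy"], index["laptop"])   # [1]
```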

Text Analysis Pipeline

Input: "Buy the Best LAPTOP Deals!!!"
1. Tokenize (splitting also strips punctuation like "!!!"): ["Buy", "the", "Best", "LAPTOP", "Deals"]
2. Lowercase: ["buy", "the", "best", "laptop", "deals"]
3. Remove stop words: ["buy", "best", "laptop", "deals"]
4. Stem (Porter/Snowball): ["buy", "best", "laptop", "deal"]

Result tokens: buy, best, laptop, deal

Apply the same analysis to queries as to documents — the query “buying laptops” should match “laptop deals” after stemming.
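A minimal sketch of this analysis chain, with each stage as one step. The suffix-stripping `stem` below is a crude stand-in for a real Porter/Snowball stemmer, and the stop-word list is illustrative, not exhaustive:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to"}

def tokenize(text):
    # split on non-alphanumeric runs, which also drops punctuation like "!!!"
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def stem(token):
    # crude suffix stripping -- NOT Porter/Snowball, for illustration only
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = tokenize(text)                          # 1. tokenize
    tokens = [t.lower() for t in tokens]             # 2. lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop words
    return [stem(t) for t in tokens]                 # 4. stem
```

Because the same `analyze` runs on both documents and queries, “buying laptops” and “laptop deals” both reduce to tokens containing "laptop".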

Indexing Pipeline (Near-Real-Time)

New or updated documents must appear in search results within seconds. Pipeline:

  1. Source change (new product, updated price) → Kafka event
  2. Indexing worker consumes event → fetches full document from DB
  3. Applies text analysis, builds field mappings
  4. Elasticsearch API: PUT /products/_doc/{id} with the document JSON
  5. Elasticsearch updates the inverted index (near real-time: segment refresh every 1s)
  6. Document appears in search results within ~1 second
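Steps 2-4 above can be sketched as an indexing worker. The Kafka consumer and Elasticsearch client are replaced here by in-memory stand-ins (`db`, `es_index`); in production these would be a real consumer loop and the Elasticsearch index API. All names are hypothetical.

```python
# Stand-ins for the real primary store and the Elasticsearch index.
db = {42: {"id": 42, "title": "Gaming Laptop", "price": 999.0}}
es_index = {}

def fetch_document(doc_id):
    # Step 2: the worker fetches the full document from the DB,
    # since change events usually carry only the ID.
    return db[doc_id]

def handle_event(event):
    """Consume one change event and upsert the document into the index."""
    doc = fetch_document(event["doc_id"])
    # Step 3: text analysis / field mapping happens server-side in
    # Elasticsearch given the index mapping, so the worker ships raw JSON.
    # Step 4: equivalent of PUT /products/_doc/{id}.
    es_index[doc["id"]] = doc

handle_event({"type": "product_updated", "doc_id": 42})
```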

Elasticsearch Index Design

PUT /products
{
  "mappings": {
    "properties": {
      "title": {"type": "text", "analyzer": "english"},  // full-text search
      "brand": {"type": "keyword"},                       // exact match, faceting
      "price": {"type": "float"},                         // range queries
      "category": {"type": "keyword"},                    // faceting
      "description": {"type": "text", "analyzer": "english"},
      "tags": {"type": "keyword"},
      "created_at": {"type": "date"},
      "title_suggest": {"type": "completion"}             // autocomplete
    }
  },
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "refresh_interval": "1s"
  }
}
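A representative query against this mapping, built as a plain Python dict ready to POST to `/products/_search`. The boost, brand value, and price buckets are illustrative choices, not requirements of the design:

```python
# Full-text search on title/description, filtered by brand,
# with a price-range aggregation for faceting.
query = {
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": "laptop",
                    # ^3 gives title matches a 3x score boost
                    "fields": ["title^3", "description"],
                }
            },
            # filters match exactly on keyword fields and are cacheable,
            # and they do not affect the relevance score
            "filter": [{"term": {"brand": "Apple"}}],
        }
    },
    "aggs": {
        "price_ranges": {
            "range": {
                "field": "price",
                "ranges": [{"to": 500}, {"from": 500, "to": 1500}, {"from": 1500}],
            }
        }
    },
    "size": 20,
}
```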

Relevance Ranking

Elasticsearch uses BM25 by default: it scores documents by term frequency (TF) within the document, inverse document frequency (IDF) across the corpus, and document-length normalization. Boosting multiplies the score for matches in important fields, e.g. a 3x boost for title over body: {"multi_match": {"query": "laptop", "fields": ["title^3", "description"]}}. Custom ranking: fetch the top-N (e.g., 1,000) candidates from Elasticsearch, then re-rank them with an ML model (LambdaMART, LightGBM) that incorporates user signals: CTR, purchase rate, and personalization features from the feature store.
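The per-term BM25 contribution can be written out directly. This sketch uses Lucene's non-negative IDF variant and the usual defaults k1 = 1.2, b = 0.75; summing it over the query terms gives the document score.

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, n_docs, df, k1=1.2, b=0.75):
    """BM25 contribution of a single query term to one document's score.

    tf: term frequency in the document; df: number of docs containing the
    term; n_docs: corpus size; doc_len/avg_doc_len: length normalization.
    """
    # Lucene's IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)), never negative
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # length-normalized saturation denominator: k1 * (1 - b + b * dl/avgdl)
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)
```

Two properties fall out of the formula: raising tf from 10 to 100 does not multiply the score by 10 (saturation), and for equal tf a shorter document scores higher than a longer one (length normalization).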

Key Design Decisions

  • Kafka-driven indexing pipeline — near-real-time updates without DB polling
  • Text vs keyword field types — text for full-text search, keyword for exact match and faceting
  • BM25 for base ranking, ML re-ranking on top-N candidates
  • Shard count fixed at index creation — size for growth (a common rule of thumb is 20-40 GB per shard); increasing primary shards later requires reindexing or the split API
  • Refresh interval=1s — documents searchable within 1s of indexing


