Full-text search finds documents matching natural-language queries across large corpora. Unlike exact-match database queries (WHERE name = 'foo'), full-text search handles stemming (a search for “running” also matches “run” and “runs”), relevance ranking (most relevant results first), synonyms (“automobile” matches “car”), and fuzzy matching (typo tolerance). Elasticsearch and Apache Solr are the dominant implementations, both built on Apache Lucene. Understanding the internals — inverted indexes, BM25 scoring, and indexing pipelines — is essential for designing performant search.
Inverted Index
The inverted index maps terms to the documents containing them. For a document corpus, the analyzer tokenizes text into terms (“The quick brown fox” → [“quick”, “brown”, “fox”]), applies filters (lowercase, stemming, stop word removal), and adds each term to the index with a posting list: the list of (document_id, term_frequency, positions) for all documents containing the term. A query for “brown fox” looks up both terms in the inverted index, intersects the posting lists to find documents containing both, and scores them by relevance. Term lookup is effectively O(1) (hash) or O(log n) (sorted term dictionary); intersecting posting lists is O(sum of the posting-list lengths).
Text Analysis Pipeline
The analysis pipeline transforms raw text into indexed terms. Standard pipeline: character filters (strip HTML, normalize Unicode) → tokenizer (split on whitespace, punctuation) → token filters (lowercase, stop words removal [“the”, “a”, “is”], stemming [Porter/Snowball: “running” → “run”], synonyms [“couch” → “sofa”], n-grams for partial matching). Configure the same analyzer for indexing and querying — if you stem at index time, you must stem at query time or queries like “running” won’t match indexed term “run”. Custom analyzers for specific domains: e-commerce (handle SKUs, model numbers), code search (split on camelCase), multilingual (language-specific stemmers).
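The three pipeline stages can be sketched end to end. This is an illustrative toy, assuming a regex HTML-stripping character filter, a tiny stop-word and synonym list, and a naive suffix-stripping “stemmer” standing in for a real Porter/Snowball implementation:

```python
import re

STOP_WORDS = {"the", "a", "is"}
SYNONYMS = {"couch": "sofa"}

def char_filter(text):
    """Character filter: strip HTML tags before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: split on whitespace and punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def token_filters(tokens):
    """Token filters: lowercase -> stop words -> synonyms -> toy stemmer."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        tok = SYNONYMS.get(tok, tok)
        for suffix in ("ning", "ing", "s"):   # crude stand-in for Porter stemming
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[:-len(suffix)]
                break
        out.append(tok)
    return out

def analyze(text):
    return token_filters(tokenize(char_filter(text)))

print(analyze("The <b>running</b> couch"))  # → ['run', 'sofa']
```

Because the same `analyze` function is applied at index time and query time, the query “running” and the indexed text “runs” both reduce to “run” and therefore match.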
BM25 Relevance Scoring
BM25 (Best Match 25) is the standard relevance ranking formula; a document’s score for a query is the sum of its per-term scores. Score of document D for query term t: score = IDF(t) * (TF(t,D) * (k1+1)) / (TF(t,D) + k1*(1 - b + b*|D|/avgDL)). IDF (Inverse Document Frequency): terms appearing in fewer documents get higher weight — “quantum” is more discriminative than “the”. TF (Term Frequency): documents with more occurrences of the query term score higher, with diminishing returns (k1 parameter, typically 1.2-2.0). Field length normalization: shorter documents that contain the query term score higher than longer ones (b parameter, typically 0.75). Tune k1 and b for your corpus — short fields (titles, product names) benefit from lower b.
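The formula translates directly into code. This sketch uses the Lucene-style non-negative IDF variant (log(1 + (N - df + 0.5)/(df + 0.5))); other BM25 variants differ slightly in the IDF term:

```python
import math

def bm25_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 score of one query term for one document.

    tf: term frequency in the document; df: number of documents containing
    the term; num_docs: corpus size; doc_len/avg_doc_len: field lengths.
    """
    # IDF: rare terms (low df) get a large weight, ubiquitous terms near zero.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalization: b=0 ignores length, b=1 normalizes fully.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)
```

With the defaults, a term appearing in 2 of 1000 documents scores far higher than one appearing in 900 of 1000, and raising tf from 1 to 2 helps much more than raising it from 9 to 10 — the saturation behavior the k1 parameter controls.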
Elasticsearch Index Architecture
An Elasticsearch index is divided into shards (primary shards handle writes and reads; replicas handle reads). Choose shard count based on data volume: target 20-40GB per shard. Too many small shards waste memory and cluster-management overhead; too-large shards slow search (less parallelism) and recovery. The primary shard count is fixed at index creation (changing it requires the _split/_shrink APIs or a full reindex) — plan ahead. For time-series data (logs, events), use index aliases + time-based indices (logs-2024-04-15) with a write alias pointing to today’s index. Roll over to a new index daily or when the index reaches a size threshold. Use Index Lifecycle Management (ILM) to automatically transition old indices to warm/cold/frozen tiers and eventually delete them.
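The time-based pattern can be sketched as Elasticsearch REST request bodies (shown here as Python dicts). The alias name “logs-write”, the thresholds, and the phase ages are illustrative assumptions, not recommendations:

```python
# PUT logs-2024-04-15 — today's index, carrying the write alias.
create_index = {
    "aliases": {"logs-write": {"is_write_index": True}},
    "settings": {"number_of_shards": 3, "number_of_replicas": 1},
}

# POST logs-write/_rollover — roll to a fresh index daily or at a size cap.
rollover = {
    "conditions": {"max_age": "1d", "max_primary_shard_size": "40gb"},
}

# PUT _ilm/policy/logs-policy — age indices through tiers, then delete.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {"actions": {"rollover": rollover["conditions"]}},
            "warm": {"min_age": "7d",
                     "actions": {"shrink": {"number_of_shards": 1}}},
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
}
```

Writers only ever target the alias, so rollover is invisible to the application; searches can hit a wildcard pattern like logs-* and fan out across the daily indices.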
Query Types
Match query: standard full-text query — analyzes the query string and searches the inverted index. Multi-match: search across multiple fields simultaneously (search title and description, boost title matches 2x). Term query: exact match without analysis (for IDs, enums, keywords — not full text). Range query: numeric or date range (price: 10-100, date: last 7 days). Bool query: combines queries with must (AND), should (OR, boosts score), must_not (NOT). Function score query: combine BM25 relevance with a custom scoring function (boost results from in-stock products, boost recent results via recency decay, boost products with higher seller ratings). Nested queries: search within nested objects (product variants).
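Several of these query types compose in a single bool query. A hedged sketch of the query DSL as a Python dict — the field names (title, description, category, price, in_stock, discontinued) are illustrative, not from the source:

```python
query = {
    "query": {
        "bool": {
            "must": [
                # Full-text relevance across two fields; title matches count 2x.
                {"multi_match": {
                    "query": "wireless headphones",
                    "fields": ["title^2", "description"],
                }}
            ],
            "filter": [
                # Filters match or not — they constrain results without scoring.
                {"term": {"category": "electronics"}},
                {"range": {"price": {"gte": 10, "lte": 100}}},
            ],
            "must_not": [
                {"term": {"discontinued": True}},
            ],
            "should": [
                # Optional clause: boosts the score of in-stock products.
                {"term": {"in_stock": True}},
            ],
        }
    }
}
```

Putting the term and range clauses under filter rather than must keeps them out of scoring and lets Elasticsearch cache them, which is the idiomatic split: must/should for relevance, filter/must_not for hard constraints.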
Indexing Pipeline and Freshness
Documents reach the search index through an indexing pipeline. Sync from primary database: change data capture (CDC) publishes database changes to Kafka; a Kafka consumer processes events and calls the Elasticsearch bulk API. Batch reindex: for full rebuilds (mapping changes, analyzer changes), reindex all documents from the database to a new index, then swap the alias. Near-real-time: Elasticsearch makes new documents searchable within about one second (the default refresh_interval of 1s) — not immediately on write. If a write must be instantly visible, pass the refresh parameter on the write request or call _refresh explicitly. For bulk indexing, do the opposite: set refresh_interval to -1 to disable refreshes during the load, then restore it afterward.
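The consumer’s indexing step can be sketched as construction of a _bulk request body, which is newline-delimited JSON (an action line followed by a source line per document, with a trailing newline). Index and field names here are hypothetical:

```python
import json

def bulk_body(index_name, docs):
    """Build an Elasticsearch _bulk payload from (doc_id, source_dict) pairs."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"   # _bulk requires a trailing newline

# Settings bodies for bracketing a bulk load (PUT <index>/_settings):
disable_refresh = {"index": {"refresh_interval": "-1"}}
restore_refresh = {"index": {"refresh_interval": "1s"}}

body = bulk_body("products-v2", [
    ("p1", {"name": "fox plush"}),
    ("p2", {"name": "bear mug"}),
])
```

For a full rebuild, the same body format is sent against a new index (products-v2) while queries continue to hit the alias on the old one; the alias swap at the end makes the cutover atomic.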
Search Quality Tuning
Search quality is measured by Precision (fraction of returned results that are relevant) and Recall (fraction of relevant results that are returned). For e-commerce: Normalized Discounted Cumulative Gain (NDCG) measures whether the most relevant results appear at the top of the list. Tuning: collect user click data (CTR by rank position), use Learning to Rank (LTR) to train a model that reranks results based on user signals, tune analyzer stopwords and synonyms based on zero-result queries, add query spelling correction (did you mean?), and implement query expansion (automatically add synonyms to the query). A/B test search algorithm changes using the same experimentation platform used for product changes.
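NDCG is simple to compute from graded relevance judgments. A minimal sketch, using the standard log2-discounted formulation (relevance grades and positions are illustrative):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each grade is discounted by its rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the ideal (descending-grade) ordering; 1.0 is perfect."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A result list that places grade-3 documents first scores 1.0; burying the only relevant document at rank 3 roughly halves the score, which is exactly the top-of-list sensitivity that makes NDCG the usual offline metric for LTR experiments.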