Full-text search finds documents matching natural-language queries across large corpora. Unlike exact-match database queries (WHERE name = 'foo'), full-text search handles stemming (a search for “running” also matches “run” and “runs”), relevance ranking (most relevant results first), synonyms (“automobile” matches “car”), and fuzzy matching (typo tolerance). Elasticsearch and Apache Solr are the dominant implementations, both built on Apache Lucene. Understanding the internals — inverted indexes, BM25 scoring, and indexing pipelines — is essential for designing performant search.
Inverted Index
The inverted index maps terms to the documents containing them. For a document corpus, the analyzer tokenizes text into terms (“The quick brown fox” → [“quick”, “brown”, “fox”]), applies filters (lowercase, stemming, stop word removal), and adds each term to the index with a posting list: the list of (document_id, term_frequency, positions) for all documents containing the term. A query for “brown fox” looks up both terms in the inverted index, intersects the posting lists to find documents containing both, and scores them by relevance. Term lookup is effectively O(1) (hash) or O(log n) (sorted term dictionary); intersecting posting lists is O(sum of the posting-list lengths).
Text Analysis Pipeline
The analysis pipeline transforms raw text into indexed terms. Standard pipeline: character filters (strip HTML, normalize Unicode) → tokenizer (split on whitespace, punctuation) → token filters (lowercase, stop words removal [“the”, “a”, “is”], stemming [Porter/Snowball: “running” → “run”], synonyms [“couch” → “sofa”], n-grams for partial matching). Configure the same analyzer for indexing and querying — if you stem at index time, you must stem at query time or queries like “running” won’t match indexed term “run”. Custom analyzers for specific domains: e-commerce (handle SKUs, model numbers), code search (split on camelCase), multilingual (language-specific stemmers).
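The three pipeline stages can be sketched end to end. This is an illustrative toy, assuming a regex HTML-stripping character filter, a tiny stop-word and synonym list, and a naive suffix-stripping “stemmer” standing in for a real Porter/Snowball implementation:

```python
import re

STOP_WORDS = {"the", "a", "is"}
SYNONYMS = {"couch": "sofa"}

def char_filter(text):
    """Character filter: strip HTML tags before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: split on whitespace and punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def token_filters(tokens):
    """Token filters: lowercase -> stop words -> synonyms -> toy stemmer."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        tok = SYNONYMS.get(tok, tok)
        for suffix in ("ning", "ing", "s"):   # crude stand-in for Porter stemming
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[:-len(suffix)]
                break
        out.append(tok)
    return out

def analyze(text):
    return token_filters(tokenize(char_filter(text)))

print(analyze("The <b>running</b> couch"))  # → ['run', 'sofa']
```

Because the same `analyze` function is applied at index time and query time, the query “running” and the indexed text “runs” both reduce to “run” and therefore match.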
BM25 Relevance Scoring
BM25 (Best Match 25) is the standard relevance ranking formula; a document’s score for a query is the sum of its per-term scores. Score of document D for query term t: score = IDF(t) * (TF(t,D) * (k1+1)) / (TF(t,D) + k1*(1 - b + b*|D|/avgDL)). IDF (Inverse Document Frequency): terms appearing in fewer documents get higher weight — “quantum” is more discriminative than “the”. TF (Term Frequency): documents with more occurrences of the query term score higher, with diminishing returns (k1 parameter, typically 1.2-2.0). Field length normalization: shorter documents that contain the query term score higher than longer ones (b parameter, typically 0.75). Tune k1 and b for your corpus — short fields (titles, product names) benefit from lower b.
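The formula translates directly into code. This sketch uses the Lucene-style non-negative IDF variant (log(1 + (N - df + 0.5)/(df + 0.5))); other BM25 variants differ slightly in the IDF term:

```python
import math

def bm25_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 score of one query term for one document.

    tf: term frequency in the document; df: number of documents containing
    the term; num_docs: corpus size; doc_len/avg_doc_len: field lengths.
    """
    # IDF: rare terms (low df) get a large weight, ubiquitous terms near zero.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalization: b=0 ignores length, b=1 normalizes fully.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)
```

With the defaults, a term appearing in 2 of 1000 documents scores far higher than one appearing in 900 of 1000, and raising tf from 1 to 2 helps much more than raising it from 9 to 10 — the saturation behavior the k1 parameter controls.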
Elasticsearch Index Architecture
An Elasticsearch index is divided into shards (primary shards handle writes and reads; replicas handle reads). Choose shard count based on data volume: target 20-40GB per shard. Too many small shards waste memory and cluster-management overhead; too-large shards slow search (less parallelism) and recovery. The primary shard count is fixed at index creation (changing it requires the _split/_shrink APIs or a full reindex) — plan ahead. For time-series data (logs, events), use index aliases + time-based indices (logs-2024-04-15) with a write alias pointing to today’s index. Roll over to a new index daily or when the index reaches a size threshold. Use Index Lifecycle Management (ILM) to automatically transition old indices to warm/cold/frozen tiers and eventually delete them.
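The time-based pattern can be sketched as Elasticsearch REST request bodies (shown here as Python dicts). The alias name “logs-write”, the thresholds, and the phase ages are illustrative assumptions, not recommendations:

```python
# PUT logs-2024-04-15 — today's index, carrying the write alias.
create_index = {
    "aliases": {"logs-write": {"is_write_index": True}},
    "settings": {"number_of_shards": 3, "number_of_replicas": 1},
}

# POST logs-write/_rollover — roll to a fresh index daily or at a size cap.
rollover = {
    "conditions": {"max_age": "1d", "max_primary_shard_size": "40gb"},
}

# PUT _ilm/policy/logs-policy — age indices through tiers, then delete.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {"actions": {"rollover": rollover["conditions"]}},
            "warm": {"min_age": "7d",
                     "actions": {"shrink": {"number_of_shards": 1}}},
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
}
```

Writers only ever target the alias, so rollover is invisible to the application; searches can hit a wildcard pattern like logs-* and fan out across the daily indices.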
Query Types
Match query: standard full-text query — analyzes the query string and searches the inverted index. Multi-match: search across multiple fields simultaneously (search title and description, boost title matches 2x). Term query: exact match without analysis (for IDs, enums, keywords — not full text). Range query: numeric or date range (price: 10-100, date: last 7 days). Bool query: combines queries with must (AND), should (OR, boosts score), must_not (NOT). Function score query: combine BM25 relevance with a custom scoring function (boost results from in-stock products, boost recent results via recency decay, boost products with higher seller ratings). Nested queries: search within nested objects (product variants).
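Several of these query types compose in a single bool query. A hedged sketch of the query DSL as a Python dict — the field names (title, description, category, price, in_stock, discontinued) are illustrative, not from the source:

```python
query = {
    "query": {
        "bool": {
            "must": [
                # Full-text relevance across two fields; title matches count 2x.
                {"multi_match": {
                    "query": "wireless headphones",
                    "fields": ["title^2", "description"],
                }}
            ],
            "filter": [
                # Filters match or not — they constrain results without scoring.
                {"term": {"category": "electronics"}},
                {"range": {"price": {"gte": 10, "lte": 100}}},
            ],
            "must_not": [
                {"term": {"discontinued": True}},
            ],
            "should": [
                # Optional clause: boosts the score of in-stock products.
                {"term": {"in_stock": True}},
            ],
        }
    }
}
```

Putting the term and range clauses under filter rather than must keeps them out of scoring and lets Elasticsearch cache them, which is the idiomatic split: must/should for relevance, filter/must_not for hard constraints.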
Indexing Pipeline and Freshness
Documents reach the search index through an indexing pipeline. Sync from primary database: change data capture (CDC) publishes database changes to Kafka; a Kafka consumer processes events and calls the Elasticsearch bulk API. Batch reindex: for full rebuilds (mapping changes, analyzer changes), reindex all documents from the database to a new index, then swap the alias. Near-real-time: Elasticsearch makes new documents searchable within about one second (the default refresh_interval of 1s) — not immediately on write. If a write must be instantly visible, pass the refresh parameter on the write request or call _refresh explicitly. For bulk indexing, do the opposite: set refresh_interval to -1 to disable refreshes during the load, then restore it afterward.
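The consumer’s indexing step can be sketched as construction of a _bulk request body, which is newline-delimited JSON (an action line followed by a source line per document, with a trailing newline). Index and field names here are hypothetical:

```python
import json

def bulk_body(index_name, docs):
    """Build an Elasticsearch _bulk payload from (doc_id, source_dict) pairs."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"   # _bulk requires a trailing newline

# Settings bodies for bracketing a bulk load (PUT <index>/_settings):
disable_refresh = {"index": {"refresh_interval": "-1"}}
restore_refresh = {"index": {"refresh_interval": "1s"}}

body = bulk_body("products-v2", [
    ("p1", {"name": "fox plush"}),
    ("p2", {"name": "bear mug"}),
])
```

For a full rebuild, the same body format is sent against a new index (products-v2) while queries continue to hit the alias on the old one; the alias swap at the end makes the cutover atomic.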
Search Quality Tuning
Search quality is measured by Precision (fraction of returned results that are relevant) and Recall (fraction of relevant results that are returned). For e-commerce: Normalized Discounted Cumulative Gain (NDCG) measures whether the most relevant results appear at the top of the list. Tuning: collect user click data (CTR by rank position), use Learning to Rank (LTR) to train a model that reranks results based on user signals, tune analyzer stopwords and synonyms based on zero-result queries, add query spelling correction (did you mean?), and implement query expansion (automatically add synonyms to the query). A/B test search algorithm changes using the same experimentation platform used for product changes.
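NDCG is simple to compute from graded relevance judgments. A minimal sketch, using the standard log2-discounted formulation (relevance grades and positions are illustrative):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each grade is discounted by its rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the ideal (descending-grade) ordering; 1.0 is perfect."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A result list that places grade-3 documents first scores 1.0; burying the only relevant document at rank 3 roughly halves the score, which is exactly the top-of-list sensitivity that makes NDCG the usual offline metric for LTR experiments.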