A full-text search engine finds documents matching a text query from a large corpus. Building search for a product catalog, a blog, or a code repository requires: tokenizing text into searchable terms, building an inverted index for fast lookup, ranking results by relevance, and handling complex queries (phrase matching, fuzzy search, facets). Elasticsearch and Solr are production search engines; understanding their internals is a common system design interview topic.
Inverted Index
An inverted index maps each term to the list of documents containing that term. Building the index:

(1) Tokenization: split text into tokens (“Apple’s iPhone” → [“apple”, “iphone”]).
(2) Normalization: lowercase, remove punctuation, apply stemming (“running” → “run”) or lemmatization.
(3) Stop word removal: filter common words (“the”, “is”, “and”) that appear in almost every document and are not useful for ranking.
(4) Index storage: for each term, store a posting list, a sorted list of (document_id, term_frequency, positions[]). Positions enable phrase queries (“new york” must have “york” immediately after “new”).

Index structure: an immutable segment file on disk (similar to an SSTable). Multiple segments are merged periodically (like LSM-tree compaction). Lucene (the engine behind Elasticsearch) uses this segment-based architecture.

Query execution: for a single-term query “iphone”, look up the posting list directly via the term dictionary (a hash map or FST, so lookup is effectively constant time per term). For the multi-term query “iphone case”, intersect the posting lists for “iphone” and “case” to find documents containing both terms. Posting list intersection: since lists are sorted by document_id, a two-pointer merge finds common entries in O(n + m) time.
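The build-and-intersect steps above can be sketched in Python. This is a minimal sketch: the analyzer, stop-word list, and sample documents are illustrative assumptions, and each posting keeps per-document positions so phrase queries would be possible.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "is", "and", "a", "of"}  # toy stop-word list

def analyze(text):
    """Tokenize and normalize: lowercase, strip punctuation, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_index(docs):
    """Map each term to a posting list of (doc_id, positions), sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                      # sorted ids -> sorted postings
        positions = defaultdict(list)
        for pos, term in enumerate(analyze(docs[doc_id])):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return index

def intersect(p1, p2):
    """Two-pointer merge of two doc_id-sorted posting lists: O(n + m)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i][0] == p2[j][0]:
            result.append(p1[i][0])
            i += 1
            j += 1
        elif p1[i][0] < p2[j][0]:
            i += 1
        else:
            j += 1
    return result

docs = {1: "iPhone case with stand", 2: "iPhone 15 review", 3: "laptop case"}
index = build_index(docs)
print(intersect(index["iphone"], index["case"]))  # → [1]
```

Only document 1 contains both terms, so the merge emits a single doc_id; a real engine would then score and rank the intersected set.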
Relevance Ranking with TF-IDF and BM25
Matching documents must be ranked by relevance: the most relevant result should appear first.

TF-IDF (Term Frequency-Inverse Document Frequency): score(term, doc) = TF(term, doc) * IDF(term). TF (term frequency): how many times the term appears in the document. A document mentioning “iphone” 10 times is more relevant than one mentioning it once; normalize by document length to avoid penalizing short documents. IDF (inverse document frequency): log(total_docs / docs_containing_term). Terms appearing in many documents (“phone”) are less discriminating than rare terms (“OLED”).

BM25 (Best Match 25): an improved TF-IDF that saturates TF at high frequencies (the 100th occurrence of a term matters much less than the 1st) and handles document-length normalization more accurately. Per query term:

score(term, doc) = IDF(term) * TF * (k1 + 1) / (TF + k1 * (1 - b + b * |doc| / avg_doc_len))

with typical parameters k1 ≈ 1.2 and b ≈ 0.75. BM25 is the default ranking function in Elasticsearch and Solr. The BM25 score is computed for each matching document; results are returned sorted by score descending.

Multi-field ranking: weight the title field more heavily than the body; a match in the title is more relevant than one in the description.
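A BM25 scorer is small enough to sketch directly. This is a minimal sketch: the +0.5 smoothing in the IDF follows the Lucene-style variant, and k1=1.2, b=0.75 are the commonly cited defaults.

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, total_docs,
               k1=1.2, b=0.75):
    """BM25: sum over query terms of IDF * saturated, length-normalized TF.

    doc_tf: {term: count} for this document; df: {term: docs containing term}.
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        # Lucene-style smoothed IDF; rare terms get a larger weight.
        idf = math.log(1 + (total_docs - df[term] + 0.5) / (df[term] + 0.5))
        # TF saturates as tf grows; b controls the document-length penalty.
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score

df = {"iphone": 2}
# Same term frequency, different lengths: the shorter document scores higher.
short_doc = bm25_score(["iphone"], {"iphone": 3}, 20, 50, df, total_docs=100)
long_doc = bm25_score(["iphone"], {"iphone": 3}, 200, 50, df, total_docs=100)
```

The length normalization in the denominator is what distinguishes BM25 from raw TF-IDF: the same three occurrences count for more in a 20-term document than in a 200-term one.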
Query Types and Features
Phrase queries: “new york” must match documents where “york” immediately follows “new”. Requires position data in the posting list; slower than single-term queries.

Fuzzy matching: “iphon” should match “iphone” (typo tolerance). Computed via edit distance (Levenshtein). In Elasticsearch: a fuzzy query with fuzziness=AUTO.

Autocomplete / prefix queries: “ipho*” matches all terms starting with “ipho”. Implemented with an edge n-gram tokenizer (indexing every prefix of each term) or a prefix tree.

Faceted search (aggregations): count results by category, price range, or color alongside the results. Implemented by maintaining a separate field-indexed structure (doc_values in Elasticsearch: column-oriented storage per field). Facets are computed at query time by scanning the matching documents and aggregating by field value.

Highlighting: return the sentence containing the search term for the result snippet. Use the positions from the posting list to locate the term in the document text.
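Fuzzy matching and autocomplete both reduce to small, well-known routines; a sketch of each (function names are illustrative):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def edge_ngrams(term, min_len=2):
    """Index-time prefixes for autocomplete: a query like 'ipho' then
    matches an indexed prefix with an exact term lookup."""
    return [term[:i] for i in range(min_len, len(term) + 1)]

print(edit_distance("iphon", "iphone"))  # → 1 (one insertion away)
print(edge_ngrams("iphone"))             # → ['ip', 'iph', 'ipho', 'iphon', 'iphone']
```

The edge n-gram approach trades index size for query speed: prefixes are precomputed at index time, so autocomplete becomes an ordinary term lookup instead of a term-dictionary scan.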
Indexing Pipeline and Real-Time Search
Documents must be indexed before they can be searched. Indexing pipeline:

(1) Document ingested (new product added, article published).
(2) Tokenization and analysis: run the document through the analysis pipeline (tokenizer + filters).
(3) Update the in-memory write buffer (Elasticsearch calls this an in-memory segment).
(4) Refresh: periodically flush the write buffer to a new searchable segment (Elasticsearch default: every 1 second). Documents become searchable within about 1 second of indexing, hence near real-time (NRT) search.
(5) Merge: a background thread periodically merges small segments into larger ones (fewer segments = faster queries).

Bulk indexing (reindexing a large corpus): temporarily disable segment merges, increase the refresh interval (e.g. to 30 seconds), and use the bulk API to send batches of documents. Re-enable the defaults after indexing completes; cutting this per-document overhead speeds up bulk ingestion by roughly 3-4x.

Real-time search (millisecond freshness): some use cases (live chat, financial data) require sub-second index freshness. Use a write-through cache or in-memory search layer alongside the disk-based index.
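The buffer/refresh/merge cycle can be modeled with a toy class. This is an illustrative sketch, not Lucene's actual data structures: "segments" here are frozen tuples of raw documents and search is a plain substring scan.

```python
class NearRealTimeIndex:
    """Toy NRT index: writes land in a buffer and only become
    searchable when refresh() turns the buffer into an immutable segment."""

    def __init__(self):
        self.buffer = []    # in-memory write buffer: fast, not yet searchable
        self.segments = []  # immutable searchable segments

    def index(self, doc):
        self.buffer.append(doc)          # invisible to searches until refresh

    def refresh(self):
        # Called periodically (Elasticsearch default: every 1 second).
        if self.buffer:
            self.segments.append(tuple(self.buffer))  # new immutable segment
            self.buffer = []

    def merge(self):
        # Background merge: collapse small segments into one larger segment.
        self.segments = [tuple(d for seg in self.segments for d in seg)]

    def search(self, term):
        # Scans every segment: fewer segments -> faster queries.
        return [d for seg in self.segments for d in seg if term in d]

idx = NearRealTimeIndex()
idx.index("iphone case")
before = idx.search("iphone")   # [] — not yet refreshed
idx.refresh()
after = idx.search("iphone")    # ["iphone case"]
```

The gap between `before` and `after` is exactly the refresh interval in a real engine, which is why lowering the interval buys freshness at the cost of creating many small segments.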
Distributed Search Architecture
A single machine can’t index and search a corpus of billions of documents.

Sharding: split the index across N primary shards. Each document is routed to a shard by hash(document_id) % num_shards. A search query fans out to all shards, each returns its top-K results, and the coordinating node merges and re-ranks the combined results.

Replicas: each primary shard has R replica shards (Elasticsearch default: 1 replica). Replicas serve read (search) requests, so read throughput scales by adding replicas. Writes go to the primary, then replicate to the replicas.

Query coordination: the coordinating node receives the search request, fans out to one copy of each shard (primary or replica, chosen round-robin), collects the top-K per shard, merges them (a global sort by score), fetches the full documents for the top results, and returns them.

The distributed BM25 problem: IDF depends on corpus statistics (total docs, docs containing the term), which in distributed mode are computed per shard, not globally. IDF may be slightly inaccurate across shards unless a global IDF is used (expensive to compute). In practice, with large corpora, the per-shard approximation is accurate enough.
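The scatter-gather flow can be sketched as follows. Term counts stand in for BM25 scores, and the shard layout and sample data are illustrative assumptions.

```python
import heapq

def shard_search(shard_docs, query, k):
    """One shard's local top-k: score = query-term occurrences
    (a stand-in for the real per-shard BM25 computation)."""
    hits = [(doc.split().count(query), doc_id)
            for doc_id, doc in shard_docs.items()]
    return heapq.nlargest(k, [h for h in hits if h[0] > 0])

def coordinate(shards, query, k):
    """Coordinating node: fan out to every shard, collect each shard's
    top-k (score, doc_id) hits, then merge into a global top-k by score."""
    all_hits = [hit for shard in shards for hit in shard_search(shard, query, k)]
    return heapq.nlargest(k, all_hits)

shards = [
    {1: "iphone case", 4: "iphone iphone review"},  # shard 0
    {2: "laptop case", 5: "iphone stand"},          # shard 1
]
print(coordinate(shards, "iphone", 2))  # → [(2, 4), (1, 5)]
```

Note that each shard must return a full k hits, not k / num_shards: in the worst case the entire global top-k lives on one shard, which is also why deep pagination (large from + size) is expensive in scatter-gather search.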