What Is a Full-Text Search Engine?
A full-text search engine indexes large collections of text documents and retrieves the most relevant ones for a given query in milliseconds. The core challenge is that naive linear scan (grep every document) doesn’t scale past thousands of documents. The solution is to pre-build an inverted index at write time so reads are fast lookups rather than full scans. Systems like Lucene, Elasticsearch, Solr, and Typesense are built on this foundation.
Inverted Index Structure
An inverted index maps each unique term to a posting list: the list of documents containing that term. Each entry in the posting list typically stores the document ID, the term frequency within that document, and optionally the positions (token or character offsets) where the term appears. For example:
"database" → [(doc3, freq=4, positions=[12,87,204,391]), (doc7, freq=1, positions=[55]), ...]
At query time, looking up a term is a direct map lookup (O(1) or O(log n)), returning a posting list that can be intersected, unioned, or ranked. The inverted index is the reason search engines return results in milliseconds over billions of documents.
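As a sketch, this structure can be built in a few lines of Python (the whitespace tokenizer and all names here are illustrative, not any engine's API):

```python
from collections import defaultdict

def build_index(docs):
    # term -> posting list of (doc_id, term frequency, token positions)
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, plist in positions.items():
            index[term].append((doc_id, len(plist), plist))
    return index

docs = [
    "the database stores rows",
    "a graph database stores edges",
    "cache invalidation is hard",
]
index = build_index(docs)
print(index["database"])  # [(0, 1, [1]), (1, 1, [2])]
```

A query for "database" is now a single dictionary lookup rather than a scan over every document.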
Tokenization and Normalization
Tokenization splits raw text into terms (tokens). The simplest tokenizer splits on whitespace and punctuation. More sophisticated analyzers handle contractions (don’t → do, n’t), hyphenated words, URLs, and language-specific rules. The output is a stream of tokens that will be stored in the inverted index.
Normalization ensures that Database, DATABASE, and database all map to the same term. This includes lowercasing and Unicode normalization (NFC/NFD). Without normalization, queries would be case-sensitive and miss obvious matches. Normalization is applied identically at index time and query time — both the stored terms and the query terms go through the same analysis pipeline.
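A minimal analysis pipeline, applied identically at index and query time, might look like this (the regex tokenizer is a toy stand-in for a real analyzer):

```python
import re
import unicodedata

def analyze(text):
    # Same steps at index time and query time:
    # Unicode NFC normalization, lowercasing, split on non-word runs.
    text = unicodedata.normalize("NFC", text)
    return [t for t in re.split(r"[^\w]+", text.lower()) if t]

print(analyze("Database, DATABASE, and database!"))
# ['database', 'database', 'and', 'database']
```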
Stemming and Stop Words
Stemming reduces inflected or derived words to a common base form. The Porter stemmer (1980) and its successor Snowball handle English: running and runs both reduce to run. This increases recall — a query for "run" matches documents containing "running." The tradeoff is occasional over-stemming (university and universe both stem to univers) and under-stemming (irregular forms like ran are left untouched). Lemmatization is more linguistically accurate (it uses dictionary lookups, so ran maps to run) but slower.
Stop words are extremely common words (a, the, is, of) that appear in nearly every document and add little discriminative signal. Removing them at index time reduces index size and speeds up posting list intersections. However, some use cases require stop words — phrase queries like "to be or not to be" break without them — so modern engines often keep stop words but downweight them rather than remove them entirely.
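A toy illustration of both steps (the suffix stripper below is NOT the Porter algorithm, just a sketch of the idea, and the stop list is abbreviated):

```python
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "or", "not"}

def strip_suffix(term):
    # Crude suffix stripping: real stemmers like Porter also handle
    # doubled consonants ("runn" -> "run") and measure conditions.
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def filter_terms(tokens, keep_stop_words=False):
    out = []
    for t in tokens:
        if not keep_stop_words and t in STOP_WORDS:
            continue  # dropped at index time to shrink posting lists
        out.append(strip_suffix(t))
    return out

print(filter_terms(["the", "runner", "is", "running"]))  # ['runner', 'runn']
```

The keep_stop_words flag mirrors the design choice in the paragraph above: engines that support phrase queries keep stop words in the index and downweight them at ranking time instead.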
TF-IDF Scoring
TF-IDF (Term Frequency–Inverse Document Frequency) is the classic relevance scoring formula. TF measures how often the query term appears in a document (more occurrences = more relevant). IDF measures how rare the term is across the entire corpus — rare terms are more discriminative. A term appearing in every document gets IDF ≈ 0 (it’s useless for ranking); a term appearing in only one document gets high IDF.
The combined TF-IDF score for a term in a document: score = TF(t,d) × log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t. The final score for a multi-term query is the sum of TF-IDF scores across all query terms. Documents are ranked by descending score.
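The formula translates directly to code. A self-contained sketch over a toy corpus (all names illustrative):

```python
import math
from collections import Counter

docs = [
    "the database stores rows".split(),
    "a graph database stores edges".split(),
    "cache invalidation is hard".split(),
]
N = len(docs)
# df(t): number of documents containing term t.
df = Counter(t for d in docs for t in set(d))

def tf_idf(query, doc):
    # score = sum over query terms of TF(t,d) * log(N / df(t))
    tf = Counter(doc)
    return sum(tf[t] * math.log(N / df[t]) for t in query if df.get(t))

for i, d in enumerate(docs):
    print(i, round(tf_idf(["database", "edges"], d), 3))
```

Document 1 wins: it contains both terms, and "edges" appears in only one document, so its IDF weight dominates.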
BM25: The Current Standard
Okapi BM25 (Best Match 25) improved on TF-IDF in two important ways. First, it adds TF saturation: under raw TF-IDF, a document containing a term 100 times scores 10x higher than one containing it 10 times. BM25 instead applies a saturation curve controlled by the parameter k1 (typically 1.2–2.0), so additional occurrences matter less and less. Second, it adds document length normalization via the parameter b (typically 0.75): a term appearing 5 times in a short document is a stronger signal than the same 5 occurrences in a long document.
BM25 formula: score(t,d) = IDF(t) × (TF(t,d) × (k1+1)) / (TF(t,d) + k1 × (1 - b + b × |d|/avgdl)). BM25 is now the default ranking function in Elasticsearch (since version 5), Lucene, and most modern search engines. It consistently outperforms plain TF-IDF on standard IR benchmarks.
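A direct implementation of the formula over a toy corpus (the smoothed IDF variant below is one common choice, used by Lucene; the classic Robertson IDF can go negative for very common terms):

```python
import math
from collections import Counter

docs = [
    "the database stores rows".split(),
    "a graph database stores edges".split(),
    "cache invalidation is hard".split(),
]

def bm25(query, doc, k1=1.2, b=0.75):
    # score(t,d) = IDF(t) * TF(t,d)*(k1+1) / (TF(t,d) + k1*(1 - b + b*|d|/avgdl))
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for t in query:
        dft = sum(1 for d in docs if t in d)
        if dft == 0:
            continue
        idf = math.log(1 + (N - dft + 0.5) / (dft + 0.5))  # smoothed, non-negative
        tf_part = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * tf_part
    return score

print(round(bm25(["database"], docs[1]), 3))
```

Raising k1 makes repeated occurrences count for more before saturating; setting b to 0 disables length normalization entirely.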
Phrase Queries and Positional Index
A phrase query — "machine learning" — requires that both terms appear adjacent in the correct order, not just anywhere in the document. This requires storing positions in the posting list: the character or token offset of each term occurrence. At query time, after finding documents containing both "machine" and "learning," the engine checks whether any occurrence of "machine" is immediately followed by "learning."
Proximity queries generalize this: find documents where two terms appear within N tokens of each other. Positional data significantly increases index size (often 3–5x vs. no positions) but is required for phrase search. Lucene stores positions as delta-encoded integers within the posting list, keeping the overhead manageable.
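Given positional posting lists, the adjacency check reduces to a set lookup. A sketch:

```python
def phrase_match(positions_a, positions_b):
    # True if some occurrence of term A is immediately followed by term B.
    # Inputs are token offsets of each term within one document.
    b_set = set(positions_b)
    return any(p + 1 in b_set for p in positions_a)

# "machine" at token offsets [4, 19], "learning" at [5, 40]: adjacent at 4/5.
print(phrase_match([4, 19], [5, 40]))  # True
print(phrase_match([4], [6]))          # False
```

The proximity generalization replaces `p + 1 in b_set` with a window check (any occurrence of B within N tokens of an occurrence of A).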
Inverted Index Compression
Posting lists for common terms can contain millions of document IDs. Without compression, the index would be enormous. Two key techniques keep it manageable:
Delta encoding of doc IDs: posting lists are sorted by doc ID. Instead of storing raw doc IDs, store differences between consecutive IDs. Since IDs increase monotonically, deltas are small positive integers. A list like [3, 17, 32, 33, 50] becomes [3, 14, 15, 1, 17]. Small integers compress far better than large ones.
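Both directions are a few lines each:

```python
def delta_encode(doc_ids):
    # Store the first ID, then the gap between each consecutive pair.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(deltas):
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

print(delta_encode([3, 17, 32, 33, 50]))  # [3, 14, 15, 1, 17]
```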
Variable-byte encoding: use fewer bytes for small integers and more for large ones. Numbers < 128 fit in 1 byte; numbers < 16384 fit in 2 bytes. This achieves good compression without the decoding overhead of bit-packing schemes. Lucene also uses PForDelta (PFOR) and Simple9/Simple16 for block-level integer compression on large posting lists.
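A sketch of the byte-level scheme (this variant packs 7 payload bits per byte and marks the final byte with the high bit; real implementations differ in which byte carries the flag):

```python
def vbyte_encode(n):
    # Emit 7 bits per byte, low-order first; high bit marks the last byte.
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    n, shift = 0, 0
    for byte in data:
        if byte & 0x80:                       # terminator byte
            return n | ((byte & 0x7F) << shift)
        n |= byte << shift
        shift += 7

print(len(vbyte_encode(100)), len(vbyte_encode(20000)))  # 1 3
```

Applied after delta encoding, most gaps in a dense posting list fit in a single byte.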
Segment-Based Architecture (Lucene)
Lucene organizes its index as a collection of immutable segments. When documents are indexed, they first go into a small in-memory buffer. Periodically (or when the buffer is full), the buffer is flushed to disk as a new segment — a complete, self-contained mini-index. Each segment has its own inverted index, stored fields, doc values, and term dictionary.
Segments are immutable: once written, they are never modified. Deletes are handled with a deletion bitset — a bitmap marking deleted doc IDs — applied at query time to filter out results. The trade-off: many small segments slow down queries (must search all segments and merge results). A background merge process continuously merges smaller segments into larger ones, following a tiered merge policy. Merging also physically removes deleted documents and rewrites the index compactly.
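The buffer/flush/delete lifecycle can be sketched in a few lines (a hypothetical toy, not Lucene's actual classes):

```python
class SegmentedIndex:
    def __init__(self, buffer_limit=2):
        self.buffer, self.segments, self.deleted = [], [], set()
        self.buffer_limit = buffer_limit
        self.next_id = 0

    def add(self, doc):
        self.buffer.append((self.next_id, doc))
        self.next_id += 1
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        # A flushed segment is immutable: stored as a tuple, never rewritten.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def delete(self, doc_id):
        # Deletes only mark the doc in a set; segment data is untouched.
        self.deleted.add(doc_id)

    def search(self, term):
        # Query every segment plus the live buffer; filter deletes at read time.
        return [doc_id
                for seg in self.segments + [tuple(self.buffer)]
                for doc_id, doc in seg
                if term in doc.split() and doc_id not in self.deleted]

idx = SegmentedIndex()
idx.add("rust is fast")
idx.add("go is fast")   # buffer full: flushed as an immutable segment
idx.delete(0)
print(idx.search("fast"))  # [1]
```

A merge step would concatenate segments while skipping IDs in the deleted set, which is exactly how merging reclaims the space of deleted documents.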
Distributed Search: Elasticsearch
Elasticsearch distributes a Lucene index across multiple nodes using shards. An index is divided into N primary shards (chosen at index creation and not changeable without reindexing). Each primary shard is a full Lucene index. Replicas are copies of primary shards kept on different nodes; writes go through the primary and are replicated, while replicas serve reads, providing both fault tolerance and read scaling.
A query uses scatter-gather: the coordinating node fans out the query to one copy of each shard, each shard returns its top-K results with scores, the coordinator merges all results globally, re-ranks, and returns the final top-K. Shard routing for document writes uses hash(doc_id) % num_shards by default, distributing documents evenly. Custom routing (e.g., by tenant ID) can co-locate related documents on the same shard for query efficiency. Relevance tuning options include field-level boosts (title matches outweigh body matches), function scores (boost by recency or popularity), and query-time synonym expansion.
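Both the routing rule and the coordinator's merge step are easy to sketch (the hash choice and names are illustrative; Elasticsearch actually hashes a routing value, which defaults to the document ID):

```python
import hashlib
import heapq

def route(doc_id, num_shards):
    # Deterministic hash routing: the same ID always lands on the same shard.
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_shards

def scatter_gather(shard_results, k):
    # Each shard returns its local top hits as (score, doc_id) pairs;
    # the coordinator merges the sorted lists and keeps the global top-k.
    streams = [sorted(r, reverse=True) for r in shard_results]
    return list(heapq.merge(*streams, reverse=True))[:k]

shards = [[(0.9, "d1"), (0.4, "d5")], [(0.8, "d2")], [(0.95, "d7")]]
print(scatter_gather(shards, 2))  # [(0.95, 'd7'), (0.9, 'd1')]
```

Note the routing function is why the shard count is fixed at creation: changing num_shards changes where every existing document hashes to.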