Document Search Service Low-Level Design: Inverted Index, Relevance Scoring, and Field Boosting

The Inverted Index

Full-text search is powered by an inverted index — a data structure that maps each unique term to the list of documents containing it. This is the inverse of a forward index (which maps documents to their terms).

Term "database" → [(doc_id=5, freq=3, positions=[12,47,89]),
                   (doc_id=12, freq=1, positions=[4]),
                   (doc_id=34, freq=7, positions=[1,2,...])]

Each entry in the postings list stores the document ID, term frequency within that document, and optionally the positions of each occurrence (required for phrase queries and highlighting). The inverted index is what enables searching millions of documents in milliseconds — instead of scanning all documents, you look up the term and get the exact document list.
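A minimal in-memory model of this structure (the document IDs and texts are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, freq, positions)."""
    index = defaultdict(dict)  # term -> {doc_id: [positions]}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    # Convert to postings sorted by doc_id: [(doc_id, freq, positions), ...]
    return {t: sorted((d, len(p), p) for d, p in by_doc.items())
            for t, by_doc in index.items()}

docs = {5: "relational database systems", 12: "graph database"}
index = build_inverted_index(docs)
# index["database"] -> [(5, 1, [1]), (12, 1, [1])]
```

A query then becomes a dictionary lookup plus a postings-list merge, rather than a scan over every document.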

Index Construction Pipeline

Building the inverted index from raw text goes through several stages:

  1. Tokenization: split text into tokens on whitespace and punctuation. “Hello, world!” → [“Hello”, “world”].
  2. Lowercasing: normalize to lowercase for case-insensitive matching.
  3. Stopword removal: discard high-frequency, low-signal words (“the”, “a”, “is”). Reduces index size and improves ranking signal-to-noise.
  4. Stemming/lemmatization: reduce words to their root form. “running” → “run”, “databases” → “databas” (stems need not be dictionary words). Enables matching “runs” when searching for “run”.
  5. Postings list construction: for each token, record the document ID and increment the term frequency counter. Sort postings by document ID for merge efficiency.
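The stages above can be sketched end-to-end; the stopword list and the suffix-stripping stemmer here are toy stand-ins for real analyzers such as Porter:

```python
import re

STOPWORDS = {"the", "a", "is", "of", "and"}

def analyze(text):
    """Toy analysis chain: tokenize -> lowercase -> stopwords -> crude stem."""
    tokens = re.findall(r"[a-zA-Z]+", text)             # 1. tokenization
    tokens = [t.lower() for t in tokens]                # 2. lowercasing
    tokens = [t for t in tokens if t not in STOPWORDS]  # 3. stopword removal

    def stem(t):                                        # 4. crude suffix stripping
        for suf in ("ing", "es", "s"):                  #    (stand-in for Porter)
            if t.endswith(suf) and len(t) > len(suf) + 2:
                return t[: -len(suf)]
        return t

    return [stem(t) for t in tokens]

analyze("The indexes of databases")   # → ["index", "databas"]
```

The analyzed token stream then feeds stage 5, where each (token, doc_id, position) tuple is folded into the postings lists.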

BM25 Relevance Scoring

Elasticsearch and Lucene default to BM25 (Best Match 25), a TF-IDF variant that addresses two weaknesses of raw TF-IDF: term frequency saturation and document length normalization.

BM25(q,d) = Σ IDF(t) * [TF(t,d) * (k1+1)] / [TF(t,d) + k1*(1 - b + b*|d|/avgdl)]
  • IDF (Inverse Document Frequency): terms appearing in few documents score higher — rare terms are more discriminating.
  • TF saturation (k1 parameter, default 1.2): each additional occurrence of a term adds diminishing returns. A document with the term 100 times is not 100x more relevant than one with 1 occurrence.
  • Length normalization (b parameter, default 0.75): documents longer than the corpus average are penalized. A term appearing once in a short document is more significant than the same term appearing once in a book.
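Both effects can be exercised directly; the smoothed IDF used here is the Lucene-style variant, one of several in use:

```python
import math

def bm25_score(tf, df, N, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 contribution of a single term to a single document's score."""
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # Lucene-style smoothed IDF
    norm = k1 * (1 - b + b * doc_len / avgdl)        # length normalization
    return idf * tf * (k1 + 1) / (tf + norm)

# Saturation: 100 occurrences do not score 100x higher than 1 occurrence.
s1 = bm25_score(tf=1, df=10, N=1000, doc_len=100, avgdl=120)
s100 = bm25_score(tf=100, df=10, N=1000, doc_len=100, avgdl=120)
```

With these inputs the tf=100 score is only about twice the tf=1 score, illustrating the diminishing returns that k1 enforces.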

Field Boosting

Not all fields are equally important. A term match in the title should score higher than the same match in the body text, which should score higher than a match in a comment or tag. Configure field-level boost multipliers:

title^3 body^1 tags^2 author^1.5

In Elasticsearch, apply per-field boost in the multi_match query:

{"multi_match": {"query": "user input", "fields": ["title^3", "tags^2", "body^1"]}}

Tune boost values empirically using click-through rate data — fields where clicks correlate with query match should receive higher boosts.
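As a rough sketch, a boost weights each field's match score before the best field wins, analogous to multi_match's default best_fields mode (the scores and boost values below are illustrative):

```python
def boosted_score(field_scores, boosts):
    """Combine per-field match scores as boost-weighted values,
    keeping the best field (analogous to multi_match best_fields)."""
    return max(field_scores.get(f, 0.0) * w for f, w in boosts.items())

boosts = {"title": 3.0, "tags": 2.0, "body": 1.0}
# The same raw match score counts for more in the title:
boosted_score({"title": 1.0}, boosts)   # 3.0
boosted_score({"body": 1.0}, boosts)    # 1.0
```

Other modes (most_fields, cross_fields) combine the per-field scores differently, but the boost multiplier plays the same role.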

Query Types

  • Term query: exact match on a single term (after analysis). Used for structured fields (status, category).
  • Match query: analyzes input text and performs term queries on the result. Standard for full-text fields.
  • Phrase query: requires terms to appear adjacent and in order. Requires position data in postings. “machine learning” must appear as a phrase, not individual terms.
  • Boolean query: combine queries with AND/OR/NOT (must/should/must_not in Elasticsearch).
  • Fuzzy query: matches terms within edit distance N of the query term. Handles typos: “databse” matches “database” with edit distance 1. Expensive — limit to short queries.
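Fuzzy matching rests on edit distance; a minimal Levenshtein implementation and term filter (the candidate vocabulary is illustrative):

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(term, vocabulary, max_edits=1):
    return [v for v in vocabulary if edit_distance(term, v) <= max_edits]

fuzzy_match("databse", ["database", "databases", "dataset"])  # ["database"]
```

Real engines avoid scanning the whole vocabulary like this; Lucene compiles the query term into a Levenshtein automaton and intersects it with the term dictionary.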

Query Parsing

User input (free text, possibly with operators like AND, OR, quotes for phrases) must be parsed into a query AST before execution. A query parser handles:

  • Quoted strings → phrase queries
  • Explicit AND/OR → boolean operators
  • Minus prefix → NOT / must_not
  • Bare terms → OR by default (should clauses)
  • Unknown input → sanitize, escape special characters (Lucene special chars: + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /)
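A deliberately minimal parser covering the rules above (real parsers build a full AST with operator precedence; the clause names mirror Elasticsearch's bool query):

```python
import re

def parse_query(q):
    """Toy parser: quoted strings -> phrases, '-term' -> must_not,
    explicit AND -> must, bare terms -> should (OR by default)."""
    ast = {"must": [], "should": [], "must_not": [], "phrases": []}
    ast["phrases"] = re.findall(r'"([^"]+)"', q)   # quoted -> phrase queries
    q = re.sub(r'"[^"]+"', " ", q)                 # strip phrases from input
    tokens = q.split()
    required = "AND" in tokens                     # AND makes bare terms required
    for t in tokens:
        if t in ("AND", "OR"):
            continue
        if t.startswith("-"):
            ast["must_not"].append(t[1:])
        else:
            (ast["must"] if required else ast["should"]).append(t)
    return ast

parse_query('"machine learning" tutorial -video')
# {"must": [], "should": ["tutorial"], "must_not": ["video"],
#  "phrases": ["machine learning"]}
```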

Highlighting

Return matched terms in context to show users why a result was returned:

"...the quick brown fox jumped over the..."

Elasticsearch supports three highlighters: unified (the default; BM25-based fragment scoring, best quality), plain (re-analyzes the stored text on the fly; simple, suited to small fields), and fvh (fast vector highlighter; requires term vectors stored at index time, best for large fields). Highlighting requires stored field values or term vectors — configure at index time.
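A naive highlighter showing the core idea — locate a matched term and return a surrounding fragment (window size and tags are arbitrary choices):

```python
def highlight(text, terms, window=4, tag=("<em>", "</em>")):
    """Find the first matched term and return a fragment of +/- `window`
    words with the match wrapped in the given tags."""
    words = text.split()
    for i, w in enumerate(words):
        if w.lower().strip(".,!?") in terms:
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            words[i] = tag[0] + words[i] + tag[1]
            prefix = "..." if lo > 0 else ""
            suffix = "..." if hi < len(words) else ""
            return prefix + " ".join(words[lo:hi]) + suffix
    return None

highlight("the quick brown fox jumped over the lazy dog", {"fox"}, window=2)
# "...quick brown <em>fox</em> jumped over..."
```

Production highlighters instead score candidate fragments by term density and use stored offsets to avoid re-tokenizing large fields.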

Near-Real-Time Indexing

Lucene writes new documents to an in-memory buffer. A refresh operation (default every 1 second in Elasticsearch) flushes the buffer to a new in-memory segment, making documents searchable. This is “near-real-time” — documents are visible within ~1 second of indexing, not immediately.

Durability is handled separately: each operation is also appended to a transaction log (translog), which by default is fsynced before the write is acknowledged. A full Lucene commit (flush) persists segments to disk periodically or when the translog exceeds a size threshold. Background merging combines small segments into larger ones, improving read performance and reclaiming space from deleted documents.
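The visibility gap between indexing and refresh can be modeled in a few lines (a toy model that ignores disk flushes and merging):

```python
class NRTIndex:
    """Toy model of near-real-time visibility: writes land in a buffer
    and only become searchable after refresh() seals them into a segment."""
    def __init__(self):
        self.buffer = []    # newly indexed docs, not yet searchable
        self.segments = []  # searchable, immutable segments

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        if self.buffer:
            self.segments.append(self.buffer)  # seal buffer into a segment
            self.buffer = []

    def search(self, term):
        return [d for seg in self.segments for d in seg if term in d]

idx = NRTIndex()
idx.index("hello world")
before = idx.search("hello")   # [] -- not yet visible
idx.refresh()
after = idx.search("hello")    # ["hello world"]
```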

Synonyms and Query Expansion

Users search for “sofa” but documents say “couch.” Without synonyms, no results. Configure a synonym filter in the index analysis chain: sofa, couch, settee → sofa. At query time, expand the query term to all synonyms. Maintain the synonym list in a file that can be updated without a full reindex (Elasticsearch can reload it via the _reload_search_analyzers API).
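Query-time expansion can be sketched as follows (the synonym map is illustrative; real deployments load it from a synonym file):

```python
SYNONYMS = {"sofa": {"sofa", "couch", "settee"},
            "couch": {"sofa", "couch", "settee"},
            "settee": {"sofa", "couch", "settee"}}

def expand(terms):
    """Query-time synonym expansion: each term becomes an OR over its group."""
    return [sorted(SYNONYMS.get(t, {t})) for t in terms]

expand(["red", "sofa"])   # [["red"], ["couch", "settee", "sofa"]]
```

Each inner list is then executed as a should clause, so any synonym in the group matches.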

Search Analytics

Log all queries, result sets, and user clicks. Use this data to: identify zero-result queries (improve synonym coverage), identify queries with high bounce rates (relevance problems), compute click-through rate per result position (optimize ranking), and train learning-to-rank models to reorder results based on historical engagement.
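For example, click-through rate per result position falls out of a simple aggregation over logged events (the event shape here is an assumption about the log format):

```python
from collections import defaultdict

def ctr_by_position(events):
    """events: (query_id, position, clicked) tuples -> CTR per result slot."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for _, pos, was_clicked in events:
        shown[pos] += 1
        clicked[pos] += int(was_clicked)
    return {pos: clicked[pos] / shown[pos] for pos in shown}

events = [("q1", 1, True), ("q1", 2, False),
          ("q2", 1, True), ("q2", 2, True)]
ctr_by_position(events)   # {1: 1.0, 2: 0.5}
```

Comparing observed CTR against the expected positional decay highlights results that under- or over-perform for their slot, which is the core signal for learning-to-rank training data.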

Summary

  • Inverted index maps terms to postings lists (doc_id, frequency, positions).
  • Construction pipeline: tokenize → lowercase → stopword removal → stemming → build postings.
  • BM25 scoring with k1=1.2, b=0.75 for TF saturation and length normalization.
  • Field boosting: title^3 > tags^2 > body^1 — tune via click-through data.
  • Query parser handles phrases, boolean operators, fuzzy matching, and special character escaping.
  • Near-real-time: 1-second refresh cycle; translog for durability with periodic Lucene commits; background segment merging.
  • Synonyms configured in analysis chain; search query logs drive continuous relevance improvement.

