What Is an Inverted Index?
An inverted index maps each term (word or token) to the list of documents that contain it. It is the data structure behind every full-text search engine — from Elasticsearch to PostgreSQL's tsvector. “Inverted” refers to the inversion of the document-to-terms direction: instead of asking “what terms does document D contain?”, the index answers “which documents contain term T?”
The two core components are the term dictionary (a sorted array or hash map mapping each term to a pointer into the posting list storage) and the posting lists themselves (the actual per-term document lists on disk).
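As a toy sketch (hypothetical two-document corpus; a plain Python dict stands in for the term dictionary, with in-memory lists as posting lists):

```python
# Toy inverted index: the term dictionary maps each term to its posting list.
docs = {
    "d1": "the database stores data",
    "d2": "the index speeds up the database",
}

index: dict[str, list[str]] = {}
for doc_id, text in docs.items():
    for term in set(text.split()):          # naive whitespace tokenization
        index.setdefault(term, []).append(doc_id)

# "Which documents contain term T?" is now a single dictionary lookup.
print(sorted(index["database"]))  # ['d1', 'd2']
```

A real engine keeps the dictionary sorted (or hashed) and the posting lists compressed on disk, but the lookup direction is exactly this.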
Tokenization Pipeline
Before text can be indexed, it must be transformed into a normalized sequence of terms through a pipeline:
- Lowercasing — fold case so “Database” and “database” map to the same term.
- Tokenization — split on whitespace and punctuation to produce individual word tokens.
- Stopword removal — discard high-frequency words (“the”, “a”, “is”) that carry no search signal and would produce enormous posting lists.
- Stemming or lemmatization — reduce words to their root form (“running” → “run”, “databases” → “databas”). Stemming uses rule-based truncation; lemmatization uses a lexicon for morphological correctness.
- Normalization — handle Unicode, diacritics, punctuation, hyphenation, and abbreviations.
The same pipeline must be applied at both index time and query time to ensure query terms match the indexed terms.
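The index-time/query-time symmetry can be sketched with a stripped-down analyzer (hypothetical `analyze` helper; stemming omitted and the stopword set abbreviated for brevity):

```python
import re

STOP = {"the", "a", "an", "is", "of"}  # abbreviated stopword set

def analyze(text: str) -> list[str]:
    """Lowercase -> tokenize -> drop stopwords. Same code path for docs and queries."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP]

# The SAME analyzer runs at index time and at query time,
# so differently-cased and differently-punctuated inputs meet in the middle:
assert analyze("The Database is FAST!") == analyze("the database fast")
```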
Posting List Structure
Each term's posting list is a sorted sequence of entries. A minimal entry contains just a document ID. A richer entry includes:
- doc_id — document identifier (sorted ascending for merge efficiency)
- term_frequency — how many times the term appears in the document
- position_list — the byte or word offsets of each occurrence (required for phrase queries)
Posting lists are delta-encoded: instead of storing raw doc_ids [100, 200, 350], store deltas [100, 100, 150]. Deltas are typically small and compress well. Delta-encoded lists are then compressed with VByte (variable-length byte encoding) or PForDelta (SIMD-friendly frame-of-reference encoding).
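A minimal sketch of one common VByte variant (hypothetical `vbyte_encode`/`vbyte_decode` helpers; real implementations differ in which bit marks continuation and in byte order):

```python
def vbyte_encode(numbers: list[int]) -> bytes:
    """Encode each integer in 7-bit groups; the high bit marks the final byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)   # low 7 bits, continuation implied
            n >>= 7
        out.append(n | 0x80)       # last byte of this integer
    return bytes(out)

def vbyte_decode(data: bytes) -> list[int]:
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:                              # final byte
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

doc_ids = [100, 200, 350]
deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = vbyte_encode(deltas)   # [100, 100, 150] -> 4 bytes instead of 12
assert vbyte_decode(encoded) == deltas
```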
TF-IDF Scoring
Term Frequency-Inverse Document Frequency (TF-IDF) assigns a relevance score to each (term, document) pair:
- TF(t, d) = freq(t, d) / total_terms(d) — how often the term appears in the document, normalized by document length
- IDF(t) = log(N / df(t)) — log of total documents N divided by the number of documents containing the term. Rare terms get higher IDF.
- Score = TF(t, d) × IDF(t), summed over all query terms
TF-IDF is simple but has known weaknesses: it does not account well for document length, and it has no notion of term saturation: a document mentioning a term 100 times scores 100× higher than one mentioning it once, which rarely reflects actual relevance.
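A minimal sketch of the formulas above (hypothetical `tfidf` helper over a toy corpus of pre-tokenized documents):

```python
import math

def tfidf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF as defined above: length-normalized TF times log(N / df)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [
    ["database", "index", "query"],
    ["database", "database", "storage"],
    ["network", "protocol", "packet"],
]
# "storage" appears in fewer documents than "database", so despite its
# lower frequency in doc 1, its higher IDF gives it the larger score.
print(tfidf("database", corpus[1], corpus))  # ~0.27
print(tfidf("storage", corpus[1], corpus))   # ~0.37
```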
BM25 Scoring
BM25 (Best Match 25) addresses TF-IDF's shortcomings with two parameters: k1 (term saturation, typically 1.2–2.0) and b (length normalization, typically 0.75):
BM25(t, d) = IDF(t) * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * dl / avgdl))
Where dl is the document length and avgdl is the average document length across the corpus. BM25 is the default ranking function in Elasticsearch, Lucene, and most modern search engines. It is empirically stronger than TF-IDF on information retrieval benchmarks.
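The saturation behavior is easy to see by plugging values into the TF part of the formula above (IDF factored out; hypothetical helper, with dl = avgdl so the length term cancels):

```python
def bm25_tf_component(tf: int, k1: float = 1.5, b: float = 0.75,
                      dl: float = 100, avgdl: float = 100) -> float:
    """The TF factor of the BM25 formula above, without the IDF multiplier."""
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))

# Term frequency saturates: 100 occurrences score about 2.5x one
# occurrence, not 100x as raw TF would.
for tf in (1, 2, 10, 100):
    print(tf, round(bm25_tf_component(tf), 2))
# 1 -> 1.0, 2 -> 1.43, 10 -> 2.17, 100 -> 2.46
```

As tf grows, the score approaches the ceiling k1 + 1, which is exactly what k1 controls.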
Phrase and Boolean Queries
Phrase query (“write ahead log”): retrieve posting lists for each term, intersect by doc_id, then verify that the terms appear at consecutive positions in the position_list. This requires position data in the posting list.
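The position-verification step can be sketched as follows (hypothetical `phrase_match` helper taking one position list per query term, already narrowed to a single matching document):

```python
def phrase_match(position_lists: list[list[int]]) -> bool:
    """True if the terms occur at consecutive positions, in query order."""
    first, rest = position_lists[0], position_lists[1:]
    return any(
        all(p + i + 1 in positions for i, positions in enumerate(rest))
        for p in first
    )

# Positions of "write", "ahead", "log" within one document:
assert phrase_match([[4, 17], [5], [6]])      # 4, 5, 6 are consecutive
assert not phrase_match([[4], [9], [6]])      # "ahead" is not at position 5
```

A production engine would walk the (sorted) position lists with pointers rather than membership tests, but the adjacency condition is the same.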
Boolean AND: merge two sorted posting lists using a two-pointer algorithm: advance whichever pointer currently sits on the smaller doc_id, and emit a doc_id only when both pointers agree. O(|A| + |B|) time.
Boolean OR: emit every doc_id from either list, deduplicating. O(|A| + |B|) time.
Boolean NOT: iterate the positive list and skip any doc_ids present in the negated list. Requires the negated list to be sorted.
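The two-pointer AND merge can be sketched as (hypothetical `intersect_and` helper over sorted doc_id lists):

```python
def intersect_and(a: list[int], b: list[int]) -> list[int]:
    """Two-pointer intersection of two sorted doc_id lists: O(|A| + |B|)."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])       # both pointers agree: emit
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                 # advance the pointer on the smaller doc_id
        else:
            j += 1
    return out

print(intersect_and([1, 3, 5, 9], [2, 3, 9, 12]))  # [3, 9]
```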
Near-Real-Time Index Updates
LSM-style inverted indexes use an in-memory segment (index buffer) that accepts new documents immediately. Queries search both the in-memory segment and on-disk segments simultaneously. Periodically the in-memory segment is flushed to disk as a new segment file. Multiple small segments are merged (compacted) into larger segments to keep query fanout manageable — the same principle as LSM compaction.
Deleted documents are handled with a tombstone bitmap or a delete list; the actual posting list entries are removed during segment merge. This is why deletions in Elasticsearch are not immediately reflected in segment-level statistics.
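A toy sketch of the segment lifecycle (hypothetical `NRTIndex` class; an in-memory dict stands in for both the buffer and the "on-disk" segments, and deletes/merging are omitted):

```python
class NRTIndex:
    """LSM-style index sketch: in-memory buffer plus flushed segments."""

    def __init__(self, flush_threshold: int = 2):
        self.buffer: dict[str, set[str]] = {}           # in-memory segment
        self.segments: list[dict[str, set[str]]] = []   # stand-ins for disk files
        self.flush_threshold = flush_threshold
        self._buffered_docs = 0

    def add(self, doc_id: str, text: str) -> None:
        for term in set(text.split()):
            self.buffer.setdefault(term, set()).add(doc_id)
        self._buffered_docs += 1
        if self._buffered_docs >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Seal the in-memory segment as a new 'on-disk' segment."""
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = {}
            self._buffered_docs = 0

    def search(self, term: str) -> set[str]:
        # Queries fan out across the buffer AND every flushed segment.
        hits = set(self.buffer.get(term, set()))
        for seg in self.segments:
            hits |= seg.get(term, set())
        return hits

idx = NRTIndex(flush_threshold=2)
idx.add("d1", "database index")
# d1 is searchable immediately, before any flush has happened:
print(idx.search("database"))  # {'d1'}
```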
SQL Data Model
-- Document registry
CREATE TABLE IndexedDocument (
    id BIGSERIAL PRIMARY KEY,
    doc_id VARCHAR(255) NOT NULL UNIQUE,
    tokenized_terms_count INT NOT NULL,
    indexed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Posting list entries
CREATE TABLE PostingList (
    term VARCHAR(255) NOT NULL,
    doc_id VARCHAR(255) NOT NULL,
    term_freq INT NOT NULL,
    positions JSONB, -- array of term positions within the document
    indexed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (term, doc_id)
);

-- Index segments (on-disk SSTable-like files)
CREATE TABLE IndexSegment (
    id BIGSERIAL PRIMARY KEY,
    doc_count INT NOT NULL,
    term_count INT NOT NULL,
    file_path VARCHAR(1024) NOT NULL,
    merged_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- The (term, doc_id) primary key already serves term-prefix lookups, so a
-- separate index on term alone would be redundant. Index doc_id to support
-- delete-by-document lookups.
CREATE INDEX idx_posting_doc ON PostingList(doc_id);
Python Implementation Sketch
import math
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "is", "in", "of", "and", "or", "to", "for"}


def tokenize(text: str) -> list[str]:
    """Lowercase, tokenize, remove stopwords, apply basic stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def stem(word: str) -> str:
    """Minimal suffix-stripping stemmer (illustrative; not a real Porter stemmer)."""
    for suffix in ("ing", "tion", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) > 2:
            return word[: -len(suffix)]
    return word


def build_posting_list(docs: dict[str, str]) -> dict[str, list[tuple[str, int, list[int]]]]:
    """Build an inverted index from {doc_id: text}.

    Returns {term: [(doc_id, freq, positions)]} with entries sorted by doc_id.
    """
    index: dict[str, dict[str, list[int]]] = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(tokenize(text)):
            index[token][doc_id].append(pos)
    posting: dict[str, list[tuple[str, int, list[int]]]] = {}
    for term, doc_map in index.items():
        posting[term] = sorted(
            (doc_id, len(positions), positions)
            for doc_id, positions in doc_map.items()
        )
    return posting


def bm25_score(query_terms: list[str], doc_id: str,
               posting: dict, stats: dict,
               k1: float = 1.5, b: float = 0.75) -> float:
    """Compute the BM25 score of doc_id for query_terms."""
    N = stats["total_docs"]
    avgdl = stats["avg_doc_length"]
    dl = stats.get("doc_lengths", {}).get(doc_id, avgdl)
    score = 0.0
    for term in query_terms:
        if term not in posting:
            continue
        df = len(posting[term])
        # Lucene-style smoothed IDF: stays positive even when df > N / 2.
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        # Linear scan for the entry; a real engine would binary-search or use skip lists.
        entry = next((e for e in posting[term] if e[0] == doc_id), None)
        if entry is None:
            continue
        tf = entry[1]
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score


def boolean_query(terms: list[str], operator: str,
                  posting: dict) -> list[str]:
    """AND/OR boolean query. Returns a sorted list of matching doc_ids."""
    if not terms:
        return []
    lists = [{d for d, _, _ in posting.get(t, [])} for t in terms]
    if operator == "AND":
        result = lists[0]
        for s in lists[1:]:
            result &= s
    else:  # OR
        result = set()
        for s in lists:
            result |= s
    return sorted(result)


def merge_segments(segment_ids: list[int], posting_lists: dict) -> dict:
    """Merge multiple index segments into one by combining posting lists."""
    merged: dict[str, list] = defaultdict(list)
    for seg_id in segment_ids:
        for term, entries in posting_lists.get(seg_id, {}).items():
            merged[term].extend(entries)
    # Sort and deduplicate by doc_id within each term (first segment wins).
    for term in merged:
        seen: dict[str, tuple[int, list[int]]] = {}
        for doc_id, freq, positions in merged[term]:
            if doc_id not in seen:
                seen[doc_id] = (freq, positions)
        merged[term] = sorted((d, f, p) for d, (f, p) in seen.items())
    return dict(merged)
Frequently Asked Questions
What is the difference between BM25 and TF-IDF?
BM25 improves on TF-IDF in two ways. First, term saturation: TF-IDF scores grow linearly with term frequency, so a document mentioning a term 100 times scores 100x higher than one mentioning it once. BM25 uses a saturation curve controlled by k1: scores increase with term frequency but plateau. Second, length normalization: TF-IDF's raw term frequency favors long documents. BM25 normalizes by document length relative to the corpus average, controlled by the parameter b. BM25 consistently outperforms TF-IDF on standard information retrieval benchmarks and is the default in Elasticsearch and Lucene.

How are posting lists compressed?
Posting lists are first delta-encoded: instead of storing raw doc_ids [100, 200, 350], store gaps [100, 100, 150]. Gaps are typically small and fit in fewer bits than the original IDs. The delta-encoded list is then compressed with VByte (variable-length byte encoding: small numbers use 1-2 bytes instead of 4-8) or PForDelta (frame-of-reference encoding that packs blocks of 128 integers using SIMD-friendly bit packing). Lucene's default is FOR (Frame Of Reference) over delta-encoded blocks. Compression ratios of 4-10x are common on natural-language text.

How does phrase query position matching work?
A phrase query like "write ahead log" requires the three terms to appear at adjacent positions within the same document. The engine retrieves the posting lists for all three terms, intersects them by doc_id, and then, for each matching doc_id, checks the position lists. For a two-term phrase (A B), position p_A must equal p_B - 1. For a three-term phrase (A B C), positions must satisfy p_A + 1 = p_B and p_B + 1 = p_C. This verification requires position data in the posting list, which increases index size by roughly 2-3x compared to storing only doc_id and frequency.

How does near-real-time indexing work?
Near-real-time (NRT) indexing maintains an in-memory segment that accepts new documents immediately. Queries search both the in-memory segment and all on-disk segments simultaneously. Periodically (every 1 second by default in Elasticsearch), the in-memory segment is flushed to a new disk segment, making the new documents searchable without a full commit. Multiple small segments are merged in the background to reduce query fanout. Deleted documents are tracked in a per-segment delete bitmap; their posting list entries are removed during segment merge. This architecture allows sub-second indexing latency without sacrificing query correctness.