Search Index System Low-Level Design: Inverted Index, BM25 Ranking, and Incremental Indexing Pipeline

A search index system converts raw documents into a structure that supports fast full-text queries. Where a relational database answers “give me rows where column equals value,” a search index answers “give me documents that contain these words, ranked by relevance.” This design covers the inverted index data structure, the indexing pipeline from raw document to searchable token, query execution with BM25 ranking, and the incremental update pattern for near-real-time search.

Core Data Model

-- Document store: the source of truth for searchable content
CREATE TABLE SearchDocument (
    doc_id         BIGSERIAL PRIMARY KEY,
    entity_type    VARCHAR(50) NOT NULL,    -- 'job_post', 'user_profile', 'article'
    entity_id      BIGINT NOT NULL,
    title          TEXT,
    body           TEXT,
    tags           TEXT[],
    author_id      BIGINT,
    published_at   TIMESTAMPTZ,
    boost          NUMERIC(4,2) NOT NULL DEFAULT 1.0,  -- manual relevance boost
    indexed_at     TIMESTAMPTZ,
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (entity_type, entity_id)
);

-- Inverted index: token → list of (doc_id, positions, field_weight)
CREATE TABLE InvertedIndex (
    token          VARCHAR(200) NOT NULL,
    doc_id         BIGINT NOT NULL REFERENCES SearchDocument(doc_id) ON DELETE CASCADE,
    field          VARCHAR(20) NOT NULL,   -- 'title', 'body', 'tags'
    frequency      INT NOT NULL,           -- term frequency in this doc+field
    positions      INT[] NOT NULL,         -- token positions (word offsets) for highlighting
    field_weight   NUMERIC(3,2) NOT NULL,  -- title=2.0, tags=1.5, body=1.0
    PRIMARY KEY (token, doc_id, field)
);

-- Index stats for BM25 (updated incrementally)
CREATE TABLE IndexStats (
    stat_key       VARCHAR(100) PRIMARY KEY,
    stat_value     NUMERIC(18,4) NOT NULL,
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- stat_key examples: 'total_docs', 'avg_body_length', 'df:python', 'df:engineer'

CREATE INDEX ON InvertedIndex(token, field_weight DESC);
CREATE INDEX ON SearchDocument(entity_type, published_at DESC);
CREATE INDEX ON SearchDocument(updated_at) WHERE indexed_at IS NULL OR indexed_at < updated_at;

Indexing Pipeline

import math
import re
import time
from typing import List, Tuple

STOP_WORDS = {
    'a','an','the','is','are','was','were','be','been','being',
    'have','has','had','do','does','did','will','would','could',
    'should','may','might','must','to','of','in','on','at','for',
    'with','by','from','as','into','through','and','or','but','not'
}

FIELD_WEIGHTS = {'title': 2.0, 'tags': 1.5, 'body': 1.0}

def tokenize(text: str) -> List[Tuple[str, int]]:
    """
    Returns [(token, position), ...].
    Lowercases, removes punctuation, strips stop words, applies simple stemming.
    """
    if not text:
        return []
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)  # replace punctuation with spaces
    tokens = []
    for pos, word in enumerate(text.split()):
        if word and word not in STOP_WORDS and len(word) > 1:
            token = _stem(word)
            tokens.append((token, pos))
    return tokens

def _stem(word: str) -> str:
    """Minimal suffix stripping (Porter-lite). Replace with NLTK/Snowball in production."""
    for suffix in ('ing', 'tion', 'tions', 'ness', 'ment', 'er', 'est', 'ly', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def index_document(doc_id: int, title: str, body: str, tags: list):
    """
    Build inverted index entries for one document.
    Runs in the indexing worker; idempotent (DELETE + INSERT).
    """
    fields = {
        'title': title or '',
        'body':  body or '',
        'tags':  ' '.join(tags or []),
    }

    # Capture the previous token set so document-frequency counts can be
    # adjusted rather than blindly incremented on every re-index
    old_tokens = {r['token'] for r in db.fetchall(
        "SELECT DISTINCT token FROM InvertedIndex WHERE doc_id=%s", (doc_id,))}

    # Remove stale index entries for this doc
    db.execute("DELETE FROM InvertedIndex WHERE doc_id=%s", (doc_id,))

    rows = []
    for field_name, text in fields.items():
        token_positions = tokenize(text)
        # Group positions by token
        token_pos_map: dict[str, list] = {}
        for token, pos in token_positions:
            token_pos_map.setdefault(token, []).append(pos)

        for token, positions in token_pos_map.items():
            rows.append((
                token, doc_id, field_name,
                len(positions),         # term frequency
                positions,              # positions array
                FIELD_WEIGHTS[field_name]
            ))

    if rows:
        db.executemany("""
            INSERT INTO InvertedIndex (token, doc_id, field, frequency, positions, field_weight)
            VALUES (%s, %s, %s, %s, %s, %s)
        """, rows)

    db.execute("""
        UPDATE SearchDocument SET indexed_at=NOW() WHERE doc_id=%s
    """, (doc_id,))
    _update_document_frequency(old_tokens, {row[0] for row in rows})

def _update_document_frequency(old_tokens: set, new_tokens: set):
    """Adjust df:token stats: increment tokens the doc gained, decrement
    tokens it lost. Re-indexing an unchanged doc is a no-op, so df counts
    stay accurate across repeated re-indexes."""
    for token in new_tokens - old_tokens:
        db.execute("""
            INSERT INTO IndexStats (stat_key, stat_value) VALUES (%s, 1)
            ON CONFLICT (stat_key) DO UPDATE
                SET stat_value = IndexStats.stat_value + 1, updated_at=NOW()
        """, (f'df:{token}',))
    for token in old_tokens - new_tokens:
        db.execute("""
            UPDATE IndexStats SET stat_value = GREATEST(stat_value - 1, 0),
                   updated_at=NOW()
            WHERE stat_key=%s
        """, (f'df:{token}',))

def indexing_worker():
    """Poll for documents that need indexing and process them.

    Each batch runs inside one transaction (assuming the db helper exposes a
    transaction context manager) so the FOR UPDATE SKIP LOCKED row locks are
    held until the whole batch commits.
    """
    while True:
        with db.transaction():
            docs = db.fetchall("""
                SELECT doc_id, title, body, tags FROM SearchDocument
                WHERE indexed_at IS NULL OR indexed_at < updated_at
                ORDER BY updated_at ASC
                LIMIT 50
                FOR UPDATE SKIP LOCKED
            """)
            for doc in docs:
                index_document(doc['doc_id'], doc['title'], doc['body'], doc['tags'] or [])
        if not docs:
            time.sleep(1)

BM25 Query Execution

def search(query: str, entity_type: str | None = None,
           limit: int = 20, offset: int = 0) -> list:
    """
    Full-text search using BM25 ranking.
    BM25 score = sum over query terms of:
        IDF(t) * (TF(t,d) * (k1+1)) / (TF(t,d) + k1*(1-b+b*|d|/avgdl))
    k1=1.5, b=0.75 (standard defaults)
    """
    tokens = [t for t, _ in tokenize(query)]
    if not tokens:
        return []

    stats = _load_stats()
    N = stats.get('total_docs', 1)
    avgdl = stats.get('avg_body_length', 100)
    k1, b = 1.5, 0.75

    # Fetch postings for all query tokens in one query
    placeholders = ','.join(['%s'] * len(tokens))
    postings = db.fetchall(f"""
        SELECT i.token, i.doc_id, i.frequency, i.field_weight,
               s.title, s.entity_type, s.entity_id, s.boost,
               s.published_at,
               (SELECT COALESCE(SUM(frequency),0) FROM InvertedIndex
                WHERE doc_id=i.doc_id AND field='body') AS body_length
        FROM InvertedIndex i
        JOIN SearchDocument s USING (doc_id)
        WHERE i.token IN ({placeholders})
          {'AND s.entity_type=%s' if entity_type else ''}
    """, tokens + ([entity_type] if entity_type else []))

    # Compute BM25 scores per doc
    scores: dict[int, float] = {}
    doc_meta: dict[int, dict] = {}

    for p in postings:
        doc_id = p['doc_id']
        token = p['token']

        df = stats.get(f'df:{token}', 1)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)

        tf = p['frequency'] * p['field_weight']  # field-weighted TF
        dl = max(p['body_length'], 1)
        bm25 = idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))

        scores[doc_id] = scores.get(doc_id, 0) + bm25 * p['boost']
        doc_meta[doc_id] = {
            'doc_id': doc_id, 'title': p['title'],
            'entity_type': p['entity_type'], 'entity_id': p['entity_id'],
            'published_at': p['published_at'],
        }

    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    results = []
    for doc_id, score in ranked[offset:offset+limit]:
        result = doc_meta[doc_id]
        result['score'] = round(score, 4)
        results.append(result)

    return results
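The two properties that distinguish BM25 from raw TF weighting (saturation via k1, length normalization via b) can be seen directly by evaluating the per-term formula from search()'s docstring on toy inputs; bm25_term is an illustrative helper, not part of the pipeline above:

```python
import math

def bm25_term(tf: float, df: int, n_docs: int, doc_len: int, avgdl: float,
              k1: float = 1.5, b: float = 0.75) -> float:
    """Single-term BM25 contribution, matching the formula in search()."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avgdl))

# Saturation: 10x the term frequency yields only a marginally higher score
s10 = bm25_term(tf=10, df=50, n_docs=10_000, doc_len=100, avgdl=100)
s100 = bm25_term(tf=100, df=50, n_docs=10_000, doc_len=100, avgdl=100)
print(round(s100 / s10, 2))  # → 1.13, nowhere near 10x

# Length normalization: the same tf scores higher in a shorter document
short = bm25_term(tf=5, df=50, n_docs=10_000, doc_len=50, avgdl=100)
long_ = bm25_term(tf=5, df=50, n_docs=10_000, doc_len=1_000, avgdl=100)
print(short > long_)  # → True
```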

def _load_stats() -> dict:
    rows = db.fetchall("SELECT stat_key, stat_value FROM IndexStats")
    return {r['stat_key']: float(r['stat_value']) for r in rows}

Key Design Decisions

  • Field-weighted TF: multiplying term frequency by field_weight (title=2.0, body=1.0) before applying BM25 boosts title matches over body matches without requiring a separate index per field. A title match contributes twice the term frequency of the same match in the body; the resulting score advantage is then shaped by BM25's saturation curve rather than being a flat 2×.
  • Incremental indexing via updated_at: the index worker picks documents where indexed_at IS NULL OR indexed_at < updated_at — documents changed since last indexing. This is efficient for high-update scenarios: only changed documents are re-indexed. The SKIP LOCKED clause allows multiple parallel indexing workers without conflicts.
  • Document frequency in IndexStats: storing df:token in a stats table avoids a COUNT(DISTINCT doc_id) query per token during search — which would be prohibitively expensive at query time. The trade-off: df values are slightly stale (eventually consistent with a few seconds delay after indexing). For BM25, this is acceptable.
  • Position arrays for highlighting: storing token positions (word offsets within the field) enables snippet generation: extract the window of words around the match positions and bold the matched tokens. Without positions, highlighting requires re-scanning the document text at query time.
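A sketch of that snippet generation, assuming the stored positions index into the whitespace-split word list (which matches how tokenize() assigns them against the cleaned text — punctuation-heavy raw text can shift alignment, so production code should align against the cleaned field). make_snippet and its window parameter are illustrative names:

```python
def make_snippet(text: str, match_positions: list, window: int = 5) -> str:
    """Build a highlighted snippet around the first match.

    match_positions are word offsets, as stored in InvertedIndex.positions.
    """
    words = text.split()
    if not match_positions:
        return ' '.join(words[:2 * window])
    center = match_positions[0]
    start = max(0, center - window)
    end = min(len(words), center + window + 1)
    rendered = [f'**{words[i]}**' if i in match_positions else words[i]
                for i in range(start, end)]
    prefix = '…' if start > 0 else ''
    suffix = '…' if end < len(words) else ''
    return prefix + ' '.join(rendered) + suffix

text = "we are hiring a senior python engineer to build search infrastructure"
# 'python' sits at word offset 5 in this field
print(make_snippet(text, [5], window=3))
# → …hiring a senior **python** engineer to build…
```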

Frequently Asked Questions

Why use BM25 instead of TF-IDF for search ranking?

TF-IDF ranks a document higher the more times a query term appears in it, with no saturation limit: a document containing "python" 100 times scores 10× higher than one containing it 10 times, even though the difference in relevance is marginal. BM25 applies a saturation function, so term frequency contributes diminishing returns beyond a threshold (controlled by k1, typically 1.5). BM25 also applies document length normalization (controlled by b, typically 0.75): a short, focused document with 5 occurrences of "python" ranks above a 10,000-word document with the same 5 occurrences, because term density is higher in the short document. In practice, BM25 produces meaningfully better rankings than TF-IDF with only slightly more computation, which is why Lucene, Elasticsearch, and most production search engines use it as the default.

How do you implement typo-tolerant search without a full Elasticsearch deployment?

Typo tolerance (fuzzy matching) requires finding tokens similar to the query token. Three approaches on Postgres:
  • pg_trgm extension: create a GIN index with CREATE INDEX ON InvertedIndex USING GIN (token gin_trgm_ops), then query WHERE token % 'pythn' OR token = 'python'. The % operator uses trigram similarity (0.3 threshold by default) to match close spellings. Fast for short strings; adds roughly 20–30% overhead per query.
  • Levenshtein distance: SELECT token FROM InvertedIndex WHERE levenshtein(token, 'pythn') <= 2 gives exact edit distance. It works well but requires a full scan unless combined with a pre-filter.
  • Soundex/metaphone: phonetic matching (python ≈ pithon). Good for names, poor for technical terms.
For most small-to-medium search systems (<10M documents), pg_trgm is the practical choice. Beyond that, Elasticsearch's native fuzzy query support becomes worth the operational cost.

How do you build a search autocomplete (type-ahead) on top of this index?

Autocomplete needs prefix matching: as the user types "pyth", return suggestions for tokens starting with "pyth". The inverted index stores exact tokens, so it cannot answer prefix queries efficiently. Autocomplete-specific additions:
  • A prefix-sorted index: CREATE INDEX ON InvertedIndex(token text_pattern_ops) supports LIKE 'pyth%' queries.
  • A separate SuggestToken table holding the top N tokens sorted by document frequency; autocomplete pulls from this table, not the full index.
  • A Redis sorted set: load popular tokens into a sorted set keyed on their prefix (using lexicographic scoring). ZRANGEBYLEX returns all suggestions with a given prefix in O(log N).
For a dedicated autocomplete path, the Redis sorted-set approach adds under 1ms of latency versus 5–20ms for a Postgres prefix scan. Combine with recent user searches (personalization) weighted higher than global frequency.

How do you handle multilingual search across documents in different languages?

Multilingual search requires language-specific tokenization and stemming: the English stemmer collapses "running" and "run", but applying it to French text produces garbage. Three approaches:
  • Language per document: detect the document's language at index time, apply the appropriate tokenizer and stemmer, and store language='fr' in SearchDocument. At query time, detect the query language and use the matching stemmer. Postgres supports multiple dictionaries: to_tsvector('french', body).
  • Parallel indexes: index each document in both English and the detected language, rank the English match and the language-specific match separately, and merge.
  • Unicode normalization: always apply NFD normalization and strip diacritics before tokenizing ("café" → "cafe") so accent variations match regardless of input method.
For a simpler system, detect the top 2–3 languages in your corpus and build separate index columns per language. Cross-language search (query in English, document in French) requires machine translation, which is out of scope for most product search systems.

How do you update the search index when a document's content changes frequently?

High-update documents (product prices changing hourly, user profiles edited daily) create indexing load spikes if every field change triggers a full re-index. Optimizations:
  • Debounce indexing: when a document is updated, set indexed_at=NULL but don't immediately re-index; the indexing worker picks it up on the next poll cycle (1–5 seconds), so multiple updates within that window collapse into one indexing operation.
  • Field-level delta: only re-index the fields that changed. Track changed_fields in SearchDocument and delete + re-insert only those fields' InvertedIndex rows.
  • Soft fields: for fields that change constantly (price, stock count) but aren't part of the inverted index (users don't search for prices), store them in SearchDocument.metadata JSONB and return them in results without re-indexing. The inverted index only needs rebuilding when searchable text fields (title, body, tags) change.
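The trigram similarity behind pg_trgm's % operator can be illustrated in a few lines. This is a simplified Python analogue (pg_trgm pads words and compares trigram sets in the same spirit, though its exact scores may differ slightly):

```python
def trigrams(word: str) -> set:
    """Trigram set in the style of pg_trgm: pad with two leading spaces and
    one trailing space so word boundaries contribute trigrams too."""
    padded = f'  {word.lower()} '
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a: str, b: str) -> float:
    """Shared trigrams over total distinct trigrams (what % thresholds on)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(round(similarity('python', 'pythn'), 2))  # → 0.44, above the 0.3 default threshold
print(similarity('python', 'java'))             # → 0.0
```

A one-character typo keeps most trigrams intact, which is why the score stays well above the matching threshold while an unrelated word drops to zero.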
