Question 1

Why use BM25 instead of TF-IDF for search ranking?

Accepted Answer

With raw term counts, TF-IDF ranks a document higher the more times a query term appears in it, with no saturation limit. A document containing "python" 100 times scores 10× higher than one containing it 10 times, even though the difference in relevance is marginal. (Variants that dampen term frequency with a logarithm or square root soften this effect but never cap it.) BM25 applies a saturation function: each additional occurrence contributes less than the previous one, and the term's contribution approaches a ceiling controlled by k1 (commonly 1.2 to 2.0). BM25 also applies document length normalization (controlled by b, typically 0.75): a short, focused document with 5 occurrences of "python" ranks higher than a 10,000-word document with the same 5 occurrences, because the term density is higher in the short document. In practice, BM25 produces meaningfully better rankings than TF-IDF with only slightly more computation, which is why Lucene, Elasticsearch, and most production search engines use BM25 as the default.
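To make the saturation and length-normalization effects concrete, here is a minimal sketch of the per-term score under both schemes. The function names are illustrative, k1 = 1.5 and b = 0.75 are assumed defaults, and the IDF shown is the Lucene-style variant, which is only one of several in use.

```python
import math


def bm25_idf(num_docs: int, doc_freq: int) -> float:
    """Lucene-style BM25 IDF (an assumed variant; others exist)."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_idf_score(tf: int, idf: float) -> float:
    """Raw-count TF-IDF: grows linearly with term frequency."""
    return tf * idf


def bm25_term_score(tf: int, idf: float, doc_len: int, avg_doc_len: float,
                    k1: float = 1.5, b: float = 0.75) -> float:
    """Contribution of one query term to one document's BM25 score."""
    # Documents longer than average get a larger denominator, so a lower score.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # As tf grows, the fraction approaches (k1 + 1): that is the saturation.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


if __name__ == "__main__":
    idf = 1.0  # held constant to isolate the effect of term frequency
    for tf in (1, 10, 100):
        print(tf, tf_idf_score(tf, idf),
              round(bm25_term_score(tf, idf, 1000, 1000.0), 3))
```

Going from 10 to 100 occurrences multiplies the raw TF-IDF score by 10 but moves the BM25 score only slightly, and with the same tf a 200-word document outscores a 10,000-word one.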

Question 2

How do you implement typo-tolerant search without a full Elasticsearch deployment?

Accepted Answer

Typo tolerance (fuzzy matching) requires finding tokens similar to the query token. Three approaches on Postgres: (1) The pg_trgm extension: CREATE INDEX ON InvertedIndex USING GIN (token gin_trgm_ops). Query: WHERE token = 'pythn' OR token % 'pythn'. The % operator matches tokens whose trigram similarity to the input exceeds a threshold (0.3 by default, configurable via pg_trgm.similarity_threshold), so 'pythn' finds 'python'. Fast for short strings; adds ~20–30% overhead per query. (2) Levenshtein distance (from the fuzzystrmatch extension): SELECT token FROM InvertedIndex WHERE levenshtein(token, 'pythn') <= 2. This computes exact edit distance and works well, but it cannot use an index, so it does a full scan unless combined with a pre-filter such as the trigram match above. (3) Soundex/metaphone (also in fuzzystrmatch): phonetic matching (python ≈ pithon), which is good for names and poor for technical terms. For most small-to-medium search systems (<10M documents), pg_trgm is the practical choice. Beyond that, Elasticsearch's native fuzzy query support becomes worth the operational cost.
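To show why a trigram comparison matches 'pythn' to 'python', the following sketch approximates pg_trgm's similarity in pure Python. It is an illustration only: the padding (two leading spaces, one trailing) and the shared-over-union ratio follow pg_trgm's documented behavior, but word splitting is simplified and the helper names are hypothetical.

```python
import re


def trigrams(text: str) -> set[str]:
    """Trigram set of a string, padding each word as pg_trgm does."""
    grams: set[str] = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"  {word} "  # two spaces before, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def fuzzy_match(query: str, vocabulary: list[str],
                threshold: float = 0.3) -> list[str]:
    """Tokens at or above the threshold, most similar first."""
    scored = [(similarity(query, token), token) for token in vocabulary]
    return [token for score, token in sorted(scored, reverse=True)
            if score >= threshold]


if __name__ == "__main__":
    print(fuzzy_match("pythn", ["python", "pytorch", "java", "ruby"]))
```

With the default 0.3 threshold, 'pythn' shares 4 of 9 distinct trigrams with 'python' and matches, while 'pytorch' shares only 3 of 11 and falls just below the cutoff.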

Question 3

How do you build a search autocomplete (type-ahead) on top of this index?

Accepted Answer

Autocomplete needs prefix matching: as the user types "pyth", return suggestions containing tokens starting with "pyth". The inverted index is built for exact-token lookups, so it can't answer prefix queries efficiently without extra support. Autocomplete-specific additions: (1) add a trie or prefix-capable index: CREATE INDEX ON InvertedIndex (token text_pattern_ops) lets LIKE 'pyth%' queries use a B-tree range scan even when the database does not use the C locale; (2) maintain a separate SuggestToken table holding the top N tokens sorted by document_frequency DESC, so autocomplete pulls from this small table rather than the full index; (3) Redis sorted set: load popular tokens into a sorted set with every member given the same score (for example 0), which makes Redis order members lexicographically. ZRANGEBYLEX with a range such as [pyth to [pyth\xff then returns every token with that prefix in O(log N + M), where M is the number of tokens returned (from Redis 6.2 onward, ZRANGE with the BYLEX option is the preferred form). For a dedicated autocomplete path, the Redis sorted set approach typically adds <1ms latency vs. 5–20ms for a Postgres prefix scan. Combine with the user's recent searches (personalization), weighted higher than global frequency.
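The prefix-sorted index and the Redis lexicographic range both rely on the same idea: in a sorted list, all tokens sharing a prefix are contiguous. The sketch below shows that idea with Python's bisect module; it is not Redis or Postgres code, and the function name, the doc_freq mapping, and the sample data are assumptions for illustration.

```python
from bisect import bisect_left


def prefix_suggestions(sorted_tokens: list[str], prefix: str,
                       doc_freq: dict[str, int], limit: int = 5) -> list[str]:
    """Return up to `limit` tokens starting with `prefix`, most frequent first.

    `sorted_tokens` must already be sorted lexicographically.
    """
    # Binary search for the first token >= prefix: O(log N).
    i = bisect_left(sorted_tokens, prefix)
    matches = []
    # Walk forward until tokens stop sharing the prefix: O(M).
    while i < len(sorted_tokens) and sorted_tokens[i].startswith(prefix):
        matches.append(sorted_tokens[i])
        i += 1
    # Rank by document frequency, breaking ties alphabetically.
    matches.sort(key=lambda t: (-doc_freq.get(t, 0), t))
    return matches[:limit]


if __name__ == "__main__":
    freq = {"python": 900, "pytorch": 300, "pytest": 120,
            "pythonic": 40, "java": 700, "ruby": 200}
    tokens = sorted(freq)
    print(prefix_suggestions(tokens, "pyth", freq))
    print(prefix_suggestions(tokens, "py", freq, limit=3))
```

Ranking inside the matched range is done after the lookup, which is why keeping only the top N tokens in the suggestion store keeps M small.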

Question 4

How do you handle multilingual search across documents in different languages?

Accepted Answer

Multilingual search requires language-specific tokenization and stemming: the English stemmer collapses "running" and "run," but applying it to French text produces garbage. Three techniques, which can be combined: (1) Language per document: detect the document's language at index time, apply the appropriate tokenizer and stemmer, and store language='fr' in SearchDocument. At query time, detect the query language and use the matching stemmer. Postgres ships text search configurations for many languages: to_tsvector('french', body). (2) Parallel indexes: index each document in both English and the detected language; rank the English match and the language-specific match separately and merge the results. (3) Unicode normalization: apply NFD normalization and strip diacritics before tokenizing ("café" → "cafe") so accent variations match regardless of input method. Apply the same normalization to queries, and note that it can conflate words that differ only by accent. For a simpler system: detect the top 2–3 languages in your corpus and build separate index columns per language. Cross-language search (query in English, document in French) requires machine translation, which is out of scope for most product search systems.
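The diacritic-stripping step can be sketched with Python's standard unicodedata module. The function names are illustrative, and the case folding is an added assumption about what a tokenizer would typically do next.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, drop combining marks, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for accents and similar marks.
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def normalize_token(text: str) -> str:
    """Accent-insensitive, case-insensitive form for indexing and queries."""
    return strip_diacritics(text).casefold()


if __name__ == "__main__":
    # Precomposed "é" and "e" + combining acute normalize to the same token.
    print(normalize_token("Café"), normalize_token("Cafe\u0301"))
```

Running the same function over both documents and queries is what makes the two sides agree; normalizing only one side would reduce recall instead of improving it.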

Question 5

How do you update the search index when a document's content changes frequently?

Accepted Answer

High-update documents (product prices changing hourly, user profiles edited daily) create indexing load spikes if every field change triggers a full re-index. Optimizations: (1) Debounce indexing: when a document is updated, set indexed_at=NULL but don't re-index immediately; the indexing worker picks it up on the next poll cycle (1–5 seconds). Multiple updates within that window collapse into one indexing operation. (2) Field-level delta: only re-index the fields that changed. Track changed_fields in SearchDocument and delete and re-insert only those fields' InvertedIndex rows. (3) Soft fields: for fields that change constantly (price, stock count) but aren't part of the inverted index (users don't search for prices), store them in SearchDocument.metadata JSONB and return them in results without re-indexing when they change. The inverted index only needs rebuilding when searchable text fields (title, body, tags) change.
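The debounce and soft-field ideas can be sketched together as an in-memory simulation. This is not the system's actual worker: the SearchDocument shape, SEARCHABLE_FIELDS, and both function names are assumptions, with indexed_at = None standing in for the NULL marker.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Assumed set of fields that feed the inverted index.
SEARCHABLE_FIELDS = {"title", "body", "tags"}


@dataclass
class SearchDocument:
    doc_id: int
    fields: dict
    metadata: dict = field(default_factory=dict)
    indexed_at: Optional[float] = 0.0  # None means "pending re-index"


def apply_update(doc: SearchDocument, changes: dict) -> bool:
    """Apply changes; mark the doc dirty only if searchable text changed."""
    needs_reindex = False
    for name, value in changes.items():
        if name in SEARCHABLE_FIELDS:
            if doc.fields.get(name) != value:
                doc.fields[name] = value
                needs_reindex = True
        else:
            # Soft field: stored for display, never tokenized.
            doc.metadata[name] = value
    if needs_reindex:
        doc.indexed_at = None
    return needs_reindex


def poll_and_index(docs: list, now: float,
                   reindex: Callable[[SearchDocument], None]) -> int:
    """One poll cycle: re-index each pending doc once, however often it changed."""
    pending = [d for d in docs if d.indexed_at is None]
    for d in pending:
        reindex(d)
        d.indexed_at = now
    return len(pending)
```

Calling apply_update several times between polls leaves a single pending marker, so the worker does one indexing pass per document per cycle, and a price-only update never marks the document at all.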

Search Index System Low-Level Design: Inverted Index, BM25 Ranking, and Incremental Indexing Pipeline

Search Index System: Low-Level Design

Core Data Model

Indexing Pipeline

BM25 Query Execution

Key Design Decisions