Search Engine System Low-Level Design

Overview

A search engine has five major subsystems working in a pipeline: a web crawler fetches pages from the internet, an indexer processes and stores them in an inverted index, a query processor parses and executes user queries against the index, a ranking engine scores results by relevance, and a serving layer returns the top results with low latency.

Web Crawler Design

The crawler starts with a seed set of URLs and discovers new pages via BFS over the link graph. Key design considerations:

  • Frontier queue: A distributed priority queue of URLs to crawl, bucketed by domain to enforce per-domain crawl rate limits.
  • Politeness: Respect robots.txt exclusion rules and Crawl-delay directives. Never hammer a single host.
  • Deduplication: Maintain a URL hash set (e.g., in Redis or a distributed Bloom filter) to avoid re-crawling the same URL.
  • Scale: Distributed across thousands of crawler machines. Each machine owns a subset of domains.
  • Re-crawl scheduling: Pages are re-crawled at a frequency proportional to their change rate – news pages daily, static pages monthly.
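The crawl loop implied by these points can be sketched as a BFS with a prioritized frontier and a seen-set for deduplication. This is a single-machine toy: the distributed frontier, robots.txt checks, and per-domain rate limiting are elided, and `fetch` and `extract_links` are hypothetical stand-ins for the real fetcher and HTML parser.

```python
import heapq

def crawl(seed_urls, fetch, extract_links, max_pages=1000):
    """BFS over the link graph. `fetch(url)` returns page content;
    `extract_links(page)` returns outbound URLs (both injected stand-ins)."""
    seen = set(seed_urls)                       # dedup: never enqueue a URL twice
    frontier = [(0, url) for url in seed_urls]  # (priority, url); lower = sooner
    heapq.heapify(frontier)
    crawled = []
    while frontier and len(crawled) < max_pages:
        priority, url = heapq.heappop(frontier)
        crawled.append(url)
        for link in extract_links(fetch(url)):
            if link not in seen:                # check before enqueue, not after
                seen.add(link)
                heapq.heappush(frontier, (priority + 1, link))
    return crawled
```

In the real system the priority would come from PageRank and estimated change rate rather than crawl depth, and the seen-set would be a distributed hash set or Bloom filter as described above.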

Document Processing

Raw HTML fetched by the crawler goes through a processing pipeline before indexing:

  • HTML parsing: Extract text content, strip tags, decode entities.
  • Language detection: Identify the document language so the correct stemmer is applied.
  • Tokenization: Split text into individual tokens (words).
  • Stemming / lemmatization: Reduce tokens to their root form so “running”, “runs”, “ran” all map to “run”.
  • Stop word removal: Optionally drop high-frequency words (“the”, “a”, “is”) that carry little discriminating signal.
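The pipeline above can be sketched end to end. The stemmer here is a toy suffix-stripper standing in for a real Porter or Snowball stemmer, and the stop-word list is abbreviated; both are illustrative assumptions.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "and"}

def stem(token):
    # Toy suffix stripping; a production system would use a Porter/Snowball
    # stemmer, plus lemmatization for irregular forms like "ran" -> "run".
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def process(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())          # tokenize
    return [stem(t) for t in tokens if t not in STOP_WORDS]  # stop words, stem
```

The same `process` function must run at both index time and query time, otherwise query tokens will not match index terms.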

Inverted Index Structure

The inverted index maps each term to a postings list – a sorted list of entries, one per document containing the term. Each entry contains:

  • doc_id – the document identifier
  • term_frequency – how many times the term appears in the document
  • positions – the offsets of each occurrence (needed for phrase queries)

Postings lists are sorted by doc_id in ascending order. This enables efficient intersection of two postings lists via a two-pointer merge in O(n + m) time.
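The two-pointer merge can be written directly: advance whichever pointer holds the smaller doc_id, and emit a match when both agree.

```python
def intersect(p1, p2):
    """Intersect two doc_id-sorted postings lists in O(n + m)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the pointer at the smaller doc_id
        else:
            j += 1
    return out
```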

Compression: Instead of storing absolute doc_ids, store delta-encoded gaps (doc_id[i] – doc_id[i-1]). Gaps are small for common terms, making variable-length encoding (VByte or Elias gamma) very effective.
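A minimal delta + VByte codec illustrates the idea. This follows the common convention of setting the high bit on the terminating byte of each number; real implementations layer block-level optimizations on top.

```python
def vbyte_encode(doc_ids):
    """Delta-encode sorted doc_ids, then VByte each gap:
    7 data bits per byte, high bit set on the final byte of a number."""
    out = bytearray()
    prev = 0
    for d in doc_ids:
        gap = d - prev
        prev = d
        chunks = []                   # least-significant 7-bit chunk first
        while True:
            chunks.append(gap & 0x7F)
            gap >>= 7
            if gap == 0:
                break
        for c in reversed(chunks[1:]):
            out.append(c)             # continuation bytes, high bit clear
        out.append(chunks[0] | 0x80)  # terminating byte, high bit set
    return bytes(out)

def vbyte_decode(data):
    ids, prev, n = [], 0, 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:                  # end of one gap: rebuild absolute doc_id
            prev += n
            ids.append(prev)
            n = 0
    return ids
```

For a common term with doc_id gaps mostly under 128, each posting costs a single byte instead of four or eight.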

TF-IDF Ranking

TF-IDF scores documents by how relevant they are to a query term:

  • TF (Term Frequency): How often the term appears in the document. A document mentioning “python” 10 times is more relevant than one mentioning it once.
  • IDF (Inverse Document Frequency): log(N / df) where N is total documents and df is the number of documents containing the term. Rare terms get a higher IDF – “photosynthesis” is more discriminating than “the”.
  • Score: TF * IDF. High for documents where a rare query term appears frequently.

For multi-term queries, sum the TF-IDF scores across all query terms.
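A minimal sketch of multi-term TF-IDF scoring, assuming precomputed per-document term counts (`doc_tf`) and corpus document frequencies (`df`):

```python
import math

def tfidf_score(query_terms, doc_tf, df, n_docs):
    """Sum of TF * IDF over the query terms.
    doc_tf: term -> count in this document; df: term -> document frequency."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf and df.get(term):
            score += tf * math.log(n_docs / df[term])
    return score
```

Note that a term appearing in every document gets IDF = log(1) = 0, so ubiquitous words contribute nothing even if they survive stop-word removal.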

BM25

BM25 is a more sophisticated ranking function that improves on TF-IDF in two ways:

  • TF saturation: Doubling term frequency does not double the score. The score plateaus via a saturation parameter k1 (typically 1.2-2.0). This prevents a document that repeats a term 100 times from completely dominating.
  • Document length normalization: Long documents naturally contain more term occurrences. BM25 normalizes by document length relative to the average document length in the corpus, controlled by parameter b (typically 0.75).
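Putting both ideas together, one common formulation of BM25 looks like the following (the exact IDF variant differs between implementations; this sketch uses a Lucene-style smoothed form):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs,
               k1=1.5, b=0.75):
    """BM25 with TF saturation (k1) and document length normalization (b)."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        # Smoothed IDF, kept non-negative even for very common terms.
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
        # Length penalty: > 1 for longer-than-average docs, < 1 for shorter.
        norm = 1 - b + b * doc_len / avg_doc_len
        # TF term saturates toward (k1 + 1) as tf grows.
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score
```

The TF factor tf·(k1+1)/(tf+k1·norm) is bounded, so the 1st occurrence of a term is worth far more than the 100th.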

BM25 is the default ranking function in Elasticsearch, Apache Solr, and most modern search engines. It consistently outperforms plain TF-IDF in benchmark evaluations.

Query Processing

When a user submits a query, the query processor runs this pipeline:

  • Tokenize the query using the same tokenizer used at index time.
  • Look up postings lists for each query token.
  • Intersect postings lists (for AND queries) using a merge algorithm – walk two sorted lists with two pointers, output only matching doc_ids. Process shortest list first to prune early.
  • Score each candidate document using BM25 or TF-IDF.
  • Return top-K results using a min-heap of size K.

For OR queries, take the union of postings lists. For phrase queries, use position information to verify terms appear adjacently.
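The top-K step above uses a size-K min-heap: the weakest retained document sits at the root, so each new candidate needs only one comparison against it, and a replacement costs O(log K).

```python
import heapq

def top_k(scored_docs, k):
    """Select the k highest-scoring (score, doc_id) pairs from a stream."""
    heap = []                        # min-heap: heap[0] is the weakest kept doc
    for score, doc_id in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:     # beats the current weakest: swap it in
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # best first
```

This is O(n log K) over n candidates, versus O(n log n) for sorting everything, which matters when n is millions of matches and K is 10.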

Index Update Strategies

The index must stay fresh as new pages are crawled:

  • Batch rebuild: Rebuild the entire index from scratch nightly. Simple but adds up to 24 hours of freshness lag. Works for smaller corpora.
  • Incremental / delta index: Maintain a small “delta index” containing only recent changes. Queries search both the main index and delta index and merge results. Periodically merge the delta index into the main index. Used by most large search engines.
  • Near-real-time: Systems like Elasticsearch use segment-based storage where new documents are written to small in-memory segments, committed to disk frequently (every second), and merged in the background.
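A toy main-plus-delta index shows the query-time merge, plus the tombstone trick commonly used for deletions (deleted docs are filtered at query time until a merge removes them physically). The structure is simplified to term → doc-id sets; real segments store full postings.

```python
class DeltaIndexedSearch:
    """Main index rebuilt/merged rarely; delta absorbs recent writes."""

    def __init__(self):
        self.main = {}        # term -> set of doc_ids (large, mostly static)
        self.delta = {}       # term -> set of doc_ids (small, recent)
        self.deleted = set()  # tombstones, applied at query time

    def add(self, doc_id, terms):
        for t in terms:
            self.delta.setdefault(t, set()).add(doc_id)

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        hits = self.main.get(term, set()) | self.delta.get(term, set())
        return hits - self.deleted

    def merge(self):
        """Fold the delta into the main index (run periodically)."""
        for t, docs in self.delta.items():
            self.main.setdefault(t, set()).update(docs)
        self.delta.clear()
```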

PageRank

PageRank is an iterative algorithm that computes the importance of each page based on the link structure of the web. A page is important if many important pages link to it. The score is computed by repeatedly propagating scores along links until convergence. PageRank is pre-computed offline and used as a prior for ranking – combined with BM25 relevance score, not used alone.
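The iteration can be sketched as power iteration with a damping factor (the conventional value 0.85 is assumed here):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a link graph. links: page -> list of out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # uniform starting scores
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}   # random-jump term
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share             # propagate along each out-link
            else:
                for q in pages:                 # dangling node: spread evenly
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

Offline, this runs as a distributed computation (historically MapReduce-style) over the full web graph; at query time only the precomputed per-page score is read.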

Scale

Google indexes over 100 billion pages. To handle this scale:

  • Sharding: The index is sharded by URL hash across thousands of machines. Each shard holds a partial inverted index covering a subset of the documents.
  • Query fan-out: A query is broadcast to all shards. Each shard returns its local top-K results. A root server merges the results across shards and returns the global top-K.
  • Replication: Each shard is replicated 3x for fault tolerance and read throughput.
  • Caching: Results for popular queries are cached at a query cache layer with TTL of minutes to hours.
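The root server's merge in the fan-out step reduces to selecting the global top-K from the union of the per-shard top-K lists:

```python
import heapq

def global_top_k(shard_results, k):
    """Merge per-shard top-K lists of (score, doc_id) into the global top-K.
    Each shard returns at most K hits, so this touches N * K items, not N docs."""
    all_hits = [hit for shard in shard_results for hit in shard]
    return heapq.nlargest(k, all_hits)  # best first
```

Because each shard already truncated to its local top-K, the coordinator never sees more than shards × K candidates regardless of corpus size, which is what keeps the scatter-gather pattern cheap at the root.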
