System Design Interview: Design a Web Search Engine
Designing a web search engine is one of the most technically rich system design questions. It requires understanding web crawling, large-scale indexing, ranking algorithms, and low-latency query serving. This guide covers the architecture used by systems like Google and Bing, distilled into a framework you can present confidently in an interview.
Clarifying Questions
- Scale: how many web pages to index? (Assume 50 billion pages)
- Query volume: how many searches per second? (Assume 100K QPS)
- Freshness: how quickly should new content appear in results? (Minutes for news, days for general content)
- Features: full-text search only, or also images, videos, knowledge panels?
Core Components
1. Web Crawler
The crawler discovers and downloads web pages. Key design decisions:
- Seed URLs: start from a curated list of high-authority domains
- URL Frontier: priority queue of URLs to crawl, prioritized by PageRank estimate and freshness
- Politeness: respect robots.txt, limit crawl rate per domain (1 req/sec default)
- Distributed crawlers: partition URL space by domain hash across thousands of crawler nodes
- Deduplication: use SimHash to detect near-duplicate pages and skip them
At 50B pages with an average of 100KB per page, that is 5 PB of raw HTML. Crawlers must also revisit pages; revisit frequency should be proportional to each page's change rate (detected via ETag/Last-Modified headers).
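The SimHash deduplication step can be sketched in a few lines of Python. This is a minimal illustration, assuming MD5 as the per-token hash; production crawlers use weighted shingles and faster hash functions:

```python
import hashlib

def simhash(tokens, bits=64):
    """Fingerprint a token list: similar pages yield nearby fingerprints."""
    weights = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

page_a = "the quick brown fox jumps over the lazy dog".split()
page_b = "the quick brown fox jumped over the lazy dog".split()
# Near-duplicates typically land within a few bits of each other;
# the crawler skips a page whose fingerprint is within a small threshold.
distance = hamming(simhash(page_a), simhash(page_b))
```

At scale, fingerprints are stored in tables keyed by bit permutations so that near matches can be found without pairwise comparison across billions of pages.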
2. Document Processor
After downloading, each page is processed:
- HTML parsing: extract text, links, metadata (title, description, canonical URL)
- Link extraction: add discovered URLs to the URL frontier
- Language detection: route to language-specific indexes
- Content extraction: remove boilerplate (navigation, ads) to isolate main content
- Indexing pipeline: tokenize text, stem words, compute term frequencies
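The last step of the pipeline above can be sketched as follows. This is a toy version: the suffix-stripping stemmer is deliberately naive, standing in for a real Porter/Snowball stemmer:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    # Naive suffix stripping; real indexers use Porter/Snowball stemmers.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def term_frequencies(text):
    return Counter(stem(t) for t in tokenize(text))

# "distributed" and "distributing" stem to the same term, so they
# share one posting list in the index.
tf = term_frequencies("Distributed systems require distributing work")
```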
3. Inverted Index
The inverted index maps each term to the list of documents containing it (a “posting list”):
"distributed" -> [(doc_12, tf=3, positions=[45,102,890]), (doc_47, tf=1, positions=[23]), ...]
"systems" -> [(doc_12, tf=7, positions=[...]), ...]
Storage: the index at Google scale is estimated at well over 100 PB. It is sharded by document ID range and replicated across datacenters. Each shard is stored as an immutable file (similar to LSM-tree SSTables) and updated via periodic merges.
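An in-memory sketch of posting-list construction; a real index builder writes these out as sorted, compressed, immutable segments rather than Python dicts:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: token list}. Returns term -> [(doc_id, tf, positions)]."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokens):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

index = build_index({
    "doc_12": ["distributed", "systems", "are", "distributed"],
    "doc_47": ["distributed", "databases"],
})
# index["distributed"] -> [("doc_12", 2, [0, 3]), ("doc_47", 1, [0])]
```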
Index tiers: a three-tier architecture balances freshness vs. cost:
- Realtime index: last 24-48 hours of crawled content, kept in memory, updated continuously
- Recent index: last 30 days, on fast SSDs
- Full index: all 50B pages, on HDD/object storage, rebuilt weekly
4. Ranking System
Given a query, thousands of documents match. Ranking determines the order. The process has two stages:
Retrieval (recall): fetch the top ~1000 candidate documents from the inverted index using BM25, a TF-IDF-family scoring function with document length normalization. The BM25 score for a document D given query Q:
score(D,Q) = Σ IDF(qi) × (tf(qi,D) × (k1+1)) / (tf(qi,D) + k1×(1 - b + b×|D|/avgdl))
Where k1≈1.5, b≈0.75 are tuning constants, |D| is document length, avgdl is average document length.
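The formula translates directly into code. A sketch, assuming precomputed document frequencies (df) and the common log(1 + (N - df + 0.5)/(df + 0.5)) IDF variant:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avgdl, df, n_docs, k1=1.5, b=0.75):
    """Score one document: doc_tf maps term -> frequency in the document."""
    score = 0.0
    for q in query_terms:
        tf = doc_tf.get(q, 0)
        if tf == 0:
            continue
        idf = math.log(1 + (n_docs - df[q] + 0.5) / (df[q] + 0.5))
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))
        score += idf * norm
    return score
```

Note how the b term penalizes long documents: a page much longer than average needs many more term occurrences to reach the same score.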
Ranking (precision): rerank the 1000 candidates using a Learning to Rank model with hundreds of signals:
- Link signals: PageRank (recursive authority from inbound links), domain authority
- Content signals: query term density, title match, URL match, freshness
- User signals: CTR from past queries, dwell time, bounce rate
- Quality signals: HTTPS, mobile-friendly, page speed, Core Web Vitals
Google uses a neural ranker (MUM/BERT-based) to understand query intent beyond keyword matching.
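For interview purposes, the second stage can be illustrated with a hand-weighted linear model. The feature names and weights here are hypothetical; production systems learn them from click logs using LambdaMART-style or neural models:

```python
# Hypothetical hand-set weights; real Learning to Rank models learn these.
WEIGHTS = {"bm25": 1.0, "pagerank": 5.0, "title_match": 2.0, "freshness": 0.5}

def rerank(candidates):
    """candidates: list of (doc_id, feature dict). Returns docs best-first."""
    def score(features):
        return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)

ranked = rerank([
    ("A", {"bm25": 3.0, "pagerank": 0.1}),  # strong text match, weak authority
    ("B", {"bm25": 2.0, "pagerank": 0.5}),  # weaker match, high authority
])
# Authority outweighs the text-match gap here, so "B" ranks first.
```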
5. Query Serving
For 100K QPS with <200ms p99 latency:
- Query understanding: spell correction, query expansion, entity recognition, intent classification (navigational / informational / transactional)
- Fan-out: query dispatcher sends the query to all index shards in parallel; each shard returns its top-K results; dispatcher merges and re-ranks globally
- Caching: popular queries (top 5% cover ~50% of traffic) are cached in Redis with a short TTL (minutes for news-sensitive queries, hours for stable queries)
- Snippets: extract the most relevant sentence around query terms from the document for the result description
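The fan-out step above reduces to a k-way merge of per-shard top-K lists, sketched here; a real dispatcher also handles per-shard timeouts and partial failures:

```python
import heapq
import itertools

def merge_shard_results(shard_results, k=10):
    """Each shard returns (score, doc_id) pairs sorted best-first;
    heapq.merge combines them lazily into one global best-first stream."""
    merged = heapq.merge(*shard_results, key=lambda r: -r[0])
    return list(itertools.islice(merged, k))

shard_1 = [(0.9, "a"), (0.5, "c")]
shard_2 = [(0.8, "b"), (0.1, "d")]
top = merge_shard_results([shard_1, shard_2], k=3)
# top -> [(0.9, "a"), (0.8, "b"), (0.5, "c")]
```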
6. PageRank Algorithm
PageRank models the probability that a random web surfer ends up on a page by following links. Formula:
PR(A) = (1-d)/N + d × Σ PR(Bi)/L(Bi)
Where d≈0.85 (damping factor), N is total pages, Bi are pages linking to A, L(Bi) is the number of outbound links from Bi. Computed iteratively via MapReduce until convergence (~50 iterations for the web graph). At 50B pages this is a massive distributed computation run periodically (weekly) and stored per-document.
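A single-machine version of the iteration, for intuition; the real computation shards the link graph across MapReduce/Pregel workers. Dangling pages (no outbound links) are handled here by spreading their rank uniformly, one common convention:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}, summing to ~1."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if not outs:
                for q in pages:          # dangling node: spread rank uniformly
                    new[q] += d * pr[p] / n
            else:
                for q in outs:           # pass rank along each outbound link
                    new[q] += d * pr[p] / len(outs)
        pr = new
    return pr

# "a" has two inbound links, so it accumulates the most rank.
ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```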
Data Flow Summary
Web → Crawler → Document Processor → Inverted Index Builder
                                              ↓
User Query → Query Understanding → Index Shards (parallel) → Merge & Rank → Results
Scalability and Reliability
- Crawl at scale: 50B pages ÷ 30-day crawl cycle = ~20K pages/second. Achieved with thousands of distributed crawler nodes
- Fault tolerance: each index shard is replicated 3x. If a shard replica fails, another serves the query; results are slightly degraded but not unavailable
- Datacenter redundancy: queries are routed to the nearest datacenter; if unavailable, traffic fails over to another region
- Index consistency: eventual consistency is acceptable — a page not appearing for a few hours is tolerable; serving stale results is better than downtime
Interview Tips
- Start with the data pipeline: crawl → process → index → rank → serve
- Know BM25 at a high level — you don't need the exact formula, but mention TF-IDF and document length normalization
- Explain the two-stage retrieval+ranking pattern — interviewers appreciate knowing you understand recall vs. precision
- Discuss the index tier architecture (realtime / recent / full) to show you understand freshness-cost trade-offs
- If asked about anti-spam, mention link farm detection, trust scores for domains, and manual quality rater programs