System Design Interview: Search Engine (Google-Scale Web Search)

Designing a web search engine is one of the most complex system design problems: it combines large-scale crawling, indexing billions of documents, real-time query processing, and ML-powered ranking. This question appears in interviews at Google, Microsoft (Bing), and companies building internal search.

Requirements Clarification

  • Scale: Index 10B web pages, serve 100k queries/second, freshness latency <24h for new pages
  • Latency: Query response <200ms p99
  • Relevance: Results ranked by relevance (text match + authority + freshness + user signals)
  • Features: Full-text search, spell correction, autocomplete, featured snippets, image/news verticals

High-Level Pipeline

"""
Search engine data flow:

1. Crawl: Web Crawler discovers and downloads pages
   URL Frontier → Fetcher → Robots.txt check → HTML parser
   → Link extractor → DNS resolver → Content store

2. Index: Process downloaded pages into searchable format
   HTML → text extraction → tokenization → stemming/lemmatization
   → Build inverted index → Compute document features (PageRank, etc.)
   → Distribute index shards across serving nodes

3. Serve: Answer queries in real time
   Query → spell-correct → query parse → index lookup (multi-shard)
   → merge + score → personalize → rerank → serve top-k results

Separation matters: indexing pipeline runs continuously in background;
query serving must respond in <200ms. Never mix these paths.
"""
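
The crawl stage's URL Frontier can be sketched as a politeness-aware queue: one FIFO per host plus a min-heap of per-host cooldowns, so no single server is fetched faster than a crawl delay allows. The class name and the fixed delay below are illustrative assumptions, not the production design:

```python
import heapq
import time
from collections import deque
from typing import Deque, Dict, List, Optional, Set, Tuple
from urllib.parse import urlparse

class URLFrontier:
    """Politeness-aware URL frontier (illustrative sketch, not production code)."""

    def __init__(self, crawl_delay: float = 1.0):
        self.crawl_delay = crawl_delay
        self.host_queues: Dict[str, Deque[str]] = {}
        self.ready: List[Tuple[float, str]] = []  # Heap of (next_fetch_time, host)
        self.seen: Set[str] = set()               # Dedup already-discovered URLs

    def add_url(self, url: str) -> None:
        if url in self.seen:
            return  # Never enqueue the same URL twice
        self.seen.add(url)
        host = urlparse(url).netloc
        if host not in self.host_queues:
            self.host_queues[host] = deque()
            heapq.heappush(self.ready, (0.0, host))  # New host: fetchable now
        self.host_queues[host].append(url)

    def next_url(self, now: Optional[float] = None) -> Optional[str]:
        now = time.monotonic() if now is None else now
        while self.ready:
            next_time, host = self.ready[0]
            if next_time > now:
                return None  # Every host is still on its politeness cooldown
            heapq.heappop(self.ready)
            queue = self.host_queues[host]
            if not queue:
                continue  # Host drained; drop it from the heap
            url = queue.popleft()
            heapq.heappush(self.ready, (now + self.crawl_delay, host))
            return url
        return None
```

Production frontiers add per-host robots.txt rules, priority tiers, and disk-backed queues, but the heap-of-cooldowns shape is the same.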

Inverted Index

from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List
import math
import heapq

@dataclass
class Posting:
    doc_id:    int
    tf:        float       # Term frequency in document
    positions: List[int]   # Token positions for phrase queries

@dataclass
class IndexEntry:
    term:        str
    df:          int        # Document frequency (how many docs contain term)
    postings:    List[Posting]   # Sorted by doc_id for merge

class InMemoryIndex:
    """Simplified in-memory inverted index. Production engines use immutable on-disk segment files (e.g. Lucene segments)."""

    def __init__(self):
        self.index: Dict[str, IndexEntry] = {}
        self.doc_count = 0
        self.doc_lengths: Dict[int, int] = {}  # doc_id -> total tokens

    def add_document(self, doc_id: int, tokens: List[str]):
        self.doc_count += 1
        self.doc_lengths[doc_id] = len(tokens)

        # Build term → positions map
        positions: Dict[str, List[int]] = defaultdict(list)
        for pos, token in enumerate(tokens):
            positions[token].append(pos)

        for term, pos_list in positions.items():
            tf = len(pos_list)  # Raw term count; BM25's b parameter handles length normalization
            posting = Posting(doc_id=doc_id, tf=tf, positions=pos_list)

            if term not in self.index:
                self.index[term] = IndexEntry(term=term, df=0, postings=[])
            entry = self.index[term]
            entry.df += 1
            entry.postings.append(posting)  # Stays sorted if doc_ids are added in increasing order

    def get_postings(self, term: str) -> List[Posting]:
        entry = self.index.get(term)
        return entry.postings if entry else []

    def df(self, term: str) -> int:
        entry = self.index.get(term)
        return entry.df if entry else 0

class BM25Scorer:
    """
    BM25: industry-standard relevance scoring (improvement over TF-IDF).
    k1=1.5: term frequency saturation parameter
    b=0.75:  document length normalization
    """
    def __init__(self, index: InMemoryIndex, k1: float = 1.5, b: float = 0.75):
        self.index = index
        self.k1 = k1
        self.b = b
        self.avg_doc_len = (
            sum(index.doc_lengths.values()) / max(len(index.doc_lengths), 1)
        )

    def idf(self, term: str) -> float:
        """Inverse document frequency — rare terms are more informative."""
        df = self.index.df(term)
        n  = self.index.doc_count
        if df == 0:
            return 0.0
        return math.log((n - df + 0.5) / (df + 0.5) + 1)

    def score(self, doc_id: int, doc_length: int, term: str, tf: float) -> float:
        idf = self.idf(term)
        norm_tf = (tf * (self.k1 + 1)) / (
            tf + self.k1 * (1 - self.b + self.b * doc_length / self.avg_doc_len)
        )
        return idf * norm_tf

    def query(self, query_terms: List[str], top_k: int = 10) -> List[tuple]:
        """Return top-k (score, doc_id) pairs for query terms."""
        doc_scores: Dict[int, float] = defaultdict(float)

        for term in query_terms:
            for posting in self.index.get_postings(term):
                doc_len = self.index.doc_lengths.get(posting.doc_id, 1)
                doc_scores[posting.doc_id] += self.score(
                    posting.doc_id, doc_len, term, posting.tf
                )

        return heapq.nlargest(
            top_k, ((score, doc_id) for doc_id, score in doc_scores.items())
        )
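
To make the k1 saturation concrete, here is a self-contained numeric check of the same per-term scoring formula used above (the corpus numbers are made up for illustration):

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    """One term's BM25 contribution, using the same formula as BM25Scorer."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm_tf

# Term-frequency saturation: going from tf=1 to tf=2 gains far more
# than going from tf=10 to tf=11 for the same term.
args = dict(df=100, n_docs=10_000, doc_len=100, avg_len=100)
gain_low  = bm25_term_score(2, **args) - bm25_term_score(1, **args)
gain_high = bm25_term_score(11, **args) - bm25_term_score(10, **args)
assert gain_low > 5 * gain_high  # Diminishing returns from repeated terms
```

This is exactly why keyword stuffing stopped working once engines moved past raw TF-IDF.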

PageRank — Link Authority

from collections import defaultdict
from typing import Dict, List, Optional, Set
import random

def compute_pagerank(
    graph: Dict[int, List[int]],  # page_id -> [out-link page_ids]
    damping: float = 0.85,
    iterations: int = 50,
    n_pages: Optional[int] = None
) -> Dict[int, float]:
    """
    PageRank: iterative computation.
    PR(p) = (1-d)/N + d * sum(PR(q)/out_degree(q) for q in in_links(p))
    Converges in ~50 iterations for most graphs.
    Simplification: rank mass from dangling pages (no out-links) is dropped.
    Production: MapReduce/Spark for web-scale (10B nodes).
    """
    if n_pages is None:
        n_pages = len(graph)

    # Initialize uniform PageRank
    pr = {page: 1.0 / n_pages for page in graph}

    # Build reverse index (in-links)
    in_links: Dict[int, List[int]] = defaultdict(list)
    for page, out_links in graph.items():
        for target in out_links:
            in_links[target].append(page)

    for _ in range(iterations):
        new_pr = {}
        for page in graph:
            rank_sum = sum(
                pr[q] / max(len(graph[q]), 1)
                for q in in_links.get(page, [])
            )
            new_pr[page] = (1 - damping) / n_pages + damping * rank_sum
        pr = new_pr

    return pr
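
A sanity check on a symmetric 3-page cycle (0 -> 1 -> 2 -> 0): by symmetry every page must converge to rank 1/3. This inlines the same update rule for a case whose answer is known:

```python
damping, n = 0.85, 3
# In the cycle, page p's single in-link is page (p - 1) % n,
# and every page has out-degree exactly 1.
pr = {p: 1.0 / n for p in range(n)}
for _ in range(50):
    pr = {p: (1 - damping) / n + damping * pr[(p - 1) % n] for p in range(n)}

assert all(abs(rank - 1 / 3) < 1e-9 for rank in pr.values())
```

The fixed point x = (1-d)/n + d*x gives x = 1/n regardless of d, which is why symmetric graphs make good regression tests for PageRank code.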

def personalized_pagerank(
    seed_nodes: Set[int],
    graph: Dict[int, List[int]],
    alpha: float = 0.15,  # Teleport probability back to seed
    steps: int = 1000
) -> Dict[int, int]:
    """
    Random walk with restarts: starts at seed, teleports back with prob alpha.
    Useful for personalized search and recommendations.
    Returns visit counts per node.
    """
    visits: Dict[int, int] = defaultdict(int)
    current = random.choice(list(seed_nodes))

    for _ in range(steps):
        visits[current] += 1
        if random.random() < alpha or not graph.get(current):
            current = random.choice(list(seed_nodes))
        else:
            current = random.choice(graph[current])

    return dict(visits)
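
The random-walk variant can be checked on a tiny star graph; this inlines the same walk with a fixed seed for determinism (the graph and parameters are made up for illustration):

```python
import random
from collections import defaultdict

random.seed(42)
graph = {0: [1, 2], 1: [0], 2: [0]}  # Hub 0 linked both ways to leaves 1 and 2
seed_nodes = [0]
alpha, steps = 0.15, 10_000

visits = defaultdict(int)
current = 0
for _ in range(steps):
    visits[current] += 1
    if random.random() < alpha or not graph[current]:
        current = random.choice(seed_nodes)   # Teleport back to the seed
    else:
        current = random.choice(graph[current])

# The hub dominates: the walk returns to it after every leaf visit.
assert visits[0] > visits[1] and visits[0] > visits[2]
```

Visit counts approximate the personalized PageRank distribution; in practice many walks are run in parallel and counts are normalized.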

Distributed Index Serving

"""
Index sharding strategies:

Term partitioning:
  Shard by term: term "apple" always on shard 3.
  Pro: each shard handles a query independently if query is one term.
  Con: multi-term queries require scatter-gather to all shards.

Document partitioning (more common):
  Shard by document: doc 1-1M on shard 0, 1M-2M on shard 1, etc.
  Pro: each doc data is self-contained on one shard.
  Con: every query must fan-out to ALL shards and merge results.
  This is what Google, Elasticsearch, and Solr use.

Serving architecture:
  Query → Query router
         → Scatter: send to all N index shards in parallel
         → Each shard: BM25 score + local top-k
         → Gather: merge N sorted lists → global top-k
         → Re-ranking: apply ML model (BERT, LambdaMART)
         → Freshness boost, personalization
         → Return top 10 results

Replication:
  Each shard has 3 replicas (primary + 2 read replicas).
  Load balance reads across replicas.
  Primary handles index updates (new documents, refreshes).
"""

class QueryRouter:
    def __init__(self, shards: list):
        self.shards = shards  # List of shard client connections

    def scatter_gather(self, query_terms: list, top_k: int = 10) -> list:
        """Fan out query to all shards, merge results."""
        import concurrent.futures

        # Scatter: query all shards in parallel
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = {
                executor.submit(shard.query, query_terms, top_k): shard
                for shard in self.shards
            }
            all_results = []
            try:
                # Overall 150ms deadline for the scatter; as_completed only
                # yields finished futures, so result() returns immediately.
                for future in concurrent.futures.as_completed(futures, timeout=0.15):
                    try:
                        all_results.extend(future.result())
                    except Exception:
                        pass  # Degraded mode: skip a failed shard
            except concurrent.futures.TimeoutError:
                pass  # Degraded mode: serve whatever answered within the deadline

        # Gather: merge sorted lists, pick global top-k
        return heapq.nlargest(top_k, all_results, key=lambda x: x[0])
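
The gather step reduces to picking the global top-k from the shards' local top-k lists; with document partitioning the doc IDs are disjoint across shards, so a plain selection over (score, doc_id) pairs suffices (scores below are made up):

```python
import heapq

shard_results = [
    [(0.9, 101), (0.4, 102)],  # Shard A's local top-k: (score, doc_id)
    [(0.7, 201), (0.6, 202)],  # Shard B's local top-k
]
merged = heapq.nlargest(3, (hit for shard in shard_results for hit in shard))
assert merged == [(0.9, 101), (0.7, 201), (0.6, 202)]
```

A key correctness property: if every shard returns its local top-k, the global top-k is guaranteed to be among the returned candidates, so no second round trip is needed.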

Query Processing Pipeline

import re
from typing import List, Tuple

class QueryProcessor:
    def __init__(self, spell_checker, synonyms_db):
        self.spell = spell_checker
        self.synonyms = synonyms_db

    def process(self, raw_query: str) -> dict:
        # 1. Normalize (keep ", +, -, : so phrase/operator/site: syntax survives)
        query = raw_query.lower().strip()
        query = re.sub(r'[^\w\s"+:-]', " ", query)

        # 2. Detect query type: phrase search wrapped in double-quotes
        is_phrase = len(query) > 2 and query.startswith('"') and query.endswith('"')
        is_site_filter = "site:" in query

        # 3. Tokenize and spell-correct
        if is_phrase:
            terms = [query[1:-1]]  # Strip surrounding quotes
        else:
            terms = [self.spell.correct(t) for t in query.split()]

        # 4. Expand with synonyms (optional — boosts recall, hurts precision)
        expanded = []
        for term in terms:
            expanded.append(term)
            expanded.extend(self.synonyms.get(term, [])[:2])  # Max 2 synonyms

        # 5. Extract boolean operators (+ must, - must-not)
        must_terms     = [t[1:] for t in terms if t[:1] == "+"]
        must_not_terms = [t[1:] for t in terms if t[:1] == "-"]
        normal_terms   = [t for t in terms if t[:1] not in ("+", "-")]

        return {
            "must":      must_terms,
            "must_not":  must_not_terms,
            "should":    normal_terms,
            "expanded":  expanded,
            "is_phrase": is_phrase,
        }
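
Downstream, applying the parsed must/must_not sets to candidate documents is a simple filter; a self-contained sketch of that post-parse step (the helper name and toy data are illustrative):

```python
def apply_boolean_filters(candidates, doc_terms, must, must_not):
    """Keep docs containing every `must` term and no `must_not` term.

    candidates: iterable of doc_ids; doc_terms: doc_id -> set of terms.
    """
    return [
        d for d in candidates
        if all(t in doc_terms[d] for t in must)
        and not any(t in doc_terms[d] for t in must_not)
    ]

docs = {1: {"apple", "pie"}, 2: {"apple", "iphone"}, 3: {"pie"}}
# Query "+apple -iphone": only doc 1 survives both filters.
assert apply_boolean_filters([1, 2, 3], docs, must=["apple"], must_not=["iphone"]) == [1]
```

In a real engine this filtering happens during the posting-list intersection rather than after candidate retrieval, but the semantics are the same.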

Key Design Decisions

Component        | Technology/Approach                 | Scale Notes
-----------------|-------------------------------------|-------------------------------------------------------
Crawl storage    | Distributed FS (GFS/HDFS)           | Petabyte scale; append-only raw HTML
Inverted index   | Lucene segments on SSD              | ~500GB/shard; 100s of shards
Index building   | MapReduce/Spark batch               | Full rebuild weekly; incremental daily
Ranking          | BM25 + PageRank + BERT rerank       | BM25 first-pass; BERT on top-200 candidates
Freshness        | Real-time index for news/hot topics | Separate real-time vs batch index; merge at query time
Spell correction | Noisy channel model + n-gram LM     | "Did you mean?" using edit distance + frequency
Personalization  | User history embedding              | Applied at re-ranking; GDPR-compliant with opt-out
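
The spell-correction row can be sketched in the classic noisy-channel shape: generate edit-distance-1 candidates and pick the most frequent known word. The word-frequency dict below is an illustrative stand-in for a real n-gram language model:

```python
import string

def edits1(word):
    """All strings one edit (delete/transpose/replace/insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, word_freq):
    """Return word if known, else the highest-frequency distance-1 candidate."""
    if word in word_freq:
        return word
    candidates = edits1(word) & word_freq.keys()
    return max(candidates, key=word_freq.get) if candidates else word

freq = {"search": 1000, "seat": 50}
assert correct("serch", freq) == "search"  # Missing 'a' recovered
```

Production systems weight candidates by an error model (keyboard adjacency, common typos) as well as frequency, and extend to distance 2 for short queries.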


Frequently Asked Questions

What is an inverted index in a search engine?

An inverted index maps each unique word (term) to the list of documents containing it (the posting list). Instead of scanning every document, you look up the term and get all matching documents directly. Each posting typically stores the document ID, term frequency, and token positions for phrase queries.

What is BM25 and how does it differ from TF-IDF?

BM25 (Best Match 25) improves on TF-IDF by adding document length normalization and term frequency saturation. In BM25, adding more occurrences of a term gives diminishing returns (the k1 parameter), and shorter documents are rewarded over longer ones (the b parameter). BM25 is the default ranking function in Elasticsearch and Solr.

How does PageRank work?

PageRank treats links as votes. A page's rank is proportional to the sum of the ranks of pages linking to it, each divided by that linking page's out-link count. With damping factor d=0.85, each step has a 15% chance of a random jump (teleportation). Higher-ranked pages linking to you boost your rank more.

How do you design a search engine for 10 billion web pages?

The architecture has three main phases: (1) Crawling: distributed crawlers discover and download pages, storing raw HTML. (2) Indexing: MapReduce jobs extract text, tokenize, and build a distributed inverted index partitioned by document. (3) Serving: queries fan out to all index shards in parallel, each shard returns its local top-k results, and these are merged and re-ranked by an ML model (e.g. BERT).

