Low Level Design: Search Engine Architecture

Building a web-scale search engine is one of the most complex distributed systems problems in computer science. It requires crawling billions of pages, building inverted indexes across petabytes of text, ranking results in milliseconds, and serving thousands of queries per second. This article traces the full pipeline from crawl to result.

Crawler Architecture

The crawler is the data collection engine. Architecture overview: start with a seed set of known URLs. Maintain a URL frontier — a priority queue of URLs to visit, distributed across many worker nodes. Workers fetch pages, parse HTML, extract links, and feed new URLs back into the frontier.

Key constraints:

  • Politeness — respect robots.txt and enforce per-domain crawl delays to avoid overwhelming origin servers.
  • Crawl budget — limit pages crawled per domain based on PageRank and freshness signals; crawl high-value pages more frequently.
  • Deduplication — avoid re-crawling identical or near-identical pages. SimHash (locality-sensitive hashing) computes a fingerprint of page content such that similar pages produce similar hashes; a Hamming distance below a threshold flags near-duplicates. Exact duplicates are caught with MD5/SHA-256 of normalized content.
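The SimHash near-duplicate check can be sketched in a few lines (a minimal illustration; production crawlers use indexed bit-permutation tables to find close fingerprints without pairwise comparison):

```python
import hashlib

def simhash(tokens, bits=64):
    """SimHash: similar token multisets yield fingerprints with a
    small Hamming distance."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Each fingerprint bit is the sign of the accumulated vote.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

def near_duplicate(tokens_a, tokens_b, threshold=3):
    """Flag near-duplicates when fingerprints differ in <= threshold bits."""
    return hamming(simhash(tokens_a), simhash(tokens_b)) <= threshold
```

Identical content gives distance 0, while unrelated pages differ in roughly half the bits (about 32 of 64), so a small threshold separates the two regimes.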

Link Graph Extraction and Anchor Text

Every page crawled yields a set of outbound links. The crawler builds a directed link graph: nodes are pages, edges are hyperlinks. This graph is the input to PageRank. Anchor text — the clickable text of a hyperlink — is extracted and associated with the target page, not the source. Anchor text from thousands of inbound links aggregates into a rich, human-authored description of the target page that often describes it more accurately than the page’s own content. This is why a page can rank for terms that do not appear in its body text.
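Link and anchor-text extraction can be sketched with the standard-library HTML parser (a simplified sketch; real crawlers also handle malformed HTML, resolve relative URLs, and honor nofollow):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (target_url, anchor_text) pairs from a page; the anchor
    text is credited to the target page in the index, not the source."""
    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, anchor_text)
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = LinkExtractor()
parser.feed('<p>See <a href="https://example.com/ml">a great ML tutorial</a>.</p>')
# parser.links -> [("https://example.com/ml", "a great ML tutorial")]
```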

PageRank Algorithm

PageRank models a random web surfer who clicks links randomly. The rank of a page is the probability of the surfer landing on it. The iterative formula:

rank(v) = (1 - d)/N + d * Σ ( rank(u) / out_degree(u) )
          for each page u with a link to v

d = 0.85  (damping factor — probability of following a link vs jumping randomly)
N = total number of pages in the graph

Initialize all ranks to 1/N. Iterate until convergence (typically ~50 iterations for web-scale graphs). The damping factor 0.85 represents the 85% probability that the surfer follows a link; 15% probability they jump to a random page (avoiding rank sink traps in disconnected subgraphs). At scale, PageRank is computed with distributed graph processing frameworks (Pregel, Apache Giraph, Spark GraphX) since the full link graph does not fit in a single machine’s memory.
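The iteration can be sketched directly on a small in-memory graph (a single-machine toy version of what Pregel-style frameworks distribute; note the handling of dangling nodes, which have no outlinks and would otherwise leak rank mass):

```python
def pagerank(graph, d=0.85, iters=50):
    """graph: {node: [outlink nodes]}. Returns ranks normalized as a
    probability distribution (they sum to ~1)."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in nodes}        # random-jump term
        for u, outs in graph.items():
            if outs:
                share = d * rank[u] / len(outs)      # split rank over outlinks
                for v in outs:
                    new[v] += share
            else:
                for v in nodes:                      # dangling node: spread uniformly
                    new[v] += d * rank[u] / n
        rank = new
    return rank

g = {"a": ["b"], "b": ["a"], "c": ["a"]}
ranks = pagerank(g)   # "a" outranks "c", which has no inbound links
```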

Inverted Index Construction

The inverted index maps each term to the list of documents containing it (postings list). Construction pipeline:

  • Forward index: for each document, record (doc_id → list of terms with positions). Built during parsing.
  • Inversion via MapReduce: Map phase emits (term, doc_id, position) for every term occurrence. Reduce phase groups by term and produces a sorted postings list: term → [(doc_id, tf, [positions]), ...].
  • Postings list encoding: doc IDs are stored as delta-encoded gaps (smaller numbers compress better with variable-byte or Elias-gamma encoding). Positions are stored for phrase query support.
  • TF-IDF and BM25: per-document term frequency (TF) is stored in each posting; inverse document frequency (IDF) is a per-term statistic stored alongside the term in the dictionary. Both are precomputed so scores can be evaluated quickly at query time.
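The inversion and gap encoding can be sketched on a single machine (the MapReduce version distributes the same grouping step; the variable-byte scheme below is one common convention, with 7 data bits per byte and the high bit marking a gap's final byte):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns term -> sorted postings list of
    (doc_id, tf, positions)."""
    acc = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            tf_pos = acc[term].setdefault(doc_id, [0, []])
            tf_pos[0] += 1          # term frequency
            tf_pos[1].append(pos)   # positions for phrase queries
    return {t: sorted((d, tf, ps) for d, (tf, ps) in by_doc.items())
            for t, by_doc in acc.items()}

def encode_doc_ids(doc_ids):
    """Delta-encode sorted doc IDs, then variable-byte encode each gap."""
    out, prev = bytearray(), 0
    for d in doc_ids:
        gap, prev = d - prev, d
        while gap >= 128:
            out.append(gap & 0x7F)
            gap >>= 7
        out.append(gap | 0x80)      # high bit terminates the gap
    return bytes(out)

idx = build_index({1: "to be or not to be", 2: "be quick"})
# idx["be"] -> [(1, 2, [1, 5]), (2, 1, [0])]
```

With delta encoding, the doc IDs [3, 7, 21] become the gaps [3, 4, 14], each of which fits in a single byte here.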

Distributed Index Sharding

A web-scale index cannot fit on one machine. Two sharding strategies:

  • Document sharding (horizontal): each shard holds the full index for a subset of documents. Queries fan out to all shards in parallel, each returns top-K results, and a merging layer combines and re-ranks them (top-K merge). Simple to implement, but query fan-out is proportional to shard count.
  • Term sharding (vertical): each shard holds the postings for a subset of terms. Multi-term queries must contact multiple shards. Better for very long postings lists but complex to coordinate for conjunctive queries.

Production systems (Google, Bing) use document sharding with thousands of leaf shards, organized in a multi-tier hierarchy: leaf servers hold shards, index servers aggregate results from leaves, and root servers aggregate across index servers. Replication provides fault tolerance — each shard is typically replicated 3x.
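Under document sharding, the merging layer is essentially a K-way merge of per-shard result lists (a sketch, assuming each shard returns (score, doc_id) pairs sorted by descending score):

```python
import heapq
import itertools

def merge_top_k(shard_results, k):
    """K-way merge of per-shard top-K lists into a global top-K.
    Each input list must be sorted by descending score."""
    merged = heapq.merge(*shard_results, key=lambda pair: -pair[0])
    return list(itertools.islice(merged, k))

shards = [
    [(0.92, "doc7"), (0.55, "doc2")],   # shard 0's local top-K
    [(0.80, "doc9"), (0.10, "doc4")],   # shard 1's local top-K
]
# merge_top_k(shards, 3) -> [(0.92, "doc7"), (0.80, "doc9"), (0.55, "doc2")]
```

Because each shard only returns its local top-K, the merge touches K × shard_count entries at most, not the full result sets.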

Query Processing Pipeline

A search query passes through multiple stages before results are returned:

  • Parse: tokenize, normalize (lowercase, stemming, stop word removal), identify operators (site:, filetype:, quotes for phrase queries).
  • Spell correction: a noisy channel model or neural spell correction suggests "did you mean" alternatives using n-gram language models trained on query logs.
  • Query expansion: add synonyms or related terms (WordNet, word2vec neighbors) to improve recall without hurting precision.
  • Retrieval: ANN (Approximate Nearest Neighbor) vector search for semantic matching + BM25 for lexical matching. Hybrid retrieval combines both scores.
  • Reranking: a learning-to-rank model reorders the top-N retrieved candidates using richer features.
  • Snippet generation: extract a human-readable summary for display.
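One simple way to combine the lexical and semantic scores in hybrid retrieval (a sketch: the min-max normalization and the weight alpha are illustrative assumptions; production systems often use learned combinations or reciprocal rank fusion instead):

```python
def hybrid_scores(bm25, dense, alpha=0.7):
    """Min-max normalize BM25 and dense (cosine) scores per query so the
    two scales are comparable, then take a weighted sum.
    alpha is an assumed tuning knob, not a standard value."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}
    b, v = norm(bm25), norm(dense)
    # A doc missing from one retriever contributes 0 for that component.
    return {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0)
            for d in set(b) | set(v)}

scores = hybrid_scores({"a": 10.0, "b": 5.0}, {"a": 0.2, "b": 0.9}, alpha=0.7)
```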

Learning to Rank

Learning to rank (LTR) trains a model to predict the relevance ordering of documents for a query. The industry standard algorithm is LambdaMART — a gradient boosted decision tree (GBDT) optimized directly for ranking metrics (NDCG). Training data comes from human relevance judgments and implicit click signals (click-through rate, dwell time, pogo-sticking). Feature set includes:

  • BM25 score for the query-document pair.
  • PageRank of the document.
  • Freshness — recency of last crawl and content change frequency.
  • Click-through rate (CTR) — historical clicks for this query-URL pair.
  • TF-IDF variants over title, body, anchor text, URL fields separately.
  • Semantic similarity — cosine similarity between query and document embeddings.

Neural ranking (BERT-based cross-encoders) provides state-of-the-art relevance but is expensive at scale. Production systems use a cascade: fast BM25 retrieves thousands of candidates, a lightweight GBDT re-ranks hundreds, and a heavier neural model re-ranks the top few dozen.
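NDCG, the metric LambdaMART optimizes, rewards placing highly relevant documents near the top by applying a logarithmic position discount. A standard formulation (using the common gain 2^rel - 1):

```python
import math

def ndcg_at_k(relevances, k):
    """relevances: graded relevance labels in the order the model ranked
    the documents. NDCG@k = DCG@k / DCG@k of the ideal ordering."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2)   # discount by position
                   for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list scores 1.0; any inversion (a less relevant document ranked above a more relevant one) pushes the score below 1.0.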

Snippet Generation

A snippet is the two-to-three line summary shown under each result. The goal is to find the passage most relevant to the query. Algorithm:

  • Split the document into sentences or fixed-length windows.
  • Score each window by the density of query terms it contains (count of matching terms / window length).
  • Select the highest-scoring window and bold the matching query terms in the output.

For structured data (articles with schema.org markup), prefer the meta description or the first paragraph. Google’s featured snippets extract a direct answer by identifying windows that start with a definition pattern or follow a question-like structure in the query.
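The window-scoring step can be sketched as follows (a toy version; real snippet generators also respect sentence boundaries and overlap windows):

```python
def best_snippet(text, query_terms, window=30):
    """Slide a fixed-length word window over the document and return the
    window with the highest density of query terms."""
    words = text.split()
    q = {t.lower() for t in query_terms}
    best, best_score = "", -1.0
    for i in range(max(1, len(words) - window + 1)):
        w = words[i:i + window]
        # density = matching terms / window length
        score = sum(1 for t in w if t.lower().strip(".,;:!?") in q) / len(w)
        if score > best_score:
            best, best_score = " ".join(w), score
    return best

doc = "filler " * 40 + "search engine ranking works well " + "filler " * 40
snippet = best_snippet(doc, ["search", "ranking"], window=5)
```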

Index Freshness

Web content changes continuously — news articles are published every second. A batch-built index updated weekly is insufficient. Production search engines maintain a real-time index (sometimes called a "freshness layer") separate from the main index. The real-time index ingests new and changed pages with latency of seconds to minutes. At query time, results from both indexes are merged and the top-K unified set is returned. The main index handles the long tail of stable content; the real-time index handles breaking news and freshly published pages. Google’s Caffeine architecture (2010) made continuous incremental indexing the default, replacing the batch crawl/index cycle.

Google-Specific Systems: Knowledge Graph and E-E-A-T

Knowledge Graph: a structured database of entities (people, places, organizations) and their relationships, derived from Wikipedia, Freebase, and web extraction. Used to power the knowledge panel (the info box on the right side of results) and to understand entity-centric queries ("height of Eiffel Tower") without retrieving documents.

E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) is Google’s quality framework for evaluating content and sources. Introduced in the Search Quality Evaluator Guidelines, it is used to train quality rater models that feed into the ranking pipeline. Experience was added (making it E-E-A-T instead of E-A-T) to reflect first-hand experience as a quality signal — a doctor writing about medicine carries more weight than an anonymous post. High E-E-A-T signals include author credentials, citations from authoritative domains, editorial standards, and factual accuracy. YMYL (Your Money or Your Life) pages — health, finance, legal — are held to the highest E-E-A-T standards.
