Search Engine System Low-Level Design

Overview

A search engine has five major subsystems working in a pipeline: a web crawler fetches pages from the internet, an indexer processes and stores them in an inverted index, a query processor parses and executes user queries against the index, a ranking engine scores results by relevance, and a serving layer returns the top results with low latency.

Web Crawler Design

The crawler starts with a seed set of URLs and discovers new pages via BFS over the link graph. Key design considerations:

  • Frontier queue: A distributed priority queue of URLs to crawl, bucketed by domain to enforce per-domain crawl rate limits.
  • Politeness: Respect robots.txt exclusion rules and Crawl-delay directives. Never hammer a single host.
  • Deduplication: Maintain a URL hash set (e.g., in Redis or a distributed Bloom filter) to avoid re-crawling the same URL.
  • Scale: Distributed across thousands of crawler machines. Each machine owns a subset of domains.
  • Re-crawl scheduling: Pages are re-crawled at a frequency proportional to their change rate – news pages daily, static pages monthly.
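The crawl loop implied by these points can be sketched as a BFS with a prioritized frontier and a seen-set for deduplication. This is a single-machine toy: the distributed frontier, robots.txt checks, and per-domain rate limiting are elided, and `fetch` and `extract_links` are hypothetical stand-ins for the real fetcher and HTML parser.

```python
import heapq

def crawl(seed_urls, fetch, extract_links, max_pages=1000):
    """BFS over the link graph. `fetch(url)` returns page content;
    `extract_links(page)` returns outbound URLs (both injected stand-ins)."""
    seen = set(seed_urls)                       # dedup: never enqueue a URL twice
    frontier = [(0, url) for url in seed_urls]  # (priority, url); lower = sooner
    heapq.heapify(frontier)
    crawled = []
    while frontier and len(crawled) < max_pages:
        priority, url = heapq.heappop(frontier)
        crawled.append(url)
        for link in extract_links(fetch(url)):
            if link not in seen:                # check before enqueue, not after
                seen.add(link)
                heapq.heappush(frontier, (priority + 1, link))
    return crawled
```

In the real system the priority would come from PageRank and estimated change rate rather than crawl depth, and the seen-set would be a distributed hash set or Bloom filter as described above.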

Document Processing

Raw HTML fetched by the crawler goes through a processing pipeline before indexing:

  • HTML parsing: Extract text content, strip tags, decode entities.
  • Language detection: Identify the document language so the correct stemmer is applied.
  • Tokenization: Split text into individual tokens (words).
  • Stemming / lemmatization: Reduce tokens to their root form so “running”, “runs”, “ran” all map to “run”.
  • Stop word removal: Optionally drop high-frequency words (“the”, “a”, “is”) that carry little discriminating signal.
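The pipeline above can be sketched end to end. The stemmer here is a toy suffix-stripper standing in for a real Porter or Snowball stemmer, and the stop-word list is abbreviated; both are illustrative assumptions.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "and"}

def stem(token):
    # Toy suffix stripping; a production system would use a Porter/Snowball
    # stemmer, plus lemmatization for irregular forms like "ran" -> "run".
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def process(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())          # tokenize
    return [stem(t) for t in tokens if t not in STOP_WORDS]  # stop words, stem
```

The same `process` function must run at both index time and query time, otherwise query tokens will not match index terms.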

Inverted Index Structure

The inverted index maps each term to a postings list – a sorted list of entries, one per document containing the term. Each entry contains:

  • doc_id – the document identifier
  • term_frequency – how many times the term appears in the document
  • positions – the offsets of each occurrence (needed for phrase queries)

Postings lists are sorted by doc_id in ascending order. This enables efficient intersection of two postings lists via a two-pointer merge in O(n + m) time.
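The two-pointer merge can be written directly: advance whichever pointer holds the smaller doc_id, and emit a match when both agree.

```python
def intersect(p1, p2):
    """Intersect two doc_id-sorted postings lists in O(n + m)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the pointer at the smaller doc_id
        else:
            j += 1
    return out
```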

Compression: Instead of storing absolute doc_ids, store delta-encoded gaps (doc_id[i] – doc_id[i-1]). Gaps are small for common terms, making variable-length encoding (VByte or Elias gamma) very effective.
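A minimal delta + VByte codec illustrates the idea. This follows the common convention of setting the high bit on the terminating byte of each number; real implementations layer block-level optimizations on top.

```python
def vbyte_encode(doc_ids):
    """Delta-encode sorted doc_ids, then VByte each gap:
    7 data bits per byte, high bit set on the final byte of a number."""
    out = bytearray()
    prev = 0
    for d in doc_ids:
        gap = d - prev
        prev = d
        chunks = []                   # least-significant 7-bit chunk first
        while True:
            chunks.append(gap & 0x7F)
            gap >>= 7
            if gap == 0:
                break
        for c in reversed(chunks[1:]):
            out.append(c)             # continuation bytes, high bit clear
        out.append(chunks[0] | 0x80)  # terminating byte, high bit set
    return bytes(out)

def vbyte_decode(data):
    ids, prev, n = [], 0, 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:                  # end of one gap: rebuild absolute doc_id
            prev += n
            ids.append(prev)
            n = 0
    return ids
```

For a common term with doc_id gaps mostly under 128, each posting costs a single byte instead of four or eight.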

TF-IDF Ranking

TF-IDF scores documents by how relevant they are to a query term:

  • TF (Term Frequency): How often the term appears in the document. A document mentioning “python” 10 times is more relevant than one mentioning it once.
  • IDF (Inverse Document Frequency): log(N / df) where N is total documents and df is the number of documents containing the term. Rare terms get a higher IDF – “photosynthesis” is more discriminating than “the”.
  • Score: TF * IDF. High for documents where a rare query term appears frequently.

For multi-term queries, sum the TF-IDF scores across all query terms.
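A minimal sketch of multi-term TF-IDF scoring, assuming precomputed per-document term counts (`doc_tf`) and corpus document frequencies (`df`):

```python
import math

def tfidf_score(query_terms, doc_tf, df, n_docs):
    """Sum of TF * IDF over the query terms.
    doc_tf: term -> count in this document; df: term -> document frequency."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf and df.get(term):
            score += tf * math.log(n_docs / df[term])
    return score
```

Note that a term appearing in every document gets IDF = log(1) = 0, so ubiquitous words contribute nothing even if they survive stop-word removal.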

BM25

BM25 is a more sophisticated ranking function that improves on TF-IDF in two ways:

  • TF saturation: Doubling term frequency does not double the score. The score plateaus via a saturation parameter k1 (typically 1.2-2.0). This prevents a document that repeats a term 100 times from completely dominating.
  • Document length normalization: Long documents naturally contain more term occurrences. BM25 normalizes by document length relative to the average document length in the corpus, controlled by parameter b (typically 0.75).
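Putting both ideas together, one common formulation of BM25 looks like the following (the exact IDF variant differs between implementations; this sketch uses a Lucene-style smoothed form):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs,
               k1=1.5, b=0.75):
    """BM25 with TF saturation (k1) and document length normalization (b)."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        # Smoothed IDF, kept non-negative even for very common terms.
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
        # Length penalty: > 1 for longer-than-average docs, < 1 for shorter.
        norm = 1 - b + b * doc_len / avg_doc_len
        # TF term saturates toward (k1 + 1) as tf grows.
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score
```

The TF factor tf·(k1+1)/(tf+k1·norm) is bounded, so the 1st occurrence of a term is worth far more than the 100th.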

BM25 is the default ranking function in Elasticsearch, Apache Solr, and most modern search engines. It consistently outperforms plain TF-IDF in benchmark evaluations.

Query Processing

When a user submits a query, the query processor runs this pipeline:

  • Tokenize the query using the same tokenizer used at index time.
  • Look up postings lists for each query token.
  • Intersect postings lists (for AND queries) using a merge algorithm – walk two sorted lists with two pointers, output only matching doc_ids. Process shortest list first to prune early.
  • Score each candidate document using BM25 or TF-IDF.
  • Return top-K results using a min-heap of size K.

For OR queries, take the union of postings lists. For phrase queries, use position information to verify terms appear adjacently.
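The top-K step above uses a size-K min-heap: the weakest retained document sits at the root, so each new candidate needs only one comparison against it, and a replacement costs O(log K).

```python
import heapq

def top_k(scored_docs, k):
    """Select the k highest-scoring (score, doc_id) pairs from a stream."""
    heap = []                        # min-heap: heap[0] is the weakest kept doc
    for score, doc_id in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:     # beats the current weakest: swap it in
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # best first
```

This is O(n log K) over n candidates, versus O(n log n) for sorting everything, which matters when n is millions of matches and K is 10.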

Index Update Strategies

The index must stay fresh as new pages are crawled:

  • Batch rebuild: Rebuild the entire index from scratch nightly. Simple but adds up to 24 hours of freshness lag. Works for smaller corpora.
  • Incremental / delta index: Maintain a small “delta index” containing only recent changes. Queries search both the main index and delta index and merge results. Periodically merge the delta index into the main index. Used by most large search engines.
  • Near-real-time: Systems like Elasticsearch use segment-based storage where new documents are written to small in-memory segments, committed to disk frequently (every second), and merged in the background.
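A toy main-plus-delta index shows the query-time merge, plus the tombstone trick commonly used for deletions (deleted docs are filtered at query time until a merge removes them physically). The structure is simplified to term → doc-id sets; real segments store full postings.

```python
class DeltaIndexedSearch:
    """Main index rebuilt/merged rarely; delta absorbs recent writes."""

    def __init__(self):
        self.main = {}        # term -> set of doc_ids (large, mostly static)
        self.delta = {}       # term -> set of doc_ids (small, recent)
        self.deleted = set()  # tombstones, applied at query time

    def add(self, doc_id, terms):
        for t in terms:
            self.delta.setdefault(t, set()).add(doc_id)

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        hits = self.main.get(term, set()) | self.delta.get(term, set())
        return hits - self.deleted

    def merge(self):
        """Fold the delta into the main index (run periodically)."""
        for t, docs in self.delta.items():
            self.main.setdefault(t, set()).update(docs)
        self.delta.clear()
```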

PageRank

PageRank is an iterative algorithm that computes the importance of each page based on the link structure of the web. A page is important if many important pages link to it. The score is computed by repeatedly propagating scores along links until convergence. PageRank is pre-computed offline and used as a prior for ranking – combined with BM25 relevance score, not used alone.
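The iteration can be sketched as power iteration with a damping factor (the conventional value 0.85 is assumed here):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a link graph. links: page -> list of out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # uniform starting scores
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}   # random-jump term
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share             # propagate along each out-link
            else:
                for q in pages:                 # dangling node: spread evenly
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

Offline, this runs as a distributed computation (historically MapReduce-style) over the full web graph; at query time only the precomputed per-page score is read.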

Scale

Google indexes over 100 billion pages. To handle this scale:

  • Sharding: The index is sharded by URL hash across thousands of machines. Each shard holds a partial inverted index covering a subset of the documents.
  • Query fan-out: A query is broadcast to all shards. Each shard returns its local top-K results. A root server merges the results across shards and returns the global top-K.
  • Replication: Each shard is replicated 3x for fault tolerance and read throughput.
  • Caching: Results for popular queries are cached at a query cache layer with TTL of minutes to hours.
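The root server's merge in the fan-out step reduces to selecting the global top-K from the union of the per-shard top-K lists:

```python
import heapq

def global_top_k(shard_results, k):
    """Merge per-shard top-K lists of (score, doc_id) into the global top-K.
    Each shard returns at most K hits, so this touches N * K items, not N docs."""
    all_hits = [hit for shard in shard_results for hit in shard]
    return heapq.nlargest(k, all_hits)  # best first
```

Because each shard already truncated to its local top-K, the coordinator never sees more than shards × K candidates regardless of corpus size, which is what keeps the scatter-gather pattern cheap at the root.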
