Overview
A search engine has five major subsystems working in a pipeline: a web crawler fetches pages from the internet, an indexer processes and stores them in an inverted index, a query processor parses and executes user queries against the index, a ranking engine scores results by relevance, and a serving layer returns the top results with low latency.
Web Crawler Design
The crawler starts with a seed set of URLs and discovers new pages via BFS over the link graph. Key design considerations:
- Frontier queue: A distributed priority queue of URLs to crawl, bucketed by domain to enforce per-domain crawl rate limits.
- Politeness: Respect robots.txt exclusion rules and Crawl-delay directives. Never hammer a single host.
- Deduplication: Maintain a URL hash set (e.g., in Redis or a distributed Bloom filter) to avoid re-crawling the same URL.
- Scale: Distributed across thousands of crawler machines. Each machine owns a subset of domains.
- Re-crawl scheduling: Pages are re-crawled at a frequency proportional to their change rate – news pages daily, static pages monthly.
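The frontier and politeness logic above can be sketched in a few lines. This is a single-machine toy (the class name, crawl_delay default, and in-memory seen-set are illustrative; a production system would use a distributed queue and a Bloom filter):

```python
import time
from collections import deque
from urllib.parse import urlparse

class Frontier:
    """Toy crawl frontier: one FIFO queue per domain plus a per-domain
    next-allowed-fetch time to enforce a politeness delay."""

    def __init__(self, crawl_delay=1.0):
        self.queues = {}       # domain -> deque of URLs awaiting crawl
        self.next_fetch = {}   # domain -> earliest allowed fetch time
        self.seen = set()      # URL dedup set (Redis/Bloom filter at scale)
        self.crawl_delay = crawl_delay

    def add(self, url):
        """Enqueue a discovered URL unless it was already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        domain = urlparse(url).netloc
        self.queues.setdefault(domain, deque()).append(url)
        return True

    def pop(self, now=None):
        """Return a URL from any domain whose politeness delay has elapsed,
        or None if every non-empty domain is still rate-limited."""
        now = time.monotonic() if now is None else now
        for domain, q in self.queues.items():
            if q and now >= self.next_fetch.get(domain, 0.0):
                self.next_fetch[domain] = now + self.crawl_delay
                return q.popleft()
        return None
```

Passing an explicit `now` makes the rate limiting testable without sleeping; real crawler workers would simply call `pop()` in a loop and fetch whatever it returns.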
Document Processing
Raw HTML fetched by the crawler goes through a processing pipeline before indexing:
- HTML parsing: Extract text content, strip tags, decode entities.
- Language detection: Identify the document language so the correct stemmer is applied.
- Tokenization: Split text into individual tokens (words).
- Stemming / lemmatization: Reduce tokens to their root form so “running”, “runs”, “ran” all map to “run”.
- Stop word removal: Optionally drop high-frequency words (“the”, “a”, “is”) that carry little discriminating signal.
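A minimal sketch of this processing pipeline, with an intentionally crude suffix-stripping stemmer standing in for Porter/Snowball (the stop-word list and suffix rules here are illustrative, not a real stemmer):

```python
import re

STOP_WORDS = {"the", "a", "is", "of", "and"}   # tiny illustrative list

def process(text):
    """Tokenize, lowercase, drop stop words, and apply a naive
    suffix-stripping stem. Real systems use Porter/Snowball stemmers
    or lemmatizers, selected per detected language."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    out = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        for suffix in ("ing", "s"):            # crude rules for illustration
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out
```

The key invariant is that the exact same function runs at index time and at query time, so query tokens match indexed tokens.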
Inverted Index Structure
The inverted index maps each term to a postings list – a sorted list of entries, one per document containing the term. Each entry contains:
- doc_id – the document identifier
- term_frequency – how many times the term appears in the document
- positions – the offsets of each occurrence (needed for phrase queries)
Postings lists are sorted by doc_id in ascending order. This enables efficient intersection of two postings lists via a two-pointer merge in O(n + m) time.
Compression: Instead of storing absolute doc_ids, store delta-encoded gaps (doc_id[i] – doc_id[i-1]). Gaps are small for common terms, making variable-length encoding (VByte or Elias gamma) very effective.
TF-IDF Ranking
TF-IDF scores documents by how relevant they are to a query term:
- TF (Term Frequency): How often the term appears in the document. A document mentioning “python” 10 times is more relevant than one mentioning it once.
- IDF (Inverse Document Frequency): log(N / df), where N is the total number of documents and df is the number of documents containing the term. Rare terms get a higher IDF – “photosynthesis” is more discriminating than “the”.
- Score: TF * IDF. High for documents where a rare query term appears frequently.
For multi-term queries, sum the TF-IDF scores across all query terms.
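The full scoring rule is compact enough to write out. A minimal sketch, using raw counts for TF and log(N / df) for IDF as defined above (documents are assumed to be pre-tokenized lists):

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """Return one TF-IDF score per document: sum over query terms of
    term_frequency * log(N / df). docs is a list of token lists."""
    N = len(docs)
    counts = [Counter(d) for d in docs]
    scores = [0.0] * N
    for term in query_terms:
        df = sum(1 for c in counts if term in c)
        if df == 0:
            continue                  # term absent from corpus: no signal
        idf = math.log(N / df)
        for i, c in enumerate(counts):
            scores[i] += c[term] * idf
    return scores
```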
BM25
BM25 is a more sophisticated ranking function that improves on TF-IDF in two ways:
- TF saturation: Doubling term frequency does not double the score. The score plateaus via a saturation parameter k1 (typically 1.2-2.0). This prevents a document that repeats a term 100 times from completely dominating.
- Document length normalization: Long documents naturally contain more term occurrences. BM25 normalizes by document length relative to the average document length in the corpus, controlled by parameter b (typically 0.75).
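Both improvements show up directly in the scoring formula. A sketch of the classic BM25 form, using the plain log(N / df) IDF from the TF-IDF section for simplicity (production implementations such as Lucene use a smoothed IDF variant):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    """BM25 score of one tokenized document against a query.
    k1 controls TF saturation; b controls length normalization."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue
        idf = math.log(N / df)
        f = tf[term]
        # Saturating TF component: grows sub-linearly in f, and is
        # discounted when the doc is longer than average.
        norm = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score
```

Note the saturation effect in the test below: a document repeating a term 100 times scores higher than a single mention, but nowhere near 100x higher.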
BM25 is the default ranking function in Elasticsearch, Apache Solr, and most modern search engines. It consistently outperforms plain TF-IDF in benchmark evaluations.
Query Processing
When a user submits a query, the query processor runs this pipeline:
- Tokenize the query using the same tokenizer used at index time.
- Look up postings lists for each query token.
- Intersect postings lists (for AND queries) using a merge algorithm – walk two sorted lists with two pointers, output only matching doc_ids. Process shortest list first to prune early.
- Score each candidate document using BM25 or TF-IDF.
- Return top-K results using a min-heap of size K.
For OR queries, take the union of postings lists. For phrase queries, use position information to verify terms appear adjacently.
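Steps 3 and 5 of the pipeline above can be sketched directly, assuming postings are sorted doc_id lists and scored results are (doc_id, score) pairs:

```python
import heapq

def intersect(p1, p2):
    """Two-pointer merge of two sorted postings lists in O(n + m)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def top_k(scored, k):
    """Select the k highest-scoring docs with a min-heap of size k:
    the heap root is the weakest result kept so far."""
    heap = []
    for doc_id, score in scored:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)   # best first
```

The size-k heap keeps top-K selection at O(n log k) rather than sorting all n candidates.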
Index Update Strategies
The index must stay fresh as new pages are crawled:
- Batch rebuild: Rebuild the entire index from scratch nightly. Simple but adds up to 24 hours of freshness lag. Works for smaller corpora.
- Incremental / delta index: Maintain a small “delta index” containing only recent changes. Queries search both the main index and delta index and merge results. Periodically merge the delta index into the main index. Used by most large search engines.
- Near-real-time: Systems like Elasticsearch use segment-based storage where new documents are written to small in-memory segments, committed to disk frequently (every second), and merged in the background.
PageRank
PageRank is an iterative algorithm that computes the importance of each page based on the link structure of the web. A page is important if many important pages link to it. The score is computed by repeatedly propagating scores along links until convergence. PageRank is pre-computed offline and used as a prior for ranking – combined with BM25 relevance score, not used alone.
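The iterative propagation can be sketched as power iteration over an adjacency dict. This minimal version ignores dangling-node mass redistribution for brevity (a real implementation spreads the rank of pages with no out-links uniformly):

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank. links maps each node to its list of
    out-links; d is the damping factor. Returns node -> rank."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Every node keeps the (1 - d) teleport mass...
        new = {u: (1 - d) / n for u in nodes}
        # ...and each page splits d * rank equally among its out-links.
        for u, outs in links.items():
            if not outs:
                continue
            share = d * rank[u] / len(outs)
            for v in outs:
                new[v] += share
        rank = new
    return rank
```

In the test graph, "a" is linked to by both other pages and ends up ranked highest, matching the intuition that important pages are those many pages point at.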
Scale
Google indexes over 100 billion pages. To handle this scale:
- Sharding: The index is sharded by URL hash across thousands of machines. Each shard holds a partial inverted index covering a subset of the documents.
- Query fan-out: A query is broadcast to all shards. Each shard returns its local top-K results. A root server merges the results across shards and returns the global top-K.
- Replication: Each shard is replicated 3x for fault tolerance and read throughput.
- Caching: Results for popular queries are cached at a query cache layer with TTL of minutes to hours.
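The root server's merge step is small. A sketch assuming each shard returns its local top-K as (score, doc_id) pairs:

```python
import heapq
from itertools import chain

def global_top_k(shard_results, k):
    """Root-server merge for query fan-out: combine per-shard top-k
    result lists and keep the k highest-scoring overall."""
    return heapq.nlargest(k, chain.from_iterable(shard_results))
```

Because each shard has already truncated to its local top-K, the root server merges at most shards * K entries regardless of corpus size.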
Other large-scale examples: LinkedIn search indexes hundreds of millions of profiles, and Twitter/X search indexes billions of tweets.