What Is a Web Crawler?
A web crawler (spider) systematically browses the web, downloading pages and extracting links to discover new pages. Google uses Googlebot to crawl and index the web. A production crawler must handle billions of pages, respect website politeness rules, avoid duplicate content, and prioritize important pages. This is a classic system design question testing distributed systems knowledge, queue design, and politeness policy.
High-Level Architecture
- URL Frontier: priority queue of URLs to crawl, organized by priority and domain
- Fetcher: downloads web pages from URLs in the frontier
- Parser: extracts links and content from downloaded HTML
- Deduplicator: filters already-seen URLs and duplicate content
- URL Filter: removes URLs matching exclusion patterns (robots.txt, file extensions)
- Content Store: stores raw HTML and extracted text for the search index
URL Frontier Design
The URL Frontier is the most critical component. It must: (1) Prioritize high-value URLs (PageRank-like signal, freshness, update frequency). (2) Enforce politeness — do not send too many requests to one domain too fast. (3) Support massive scale — billions of URLs in the queue.
Two-queue architecture: a priority queue (front queue) orders URLs by priority score. A politeness queue (back queue) has one sub-queue per domain; each domain is rate-limited to at most one request per second (a 1-second minimum gap between requests). The crawler picks the domain whose last-fetch time is oldest, pops the top URL from that domain's queue, and fetches it. This ensures no single domain is hammered while throughput stays high across many domains.
# Simplified URL scheduler (politeness side; priority scoring omitted)
import time
from collections import deque
from urllib.parse import urlparse

class URLFrontier:
    def __init__(self):
        self.domain_queues = {}     # domain -> deque of URLs (back queues)
        self.last_fetch = {}        # domain -> timestamp of last request
        self.min_crawl_delay = 1.0  # seconds between requests per domain

    def add_url(self, url):
        domain = urlparse(url).netloc
        self.domain_queues.setdefault(domain, deque()).append(url)

    def next_url(self):
        now = time.time()
        # Return a URL from any domain whose crawl delay has elapsed.
        for domain, queue in self.domain_queues.items():
            if queue and now - self.last_fetch.get(domain, 0) >= self.min_crawl_delay:
                self.last_fetch[domain] = now
                return queue.popleft()
        return None  # every pending domain is still inside its delay window
robots.txt and Politeness
Every crawler must respect robots.txt, a standard file at the domain root that specifies which paths are off-limits to crawlers. Fetch and cache robots.txt per domain (typically with a 24-hour TTL) before crawling any pages from that domain, and re-fetch it periodically. Respect Crawl-delay directives (minimum seconds between requests) and honor Disallow rules — crawling disallowed paths can get your IP blocked or violate a site's terms of service. A crawler that ignores robots.txt will quickly find its IPs blocked by site operators and CDN providers.
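Python's standard library ships a robots.txt parser that understands Disallow and Crawl-delay. A minimal sketch — the rules, domain, and user-agent name below are illustrative, not from any real site:

```python
from urllib import robotparser

# Illustrative robots.txt rules; in production these come from the
# domain's /robots.txt, fetched once per domain and cached with a TTL.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

print(rp.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(rp.crawl_delay("MyCrawler"))  # 2 -- wait at least 2s between requests
```

In a real crawler, `RobotFileParser.set_url(...)` plus `read()` fetches the live file; `parse()` is shown here so the example is self-contained.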
Deduplication
Two types of deduplication: (1) URL deduplication: avoid crawling the same URL twice. Use a distributed hash set (Redis SET) — store the SHA-256 hash of the normalized URL. Before enqueuing a URL, check if its hash is in the set. If yes, discard it. URL normalization is critical: http://example.com, http://example.com/, and http://example.com/index.html may all be the same page. (2) Content deduplication: some sites serve the same content at multiple URLs. Compute a SimHash of the page content — SimHash is a locality-sensitive hash where similar documents have similar hash values. If a new page has SimHash within a small Hamming distance of an already-indexed page, it is a near-duplicate and can be skipped or merged.
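A minimal sketch of URL normalization plus hash-set dedup, using a local Python set as a stand-in for a shared Redis SET. The specific normalization rules (default-port stripping, index.html collapsing, query-parameter sorting) are illustrative choices; production normalizers handle many more cases:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    if netloc.endswith(":80"):          # drop the default HTTP port
        netloc = netloc[:-3]
    path = parts.path or "/"
    if path.endswith("/index.html"):    # collapse directory-index URLs
        path = path[: -len("index.html")]
    query = urlencode(sorted(parse_qsl(parts.query)))  # canonical param order
    return urlunsplit((parts.scheme.lower(), netloc, path, query, ""))

seen = set()  # stand-in for a distributed hash set (e.g. a Redis SET)

def should_crawl(url):
    digest = hashlib.sha256(normalize(url).encode()).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

# The equivalent spellings from the text normalize to the same URL:
print(normalize("http://example.com") == normalize("http://example.com/index.html"))  # True
```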
Handling JavaScript-Rendered Pages
Many modern sites use JavaScript frameworks (React, Angular, Vue) that render content client-side, so a basic HTTP fetch returns only the initial HTML skeleton with no actual content. Solutions: (1) Headless browser rendering: run Chrome in headless mode (via Puppeteer or Selenium) to execute JavaScript and capture the fully rendered DOM. This is expensive — roughly 10x slower and 10x more compute than a raw HTTP fetch — so reserve it for high-priority pages. (2) Server-side rendering (SSR): many frameworks can pre-render on the server; detect SSR support and use it. (3) Structured-data fallback: if the page provides JSON-LD or OpenGraph metadata, extract structured content without rendering at all. Google has described a similar two-wave approach: raw HTML is indexed first, and JavaScript-heavy pages are rendered later by a separate rendering cluster.
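The structured-data fallback can be sketched with the standard library alone: scan the raw HTML for application/ld+json script blocks and parse them, no rendering required. The sample page below is an illustrative assumption:

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects <script type="application/ld+json"> blocks from raw HTML."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True

    def handle_data(self, data):
        if self.in_jsonld:
            self.buf.append(data)  # buffer until the closing tag

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.blocks.append(json.loads("".join(self.buf)))
            self.buf = []
            self.in_jsonld = False

# Illustrative page: the body is an empty skeleton, but the JSON-LD
# block still carries extractable content.
html = ('<html><head><script type="application/ld+json">'
        '{"@type": "Article", "headline": "Hello"}'
        '</script></head><body></body></html>')
parser = JSONLDExtractor()
parser.feed(html)
print(parser.blocks[0]["headline"])  # Hello
```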
Scale Estimation
Web scale: 5 billion pages, recrawled every 30 days on average. Pages per second: 5B / (30 * 86400) = ~1930 pages/second. Average page size: 500KB. Bandwidth: 1930 * 500KB = ~1GB/s of download. Storage for raw HTML: 5B * 500KB = 2.5 petabytes. With 100 crawler machines each doing 20 RPS: 2000 RPS total. Each machine needs 1Gbps+ network and SSDs for local URL-queue caching. A DNS resolver pool is also needed: ~2000 lookups/second against public resolvers adds latency and risks throttling, so use a local DNS cache or a dedicated resolver cluster.
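The arithmetic above, computed explicitly (decimal units, matching the figures in the text):

```python
# Back-of-envelope scale estimation for the crawler.
PAGES = 5e9          # pages in the crawl set
RECRAWL_DAYS = 30    # average recrawl interval
AVG_PAGE_KB = 500    # average page size, decimal KB (1000 bytes)

pages_per_sec = PAGES / (RECRAWL_DAYS * 86400)
bandwidth_gb_s = pages_per_sec * AVG_PAGE_KB * 1000 / 1e9
storage_pb = PAGES * AVG_PAGE_KB * 1000 / 1e15

print(round(pages_per_sec))      # 1929 pages/second
print(round(bandwidth_gb_s, 2))  # 0.96 GB/s, i.e. roughly 1 GB/s
print(storage_pb)                # 2.5 PB of raw HTML
```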
Interview Tips
- robots.txt and politeness are often overlooked — mention them early to stand out
- The URL Frontier two-queue design (priority + per-domain queues) is the key insight
- URL deduplication: Bloom filter for memory efficiency (probabilistic, fast) or Redis SET for exact matching
- Content deduplication: SimHash — interviewers rarely hear this term and it impresses
- JavaScript rendering: mention headless Chrome and the cost tradeoff
Frequently Asked Questions
How does a web crawler handle politeness and avoid overloading websites?
Politeness is enforced at two levels: (1) robots.txt compliance: fetch and cache the robots.txt file from each domain before crawling any pages. Respect Disallow directives (blocked paths) and Crawl-delay (minimum seconds between requests). Cache robots.txt for 24 hours and re-fetch periodically. (2) Per-domain rate limiting: use a back-queue architecture where each domain has its own URL queue. The scheduler selects domains round-robin, only dispatching a new request to a domain after the configured crawl delay has elapsed (typically 1 second minimum). This prevents hammering a single server regardless of how many URLs from that domain are in the frontier. Large-scale crawlers like Googlebot also identify themselves via User-Agent header and maintain published crawl policies, so website operators can configure crawl rates.
How do you avoid crawling the same page twice in a web crawler?
URL deduplication uses a hash set: normalize the URL (lowercase, remove default ports, resolve relative paths, sort query parameters), compute its SHA-256 hash, and check if it is already in a Redis SET before enqueuing. If present, discard the URL. At web scale (trillions of URLs), a Bloom filter provides memory-efficient probabilistic deduplication: a 10-billion-element Bloom filter at 1% false positive rate uses about 12GB of memory vs 400GB for an exact hash set. False positives (occasionally skipping a URL that was not actually crawled) are acceptable — the page will be picked up in a future crawl cycle. Content deduplication catches pages that have different URLs but identical content: compute SimHash of the page text and compare Hamming distance to known hashes. Near-duplicates within 3 bits are filtered.
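The Bloom filter sizing above can be checked with the standard formula: m = -n * ln(p) / (ln 2)^2 bits and k = (m/n) * ln 2 hash functions. This sketch reproduces the ~12 GB figure (about 11.2 GiB in binary units):

```python
import math

def bloom_parameters(n, p):
    """Optimal bit count m and hash count k for n elements at false-positive rate p."""
    m_bits = -n * math.log(p) / math.log(2) ** 2
    k = m_bits / n * math.log(2)
    return m_bits, k

m_bits, k = bloom_parameters(10e9, 0.01)  # 10 billion URLs, 1% false positives
print(round(m_bits / 8 / 1024**3, 1), "GiB")  # 11.2 GiB -- the ~12 GB figure
print(round(k), "hash functions")             # 7
```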
How does a web crawler prioritize which pages to crawl first?
URL priority is a function of several signals: (1) PageRank or link count: pages with more inbound links from authoritative sites are crawled first. (2) Content freshness: pages that change frequently (news sites, stock prices) should be recrawled more often than static documentation pages. Use the Last-Modified header and ETag to detect changes without re-downloading the full page. (3) Domain authority: crawl all pages from high-authority domains (Wikipedia, major news sites) before deep-crawling low-authority blogs. (4) Discovery order: newly discovered URLs from high-priority pages inherit some priority. The URL Frontier implements this as a two-level queue: a priority queue orders URLs by score, which feeds into per-domain politeness queues that enforce rate limits. Scores are updated continuously as new link data and freshness signals are collected.
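As an illustration, a scheduler might combine these signals into a single score. The weights (0.5/0.3/0.2) and signal transforms below are illustrative assumptions, not a published formula:

```python
# Hypothetical priority score over the three signals discussed above.
def priority_score(inbound_links, days_since_change, domain_authority):
    link_signal = min(inbound_links / 1000, 1.0)  # cap link influence at 1000 links
    freshness = 1.0 / (1.0 + days_since_change)   # recently changed pages score higher
    return 0.5 * link_signal + 0.3 * freshness + 0.2 * domain_authority

# A popular, frequently updated page on an authoritative domain should
# outrank a stale, rarely linked page.
hot = priority_score(inbound_links=5000, days_since_change=1, domain_authority=0.9)
stale = priority_score(inbound_links=10, days_since_change=365, domain_authority=0.2)
print(hot > stale)  # True
```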