System Design Interview: Web Crawler (Google Search Indexing)

What Is a Web Crawler?

A web crawler (spider) systematically browses the web, downloading pages and extracting links to discover new pages. Google uses Googlebot to crawl and index the web. A production crawler must handle billions of pages, respect website politeness rules, avoid duplicate content, and prioritize important pages. This is a classic system design question testing distributed systems knowledge, queue design, and politeness policy.

High-Level Architecture

  • URL Frontier: priority queue of URLs to crawl, organized by priority and domain
  • Fetcher: downloads web pages from URLs in the frontier
  • Parser: extracts links and content from downloaded HTML
  • Deduplicator: filters already-seen URLs and duplicate content
  • URL Filter: removes URLs matching exclusion patterns (robots.txt, file extensions)
  • Content Store: stores raw HTML and extracted text for the search index

URL Frontier Design

The URL Frontier is the most critical component. It must: (1) Prioritize high-value URLs (PageRank-like signal, freshness, update frequency). (2) Enforce politeness — do not send too many requests to one domain too fast. (3) Support massive scale — billions of URLs in the queue.

Two-queue architecture: a priority queue (front queue) sorts URLs by priority score. A politeness queue (back queue) has one sub-queue per domain; each domain is rate-limited to one request per second minimum. The crawler picks the domain whose last-fetch time is oldest, pops the top URL from that domain queue, and fetches it. This ensures no single domain is hammered while still maximizing throughput across many domains.


# Simplified URL scheduler
import time
from collections import deque

class URLFrontier:
    def __init__(self):
        self.priority_queue = []       # (priority_score, url)
        self.domain_queues = {}        # domain -> deque of urls
        self.last_fetch = {}           # domain -> timestamp of last fetch
        self.min_crawl_delay = 1.0     # seconds between requests per domain

    def add_url(self, domain, url):
        self.domain_queues.setdefault(domain, deque()).append(url)

    def next_url(self):
        # Return a URL from any domain whose crawl delay has elapsed.
        now = time.time()
        for domain, queue in self.domain_queues.items():
            if queue and now - self.last_fetch.get(domain, 0) >= self.min_crawl_delay:
                self.last_fetch[domain] = now
                return queue.popleft()
        return None  # every non-empty domain is still within its delay window

robots.txt and Politeness

Every crawler must respect robots.txt, a standard file at the domain root that specifies which paths are disallowed for crawlers. Fetch and cache robots.txt per domain before crawling any pages from that domain. Respect Crawl-delay directives (minimum seconds between requests). Honor Disallow rules — crawling disallowed paths can get your IP blocked or violate legal terms. Cache robots.txt with a 24-hour TTL and re-fetch periodically. A crawler that ignores robots.txt is irresponsible and risks being blocked by site operators and CDN providers.
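A minimal sketch of the per-domain robots.txt cache described above, using Python's standard-library urllib.robotparser. The fetch function, cache TTL, and "MyCrawler" user agent are illustrative assumptions, not part of any real crawler's API:

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TTL = 24 * 3600  # re-fetch robots.txt after 24 hours

class RobotsCache:
    def __init__(self, fetch):
        self.fetch = fetch  # fetch(domain) -> robots.txt text (assumed helper)
        self.cache = {}     # domain -> (parser, fetched_at)

    def can_fetch(self, domain, url, user_agent="MyCrawler"):
        entry = self.cache.get(domain)
        if entry is None or time.time() - entry[1] > ROBOTS_TTL:
            parser = RobotFileParser()
            parser.parse(self.fetch(domain).splitlines())
            entry = (parser, time.time())
            self.cache[domain] = entry
        return entry[0].can_fetch(user_agent, url)
```

Injecting the fetch function keeps the cache testable without network access; in production it would issue an HTTP GET for https://domain/robots.txt.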

Deduplication

Two types of deduplication: (1) URL deduplication: avoid crawling the same URL twice. Use a distributed hash set (Redis SET) — store the SHA-256 hash of the normalized URL. Before enqueuing a URL, check if its hash is in the set. If yes, discard it. URL normalization is critical: http://example.com, http://example.com/, and http://example.com/index.html may all be the same page. (2) Content deduplication: some sites serve the same content at multiple URLs. Compute a SimHash of the page content — SimHash is a locality-sensitive hash where similar documents have similar hash values. If a new page has SimHash within a small Hamming distance of an already-indexed page, it is a near-duplicate and can be skipped or merged.
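The URL-deduplication path above can be sketched with the standard library. The specific normalization rules (default-port stripping, index.html collapsing, query-parameter sorting) are illustrative choices; the seen-set is an in-process stand-in for a Redis SET:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url):
    scheme, netloc, path, query, _ = urlsplit(url)
    netloc = netloc.lower()
    # Drop default ports so example.com:80 and example.com collide.
    if (scheme, netloc.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        netloc = netloc.rsplit(":", 1)[0]
    # Treat bare domain, "/", and "/index.html" as the same page.
    if path in ("", "/index.html"):
        path = "/"
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 collide.
    query = urlencode(sorted(parse_qsl(query)))
    return urlunsplit((scheme, netloc, path, query, ""))

def url_fingerprint(url):
    return hashlib.sha256(normalize_url(url).encode()).hexdigest()

seen = set()  # in production: a distributed Redis SET or Bloom filter

def enqueue_if_new(url):
    fp = url_fingerprint(url)
    if fp in seen:
        return False
    seen.add(fp)
    return True
```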

Handling JavaScript-Rendered Pages

Many modern sites use JavaScript frameworks (React, Angular, Vue) that render content client-side. A basic HTTP fetch gets only the initial HTML skeleton — no actual content. Solutions: (1) Headless browser rendering: run Chrome in headless mode (using Puppeteer or Selenium) to execute JavaScript and capture the fully rendered DOM. High resource cost: roughly 10x slower than a raw HTTP fetch and 10x more compute, so reserve it for high-priority pages. (2) Server-side rendering (SSR): many frameworks pre-render on the server. Detect SSR support and use the rendered output. (3) Structured data fallback: if the page provides JSON-LD or OpenGraph metadata, extract structured content even without rendering. Google's indexing historically worked in two waves: a first wave that indexed raw HTML immediately and a second wave that rendered JavaScript-heavy pages on a dedicated rendering cluster.
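The structured-data fallback can be sketched with the standard library alone: scan raw HTML for script tags of type application/ld+json and parse their bodies. This is a simplified illustration; real extractors also handle OpenGraph meta tags and malformed markup:

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects parsed JSON-LD blocks from raw (unrendered) HTML."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            try:
                self.blocks.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD; skip rather than fail the page

def extract_jsonld(html):
    extractor = JSONLDExtractor()
    extractor.feed(html)
    return extractor.blocks
```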

Scale Estimation

Web scale: 5 billion pages, recrawled every 30 days on average. Pages per second: 5B / (30 * 86400) = ~1930 pages/second. Average page size: 500KB. Bandwidth: 1930 * 500KB = ~1GB/s download. Storage for raw HTML: 5B * 500KB = 2.5 petabytes. With 100 crawler machines each doing 20 RPS: 2000 RPS total. Each machine needs 1Gbps+ network and SSDs for local URL queue caching. A DNS resolver pool is needed — up to 2000 lookups/second would overwhelm public DNS resolvers; use a local DNS cache or dedicated resolver cluster.
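The back-of-envelope numbers above, checked in code. All inputs are the article's stated assumptions (5B pages, 30-day recrawl cycle, 500KB average page, decimal units):

```python
PAGES = 5_000_000_000
RECRAWL_SECONDS = 30 * 86_400        # 30-day recrawl cycle
PAGE_SIZE = 500 * 1000               # 500 KB in bytes (decimal)

pages_per_sec = PAGES / RECRAWL_SECONDS
bandwidth = pages_per_sec * PAGE_SIZE   # sustained download, bytes/second
storage = PAGES * PAGE_SIZE             # raw HTML at rest, bytes

print(f"{pages_per_sec:.0f} pages/s")   # ~1929 pages/s
print(f"{bandwidth / 1e9:.2f} GB/s")    # ~0.96 GB/s, i.e. roughly 1 GB/s
print(f"{storage / 1e15:.2f} PB")       # 2.50 PB
```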

Interview Tips

  • robots.txt and politeness are often overlooked — mention them early to stand out
  • The URL Frontier two-queue design (priority + per-domain queues) is the key insight
  • URL deduplication: Bloom filter for memory efficiency (probabilistic, fast) or Redis SET for exact matching
  • Content deduplication: SimHash — a locality-sensitive hash for near-duplicate detection; naming it signals depth beyond the standard answer
  • JavaScript rendering: mention headless Chrome and the cost tradeoff

Frequently Asked Questions

How does a web crawler handle politeness and avoid overloading websites?

Politeness is enforced at two levels: (1) robots.txt compliance: fetch and cache the robots.txt file from each domain before crawling any pages. Respect Disallow directives (blocked paths) and Crawl-delay (minimum seconds between requests). Cache robots.txt for 24 hours and re-fetch periodically. (2) Per-domain rate limiting: use a back-queue architecture where each domain has its own URL queue. The scheduler selects domains round-robin, only dispatching a new request to a domain after the configured crawl delay has elapsed (typically 1 second minimum). This prevents hammering a single server regardless of how many URLs from that domain are in the frontier. Large-scale crawlers like Googlebot also identify themselves via User-Agent header and maintain published crawl policies, so website operators can configure crawl rates.

How do you avoid crawling the same page twice in a web crawler?

URL deduplication uses a hash set: normalize the URL (lowercase, remove default ports, resolve relative paths, sort query parameters), compute its SHA-256 hash, and check if it is already in a Redis SET before enqueuing. If present, discard the URL. At web scale (trillions of URLs), a Bloom filter provides memory-efficient probabilistic deduplication: a 10-billion-element Bloom filter at 1% false positive rate uses about 12GB of memory vs 400GB for an exact hash set. False positives (occasionally skipping a URL that was not actually crawled) are acceptable — the page will be picked up in a future crawl cycle. Content deduplication catches pages that have different URLs but identical content: compute SimHash of the page text and compare Hamming distance to known hashes. Near-duplicates within 3 bits are filtered.
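The Bloom filter sizing claim above follows from the standard formulas m = -n·ln(p) / (ln 2)² for the bit count and k = (m/n)·ln 2 for the optimal number of hash functions. A quick check with the answer's numbers (10 billion elements, 1% false positive rate):

```python
import math

n = 10_000_000_000   # 10 billion URLs
p = 0.01             # 1% false positive rate

m_bits = -n * math.log(p) / math.log(2) ** 2   # total filter size in bits
k = (m_bits / n) * math.log(2)                 # optimal hash function count

print(f"{m_bits / 8 / 1e9:.1f} GB")   # ~12.0 GB, matching the claim above
print(f"{round(k)} hash functions")   # 7
```

By contrast, an exact set of 32-byte SHA-256 digests needs 320GB before any hash-table overhead, which is where the ~400GB figure comes from.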

How does a web crawler prioritize which pages to crawl first?

URL priority is a function of several signals: (1) PageRank or link count: pages with more inbound links from authoritative sites are crawled first. (2) Content freshness: pages that change frequently (news sites, stock prices) should be recrawled more often than static documentation pages. Use the Last-Modified header and ETag to detect changes without re-downloading the full page. (3) Domain authority: crawl all pages from high-authority domains (Wikipedia, major news sites) before deep-crawling low-authority blogs. (4) Discovery order: newly discovered URLs from high-priority pages inherit some priority. The URL Frontier implements this as a two-level queue: a priority queue orders URLs by score, which feeds into per-domain politeness queues that enforce rate limits. Scores are updated continuously as new link data and freshness signals are collected.
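The scoring described above can be sketched as a weighted combination of the four signals. The weights and the [0, 1] normalization are illustrative assumptions, not a published formula:

```python
def priority_score(pagerank, change_freq, domain_authority, parent_priority):
    """All signals normalized to [0, 1]; higher score = crawl sooner."""
    return (0.4 * pagerank +          # inbound-link authority
            0.3 * change_freq +       # how often the page changes
            0.2 * domain_authority +  # site-level trust
            0.1 * parent_priority)    # inherited from the discovering page

# A fresh news article on a high-authority domain outranks
# a static page on an unknown blog:
news = priority_score(0.9, 0.8, 0.9, 0.5)
blog = priority_score(0.2, 0.0, 0.1, 0.2)
```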

