Design a Web Crawler

Design a web crawler that can index a significant portion of the internet. This question tests whether you understand distributed systems, politeness constraints, and how to reason about scale — not just “spin up some scrapers.”

Requirements Clarification

Before drawing any boxes, nail down the scope:

  • Scale: How many pages per month? Assume 1 billion pages indexed, 10 billion total URLs known.
  • Freshness: How often should pages be re-crawled? Breaking news sites: every hour. Static blogs: every week. Use a freshness policy.
  • Scope: All content types (HTML, PDF, images) or HTML only? Start with HTML.
  • Storage: Store raw HTML, parsed text, or both? Both — raw for reprocessing, parsed for indexing.
  • Politeness: Must respect robots.txt and crawl-delay directives. Non-negotiable legally and ethically.

Back-of-Envelope Estimates

  • 1 billion pages/month → ~400 pages/second sustained
  • Average page: 500KB raw HTML (~100KB compressed) → 400 pages/sec × 500KB ≈ 200MB/sec inbound → ~6PB/year raw
  • After dedup and compression: ~1PB stored per year
  • URL set: 10 billion URLs × 100 bytes = 1TB for the URL frontier
  • DNS lookups: ~400/sec peak before caching; with aggressive per-domain caching, only a small fraction reach the resolvers

This tells you: you need multiple fetcher machines, a distributed URL frontier, and object storage — not a single server running wget.
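These estimates are easy to sanity-check in a few lines (the numbers are just the assumptions above, nothing more):

```python
# Sanity-check the back-of-envelope numbers above.
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 24 * 3600           # ~2.6M seconds

pages_per_sec = pages_per_month / seconds_per_month
print(f"{pages_per_sec:.0f} pages/sec")      # ~386, round to ~400

raw_page_bytes = 500 * 1024                  # 500KB raw HTML per page
ingest_bytes_per_sec = pages_per_sec * raw_page_bytes
print(f"{ingest_bytes_per_sec / 1e6:.0f} MB/sec")    # ~198 MB/sec

raw_per_year_pb = ingest_bytes_per_sec * 365 * 24 * 3600 / 1e15
print(f"{raw_per_year_pb:.1f} PB/year raw")  # ~6.2 PB/year

frontier_bytes = 10_000_000_000 * 100        # 10B URLs x 100 bytes
print(f"{frontier_bytes / 1e12:.0f} TB for the frontier")  # 1 TB
```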

High-Level Architecture

The core pipeline has five stages:

URL Frontier → DNS Resolver Pool → Fetcher Pool → HTML Parser → Content Store
                                        ↕              ↓
                              Robots.txt Checker   Link Extractor → Dedup Filter → URL Frontier

(The fetcher consults the robots.txt checker before every request; the parser feeds extracted links back into the frontier through the dedup filter.)

Each stage is a pool of stateless workers reading from and writing to queues. Nothing shares in-process state.

Deep Dive: Each Component

URL Frontier

The frontier is a priority queue of URLs to crawl. The naive implementation — a single Redis sorted set — breaks at 10 billion URLs. The production approach:

  • Front queues: One per priority tier (high/medium/low). URLs go to a tier based on domain authority or freshness policy.
  • Back queues: One per host. A selector maps each host to exactly one back queue, ensuring politeness — you never issue two requests to the same host simultaneously.
  • Politeness metadata: Each host has a minimum crawl-delay (from robots.txt, or default 1s). The frontier respects it using a “next fetch time” field per host.

At Google’s scale, the frontier is a distributed system in itself — think a sharded key-value store where the key is the URL hash and the value is priority + next-fetch-time.
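A single-node sketch of the back-queue mechanics (one FIFO per host plus a heap of per-host next-fetch times) might look like the following; priority tiers are omitted for brevity, and the class and method names are illustrative:

```python
import heapq
import time
from collections import defaultdict, deque

class Frontier:
    """Single-node sketch of the back-queue design: one FIFO per host,
    plus a heap of (next_fetch_time, host) enforcing crawl-delay."""
    def __init__(self, default_delay=1.0):
        self.back_queues = defaultdict(deque)  # one queue per host
        self.ready = []                        # heap of (next_fetch_time, host)
        self.delay = {}                        # per-host crawl-delay overrides
        self.default_delay = default_delay

    def add(self, host, url, now=None):
        now = time.time() if now is None else now
        if not self.back_queues[host]:
            # Host not currently scheduled: make it fetchable immediately.
            heapq.heappush(self.ready, (now, host))
        self.back_queues[host].append(url)

    def next_url(self, now=None):
        """Return a URL whose host's crawl-delay has elapsed, else None."""
        now = time.time() if now is None else now
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.back_queues[host].popleft()
        if self.back_queues[host]:
            # Reschedule the host only after its crawl-delay passes.
            delay = self.delay.get(host, self.default_delay)
            heapq.heappush(self.ready, (now + delay, host))
        return url
```

Because each host appears at most once in the heap, two requests can never be in flight to the same host at the same time, which is exactly the politeness invariant described above.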

DNS Resolver Pool

DNS is a hidden bottleneck. If every fetcher does its own uncached lookup, you’ll saturate your ISP’s resolver. Give each fetcher an in-process DNS cache backed by a shared resolver pool, honoring TTLs (typically 5–30 minutes). At 400 fetches/sec against ~10K active unique domains, the cache hit rate will be very high.
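A minimal sketch of such a cache, with the underlying resolver injected as a function so the TTL logic stands alone (in production `resolve_fn` would wrap `socket.getaddrinfo` or the shared resolver pool; the names here are illustrative):

```python
import time

class CachingResolver:
    """TTL cache in front of a resolver function."""
    def __init__(self, resolve_fn, ttl=300.0):
        self.resolve_fn = resolve_fn   # e.g. wraps socket.getaddrinfo
        self.ttl = ttl                 # seconds to trust a cached answer
        self.cache = {}                # host -> (expires_at, ip)
        self.hits = self.misses = 0

    def resolve(self, host, now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(host)
        if entry and entry[0] > now:   # cached and not yet expired
            self.hits += 1
            return entry[1]
        self.misses += 1
        ip = self.resolve_fn(host)     # fall through to the real resolver
        self.cache[host] = (now + self.ttl, ip)
        return ip
```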

Fetcher Pool

Fetchers are async HTTP clients (Python aiohttp, Go net/http). Each fetcher:

  1. Checks robots.txt (cached in Redis, TTL 24h)
  2. Issues GET with a legitimate User-Agent string
  3. Follows redirects (max 5 hops)
  4. Enforces timeout (5s connect, 30s read)
  5. Respects crawl-delay from robots.txt
  6. Writes raw HTML + metadata to object storage
  7. Pushes URL + content hash to parser queue

Politeness is enforced per-domain, not per-IP. If a domain has 1,000 pages, you still wait the crawl-delay between each request to that domain.
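The robots.txt check in step 1 can lean on the standard library. A sketch using `urllib.robotparser`, with an inline sample file standing in for the cached copy (the `ExampleBot` user agent and the sample rules are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt body; in production this comes from the Redis
# cache described above, not a fresh HTTP request per URL.
robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

def may_fetch(url, user_agent="ExampleBot"):
    """Return (allowed, crawl_delay_seconds) for a candidate URL."""
    return rp.can_fetch(user_agent, url), rp.crawl_delay(user_agent)

print(may_fetch("https://example.com/page.html"))   # (True, 10)
print(may_fetch("https://example.com/private/x"))   # (False, 10)
```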

HTML Parser

Parser workers pull from the queue, extract:

  • All <a href> links → normalized absolute URLs → dedup filter
  • Text content → parsed document → content store
  • Canonical tag if present → override the crawled URL for dedup
  • Meta robots tag (noindex/nofollow) → skip or don’t follow links

JavaScript-heavy pages are a special case. A headless Chrome pool (Puppeteer/Playwright) handles them — but it’s 10–100× more expensive per page. Only use it for high-value domains.
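For plain HTML, link extraction and normalization can be sketched with the standard library alone; a production parser would also honor `<base>`, `rel="nofollow"`, and the canonical/meta-robots handling above:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collect <a href> links as normalized absolute URLs."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href or href.startswith("#"):   # skip same-page anchors
            return
        # Resolve relative URLs against the page URL, drop fragments.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")):
            self.links.append(absolute)

extractor = LinkExtractor("https://example.com/dir/page.html")
extractor.feed('<a href="../about.html">About</a> <a href="#top">Top</a>'
               ' <a href="https://other.org/x">X</a>')
print(extractor.links)
# ['https://example.com/about.html', 'https://other.org/x']
```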

Duplicate Detection

Two types of dedup run in series:

URL dedup: Before fetching, check if the URL was already crawled. Use a distributed hash set (Redis Cluster or HBase). 10B URLs × 8 bytes per hash = 80GB, fits in memory.
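A single-node stand-in for that hash set, storing 8-byte SHA-256 prefixes to match the 80GB estimate (the distributed version shards the same keys across Redis Cluster or HBase; a Bloom filter is the alternative if you can tolerate false positives, i.e. occasionally skipping a new page):

```python
import hashlib

class UrlSeenSet:
    """URL dedup sketch: 8-byte hash prefix per URL (10B x 8B = 80GB).
    Note: a Python set adds heavy per-entry overhead on top of the 8
    bytes; this only illustrates the keying scheme, not the footprint."""
    def __init__(self):
        self.seen = set()

    @staticmethod
    def _key(url):
        # Truncated SHA-256: collisions are negligible at 10B keys.
        return hashlib.sha256(url.encode()).digest()[:8]

    def add_if_new(self, url):
        """Return True if this URL has not been seen before."""
        key = self._key(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True
```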

Content dedup: After fetching, check if the content is a near-duplicate. SimHash: reduce a document to a 64-bit fingerprint based on weighted word shingles. Two documents within Hamming distance 3 are considered duplicates. Google’s Simhash paper (2007) showed this works at web scale.
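A simplified, unweighted SimHash sketch: word 3-shingles hashed to 64 bits, with one sign vote per bit position (the paper's version weights shingles by frequency, and indexes fingerprints in permuted tables rather than comparing pairwise):

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash over word 3-shingles (unweighted sketch)."""
    words = text.lower().split()
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    vector = [0] * bits
    for shingle in shingles:
        h = int.from_bytes(hashlib.md5(shingle.encode()).digest()[:8], "big")
        for bit in range(bits):
            # Each shingle votes +1/-1 on every bit position.
            vector[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if vector[bit] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

def near_duplicate(a, b, threshold=3):
    """Documents within Hamming distance 3 count as duplicates."""
    return hamming(simhash(a), simhash(b)) <= threshold
```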

Crawl Priority and Freshness

Not all pages are equal. Prioritize by:

  • Domain authority: PageRank-like score from link graph. High-authority domains crawled more frequently.
  • Change frequency: Track how often content changes per domain. Weather sites: every hour. Archive pages: never.
  • User demand: If users search for a topic, crawl related pages more aggressively.

Googlebot’s crawl frequency for a page: roughly proportional to how often the page changes × how important it is. Pages that never change are re-crawled monthly at most.
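One common way to operationalize this is multiplicative backoff on the recrawl interval. The policy below is an illustrative assumption, not Googlebot's actual formula:

```python
def next_crawl_interval(changed, current_interval_hours,
                        min_hours=1, max_hours=24 * 30):
    """Adaptive recrawl sketch (assumed policy): halve the interval
    when the page changed since the last fetch, back off by 1.5x when
    it did not, clamped between hourly and monthly."""
    if changed:
        interval = current_interval_hours / 2
    else:
        interval = current_interval_hours * 1.5
    return max(min_hours, min(max_hours, interval))
```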

Storage Design

Object Store (S3/GCS):
  raw_html/{sha256_hash} → compressed raw HTML + headers

Metadata DB (Bigtable / HBase):
  row_key = url_sha256
  columns: url, last_crawled, next_crawl, content_hash, status_code, canonical_url

Parsed Index:
  Passed to indexing pipeline (Elasticsearch / custom inverted index)

Trade-offs and Follow-up Questions

BFS vs DFS: BFS is almost always right for web crawling. DFS can get stuck in infinite link structures (calendars, session-parametrized URLs). BFS naturally discovers high-authority pages first.

URL traps: Calendars, infinitely paginated lists, session IDs in URLs. Mitigate with URL normalization rules (strip session parameters) and max-depth limits per domain.
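A normalization sketch that collapses trap variants into one frontier entry; the session-parameter blocklist is an assumption to extend per deployment:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed blocklist of session/tracking parameters; extend as needed.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid",
                  "utm_source", "utm_medium", "utm_campaign"}

def normalize(url):
    """Canonicalize a URL: lowercase scheme and host, drop the
    fragment and session parameters, sort remaining query params."""
    parts = urlsplit(url)
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in SESSION_PARAMS
    )
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))
```

With this in place, `?page=2&SESSIONID=abc123` and `?page=2` map to the same frontier key, so session-parametrized traps stop generating unbounded URL variants.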

Distributed coordinator: Who decides which fetcher claims which URL? Use consistent hashing on the domain name — all URLs from the same domain go to the same fetcher group, enforcing politeness without coordination.
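The domain-to-fetcher mapping can be sketched as a consistent-hash ring with virtual nodes, so adding or removing a fetcher group reassigns only a small slice of domains (class and node names are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map each domain to one fetcher group via consistent hashing."""
    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` points on the ring to smooth load.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def node_for(self, domain):
        # First ring point clockwise of the domain's hash (wrapping).
        idx = bisect.bisect(self.keys, self._hash(domain)) % len(self.keys)
        return self.ring[idx][1]
```

Because every URL from `example.com` hashes to the same fetcher group, that group can enforce crawl-delay locally, with no cross-node locking.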

Legal considerations: Respect robots.txt. Follow ToS for major sites. Never crawl behind login walls without permission.

Interview Follow-ups

  • How do you handle JavaScript-rendered content at scale without 100× cost increase?
  • Your crawler gets IP-banned by a major news site. How do you handle this gracefully?
  • How do you detect and avoid crawler traps (infinite loops in link structure)?
  • How would you implement incremental re-crawling — only re-fetching pages that changed?
  • How does your architecture change if you need real-time indexing (page to searchable in 30 seconds)?

Related System Design Topics

  • Consistent Hashing — distributing the URL frontier across crawler nodes
  • Message Queues — decoupling fetcher output from parser input with async queues
  • Database Sharding — partitioning the crawled URL store by domain hash
  • Caching Strategies — caching robots.txt, DNS lookups, and content hashes
  • CAP Theorem — eventual consistency in the URL frontier: availability over strong consistency
