Design a web crawler that can index a significant portion of the internet. This question tests whether you understand distributed systems, politeness constraints, and how to reason about scale — not just “spin up some scrapers.”
Requirements Clarification
Before drawing any boxes, nail down the scope:
- Scale: How many pages per month? Assume 1 billion pages indexed, 10 billion total URLs known.
- Freshness: How often should pages be re-crawled? Breaking news sites: every hour. Static blogs: every week. Use a freshness policy.
- Scope: All content types (HTML, PDF, images) or HTML only? Start with HTML.
- Storage: Store raw HTML, parsed text, or both? Both — raw for reprocessing, parsed for indexing.
- Politeness: Must respect robots.txt and crawl-delay directives. Non-negotiable legally and ethically.
Back-of-Envelope Estimates
- 1 billion pages/month → ~400 pages/second sustained
- Average page: 100KB compressed (500KB raw HTML) → ~200MB/sec raw ingest → ~6PB/year raw
- After dedup and compression: ~1PB stored per year
- URL set: 10 billion URLs × 100 bytes = 1TB for the URL frontier
- DNS lookups: ~400/sec at peak, but only one per unique domain and cached aggressively, so far fewer reach the resolver
This tells you: you need multiple fetcher machines, a distributed URL frontier, and object storage — not a single server running wget.
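The arithmetic behind those estimates is worth checking in a few lines. This sketch uses exactly the assumed figures stated above (1B pages/month, 500KB raw per page, 10B known URLs at 100 bytes each):

```python
# Sanity-check the back-of-envelope estimates (all inputs are assumed figures).
SECONDS_PER_MONTH = 30 * 24 * 3600                  # ~2.6M seconds

pages_per_month = 1_000_000_000
pages_per_sec = pages_per_month / SECONDS_PER_MONTH # ~385, rounds to ~400

raw_page_bytes = 500 * 1024                         # 500KB raw HTML
raw_mb_per_sec = pages_per_sec * raw_page_bytes / 1e6           # ~200 MB/s
raw_pb_per_year = pages_per_month * 12 * raw_page_bytes / 1e15  # ~6 PB/year

frontier_tb = 10_000_000_000 * 100 / 1e12           # 10B URLs x 100B ≈ 1 TB
```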
High-Level Architecture
The core pipeline has five stages:
```
URL Frontier → DNS Resolver Pool → Fetcher Pool → HTML Parser → Content Store
                                        ↓              ↓
                               Robots.txt Checker   Link Extractor → Dedup Filter → URL Frontier
```
Each stage is a pool of stateless workers reading from and writing to queues. Nothing shares in-process state.
Deep Dive: Each Component
URL Frontier
The frontier is a priority queue of URLs to crawl. The naive implementation — a single Redis sorted set — breaks at 10 billion URLs. The production approach:
- Front queues: One per priority tier (high/medium/low). URLs go to a tier based on domain authority or freshness policy.
- Back queues: One per host. A selector maps each host to exactly one back queue, ensuring politeness — you never issue two requests to the same host simultaneously.
- Politeness metadata: Each host has a minimum crawl-delay (from robots.txt, or default 1s). The frontier respects it using a “next fetch time” field per host.
At Google’s scale, the frontier is a distributed system in itself — think a sharded key-value store where the key is the URL hash and the value is priority + next-fetch-time.
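A minimal sketch of the back-queue mechanism: one FIFO per host plus a min-heap keyed by each host's next allowed fetch time. Class and field names are illustrative, and a 1s default crawl-delay is assumed when robots.txt gives none; a production frontier would be sharded and persistent.

```python
import heapq

class PolitenessFrontier:
    """Toy back-queue sketch: one FIFO of URLs per host, plus a min-heap
    keyed by each host's next allowed fetch time. A host appears in the
    heap at most once, so no two requests to it can be in flight."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.host_queues = {}    # host -> list of pending URLs (FIFO)
        self.ready_heap = []     # (next_fetch_time, host)
        self.crawl_delay = {}    # host -> Crawl-delay from robots.txt

    def add(self, host, url, now):
        if host not in self.host_queues:
            self.host_queues[host] = []
            heapq.heappush(self.ready_heap, (now, host))
        self.host_queues[host].append(url)

    def pop(self, now):
        """Return (host, url) for a host that is due, or None."""
        while self.ready_heap:
            next_time, host = self.ready_heap[0]
            if next_time > now:
                return None                      # every host still cooling down
            heapq.heappop(self.ready_heap)
            queue = self.host_queues.get(host)
            if not queue:
                self.host_queues.pop(host, None) # host drained, forget it
                continue
            url = queue.pop(0)
            delay = self.crawl_delay.get(host, self.default_delay)
            heapq.heappush(self.ready_heap, (now + delay, host))
            return host, url
        return None
```

Because each host has exactly one heap entry, politeness falls out of the data structure: the next URL for a host cannot be handed out until its cool-down expires.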
DNS Resolver Pool
DNS is a hidden bottleneck. If every fetcher does its own DNS lookup, you’ll saturate your ISP’s resolver. Use a shared in-process DNS cache backed by a centralized resolver pool. TTL typically 5–30 minutes. At 400 fetches/sec with ~10K unique domains active, the cache hit rate will be very high.
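A toy version of the in-process cache (the 5-minute TTL and injectable resolver are illustrative; real deployments back this with a shared resolver pool):

```python
import socket, time

class DnsCache:
    """Minimal in-process DNS cache with TTL. The resolver function is
    injectable so the cache can be exercised without network access."""

    def __init__(self, ttl=300, resolver=socket.gethostbyname):
        self.ttl = ttl
        self.resolver = resolver
        self.cache = {}              # host -> (ip, expires_at)
        self.hits = self.misses = 0

    def resolve(self, host, now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(host)
        if entry and entry[1] > now:     # fresh entry: serve from cache
            self.hits += 1
            return entry[0]
        self.misses += 1                 # miss or expired: hit the resolver
        ip = self.resolver(host)
        self.cache[host] = (ip, now + self.ttl)
        return ip
```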
Fetcher Pool
Fetchers are async HTTP clients (Python aiohttp, Go net/http). Each fetcher:
- Checks robots.txt (cached in Redis, TTL 24h)
- Issues GET with a legitimate User-Agent string
- Follows redirects (max 5 hops)
- Enforces timeout (5s connect, 30s read)
- Respects crawl-delay from robots.txt
- Writes raw HTML + metadata to object storage
- Pushes URL + content hash to parser queue
Politeness is enforced per-domain, not per-IP. If a domain has 1,000 pages, you still wait the crawl-delay between each request to that domain.
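The robots.txt gate in the checklist above can be exercised without any network calls using the standard library's `urllib.robotparser`. The robots body, `ExampleBot` user agent, and 1s default delay are made up for illustration:

```python
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def make_robots(robots_body, base="https://example.com/robots.txt"):
    """Parse a robots.txt body directly; the fetcher would cache the
    parsed result (e.g. in Redis with a 24h TTL) rather than re-fetch it."""
    rp = robotparser.RobotFileParser(base)
    rp.parse(robots_body.splitlines())
    return rp

def may_fetch(rp, url, user_agent="ExampleBot"):
    """Politeness gate run before every GET: (allowed, delay_seconds)."""
    allowed = rp.can_fetch(user_agent, url)
    delay = rp.crawl_delay(user_agent) or 1.0   # assumed 1s default
    return allowed, delay
```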
HTML Parser
Parser workers pull from the queue and extract:
- All `<a href>` links → normalized absolute URLs → dedup filter
- Text content → parsed document → content store
- Canonical tag if present → override the crawled URL for dedup
- Meta robots tag (noindex/nofollow) → skip or don’t follow links
JavaScript-heavy pages are a special case. A headless Chrome pool (Puppeteer/Playwright) handles them — but it’s 10–100× more expensive per page. Only use it for high-value domains.
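The first extraction step — turning raw `<a href>` values into normalized, fragment-free absolute URLs — can be sketched with the standard library (class and function names are illustrative):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collect <a href> targets and normalize them to absolute,
    fragment-free URLs ready for the dedup filter."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL, drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                self.links.append(absolute)

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```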
Duplicate Detection
Two types of dedup run in series:
URL dedup: Before fetching, check if the URL was already crawled. Use a distributed hash set (Redis Cluster or HBase). 10B URLs × 8 bytes per hash = 80GB, fits in memory.
Content dedup: After fetching, check if the content is a near-duplicate. SimHash: reduce a document to a 64-bit fingerprint based on weighted word shingles. Two documents within Hamming distance 3 are considered duplicates. Google’s Simhash paper (2007) showed this works at web scale.
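A toy SimHash along those lines, using 3-word shingles with uniform weights (production systems weight terms, e.g. by TF-IDF, and shard fingerprints for fast Hamming lookups):

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash over 3-word shingles, uniform weights (assumed)."""
    words = text.lower().split()
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    counts = [0] * bits
    for sh in shingles:
        h = int.from_bytes(hashlib.md5(sh.encode()).digest()[:8], "big")
        for b in range(bits):
            # Each shingle votes +1/-1 per bit position.
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

def near_duplicate(a, b, threshold=3):
    """Documents within Hamming distance 3 are treated as duplicates."""
    return hamming(simhash(a), simhash(b)) <= threshold
```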
Crawl Priority and Freshness
Not all pages are equal. Prioritize by:
- Domain authority: PageRank-like score from link graph. High-authority domains crawled more frequently.
- Change frequency: Track how often content changes per domain. Weather sites: every hour. Archive pages: never.
- User demand: If users search for a topic, crawl related pages more aggressively.
Googlebot’s crawl frequency for a page: roughly proportional to how often the page changes × how important it is. Pages that never change are re-crawled monthly at most.
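That heuristic can be written down as a toy policy. The exact formula and clamp values here are assumptions for illustration, not Googlebot's actual algorithm:

```python
def recrawl_interval_hours(changes_per_day, importance,
                           min_hours=1.0, max_hours=30 * 24.0):
    """Crawl frequency proportional to change rate x importance,
    clamped so even static pages get a visit monthly at most.
    `importance` is a normalized score in (0, 1] (assumed scale)."""
    if changes_per_day <= 0:
        return max_hours                        # never changes: monthly
    interval = 24.0 / (changes_per_day * importance)
    return max(min_hours, min(max_hours, interval))
```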
Storage Design
Object Store (S3/GCS):

```
raw_html/{sha256_hash} → compressed raw HTML + headers
```

Metadata DB (Bigtable / HBase):

```
row_key = url_sha256
columns: url, last_crawled, next_crawl, content_hash, status_code, canonical_url
```

Parsed Index: handed to the indexing pipeline (Elasticsearch or a custom inverted index).
Trade-offs and Follow-up Questions
BFS vs DFS: BFS is almost always right for web crawling. DFS can get stuck in infinite link structures (calendars, session-parametrized URLs). BFS naturally discovers high-authority pages first.
URL traps: Calendars, infinitely paginated lists, session IDs in URLs. Mitigate with URL normalization rules (strip session parameters) and max-depth limits per domain.
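A sketch of those two mitigations — the session-parameter blocklist and the per-domain depth limit are assumed values; real crawlers learn trap parameters from observed URL patterns:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed blocklist of session/tracking parameters (illustrative).
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid",
                  "utm_source", "utm_medium", "utm_campaign"}
MAX_PATH_DEPTH = 10   # assumed per-domain depth limit

def normalize_url(url):
    """Strip session parameters and sort the rest, so trap variants
    of the same page collapse to one frontier entry."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path or "/", urlencode(sorted(query)), ""))

def too_deep(url):
    """Reject URLs past the depth limit (infinite calendars, pagination)."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return len(segments) > MAX_PATH_DEPTH
```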
Distributed coordinator: Who decides which fetcher claims which URL? Use consistent hashing on the domain name — all URLs from the same domain go to the same fetcher group, enforcing politeness without coordination.
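A minimal consistent-hash ring illustrating the domain → fetcher-group mapping (node names and the virtual-node count are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map each domain to a fetcher group so every URL for one host
    lands on the same workers, with no central coordinator."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []                       # sorted (hash, node) points
        for node in nodes:
            for i in range(vnodes):          # virtual nodes smooth the load
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, domain):
        """Walk clockwise to the first virtual node at or after the key."""
        i = bisect.bisect(self.keys, self._hash(domain)) % len(self.ring)
        return self.ring[i][1]
```

Adding or removing a fetcher group moves only the domains adjacent to its virtual nodes, so a rebalance does not reshuffle the whole frontier.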
Legal considerations: Respect robots.txt. Follow ToS for major sites. Never crawl behind login walls without permission.
Interview Follow-ups
- How do you handle JavaScript-rendered content at scale without 100× cost increase?
- Your crawler gets IP-banned by a major news site. How do you handle this gracefully?
- How do you detect and avoid crawler traps (infinite loops in link structure)?
- How would you implement incremental re-crawling — only re-fetching pages that changed?
- How does your architecture change if you need real-time indexing (page to searchable in 30 seconds)?
Related System Design Topics
- Consistent Hashing — distributing the URL frontier across crawler nodes
- Message Queues — decoupling fetcher output from parser input with async queues
- Database Sharding — partitioning the crawled URL store by domain hash
- Caching Strategies — caching robots.txt, DNS lookups, and content hashes
- CAP Theorem — eventual consistency in the URL frontier: availability over strong consistency