A web crawler systematically browses the internet to download web pages for indexing, archiving, or data extraction. Google, Bing, and the Internet Archive operate crawlers that process billions of pages. Designing a web crawler tests your understanding of distributed systems, politeness constraints, deduplication, and priority scheduling. This guide covers the architecture of a production-scale web crawler for system design interviews.
High-Level Architecture
Components: (1) Seed URLs — the starting point. A list of known, high-quality URLs (popular websites, sitemaps, previously crawled URLs). (2) URL Frontier — a priority queue of URLs to crawl. Manages scheduling, politeness (rate limiting per domain), and priority (important pages are crawled first). (3) Fetcher — downloads web pages via HTTP. Respects robots.txt rules, handles redirects, timeouts, and retries. Runs on multiple distributed workers. (4) Content parser — extracts text content, metadata (title, description), and outbound links from the downloaded HTML. (5) URL deduplication — checks if a URL has already been crawled or is already in the frontier. Uses a Bloom filter for memory-efficient existence checking. (6) Content deduplication — detects duplicate or near-duplicate content across different URLs (mirrors, syndicated content). Uses content fingerprinting (SimHash, MinHash). (7) Storage — stores the crawled pages for indexing. Object storage (S3) for raw HTML, a database for metadata and URL state. Workflow: seed URLs -> frontier -> fetcher -> parser -> extract links -> deduplicate -> add new URLs to frontier -> repeat.
URL Frontier and Politeness
The URL frontier is the heart of the crawler — it determines what to crawl next and at what rate. Politeness: a crawler must not overwhelm a website. robots.txt specifies which paths the crawler can access and the crawl-delay (minimum time between requests to the same domain). Even without a crawl-delay, a responsible crawler limits requests to 1 per second per domain (or less for small sites). Implementation: the frontier maintains a per-domain queue. A scheduler ensures that at most one request per domain is in-flight at a time, with a minimum interval between requests to the same domain. Different domains can be crawled concurrently — the politeness constraint is per-domain, not global. Priority: not all URLs are equally important. Priority factors: PageRank (more authoritative pages first), freshness (recently updated pages should be recrawled sooner), depth (pages closer to the root are often more important), and content type (prefer HTML over PDFs or images). The frontier uses a multi-queue priority structure: high-priority URLs are dequeued first. Within the same priority, round-robin across domains to ensure breadth.
Distributed Crawling
A single machine cannot crawl the entire web (trillions of pages). Distribute across hundreds or thousands of workers. URL partitioning: assign each domain to a specific worker (hash(domain) % num_workers). This ensures politeness is maintained (only one worker crawls each domain) and simplifies per-domain rate limiting. The frontier is partitioned accordingly — each worker has its own frontier partition containing only its assigned domains. Coordination: a central coordinator (or a distributed queue like Kafka) distributes seed URLs and rebalances when workers fail. Workers periodically report their progress (URLs crawled, queue depth) to the coordinator. Fault tolerance: if a worker crashes, its assigned domains are reassigned to other workers. The frontier state (which URLs are pending) must be durable — store in a database or persistent queue, not just in-memory. Crawled URL state: store the last crawl timestamp and HTTP status for each URL. On re-crawl, use If-Modified-Since headers to avoid re-downloading unchanged pages (the server returns 304 Not Modified).
Deduplication Strategies
URL deduplication: the same page can be reached via multiple URLs (http vs https, www vs non-www, trailing slashes, URL parameters). Normalize URLs before deduplication: lowercase the scheme and host, remove default ports, remove fragment (#section), sort query parameters, and remove tracking parameters (utm_source, fbclid). After normalization, check the Bloom filter: if the URL is “probably present,” skip it. If “definitely not present,” add it to the frontier. Content deduplication: different URLs may serve the same content (mirrors, syndicated articles, URL parameters that do not change content). Compute a content fingerprint (hash of the main body text, ignoring navigation and ads). SimHash detects near-duplicates (pages that are 90%+ similar). If the fingerprint matches an already-crawled page, skip or deprioritize. robots.txt parsing: fetch and cache robots.txt for each domain (TTL 24 hours). Parse User-agent rules, Disallow paths, Allow overrides, Crawl-delay, and Sitemap directives. Respect all rules — violating robots.txt can get your crawler blocked and damages the reputation of your organization.
Recrawl Strategy and Freshness
The web changes constantly. Pages are updated, created, and deleted. A crawler must recrawl pages to keep the index fresh. Recrawl scheduling: (1) Fixed interval — recrawl every N days. Simple but wasteful (static pages are recrawled unnecessarily, rapidly changing pages are stale between crawls). (2) Adaptive — estimate each page change frequency based on historical data. Pages that change daily are recrawled daily. Pages that have not changed in a year are recrawled monthly. This focuses crawl resources on pages that actually change. (3) Sitemap-driven — many websites provide XML sitemaps with lastmod timestamps. The crawler checks the sitemap periodically and only recrawls pages with updated timestamps. Efficient but depends on accurate sitemaps. (4) Change detection signals — RSS feeds, Twitter links, and Google PubSubHubbub provide push notifications when content changes. Subscribe to signals for important sites. Budget allocation: with a crawl budget of 1 billion pages per day, allocate more budget to important domains (news sites, popular e-commerce) and less to low-value long-tail sites. Monitor index freshness metrics: the 95th percentile age of indexed pages should be under 7 days for a production search engine.