System Design: Web Crawler — Distributed Crawling, URL Frontier, Politeness, Deduplication, robots.txt, Sitemap

A web crawler systematically browses the internet to download web pages for indexing, archiving, or data extraction. Google, Bing, and the Internet Archive operate crawlers that process billions of pages. Designing a web crawler tests your understanding of distributed systems, politeness constraints, deduplication, and priority scheduling. This guide covers the architecture of a production-scale web crawler for system design interviews.

High-Level Architecture

Components:

(1) Seed URLs — the starting point: a list of known, high-quality URLs (popular websites, sitemaps, previously crawled URLs).

(2) URL Frontier — a priority queue of URLs to crawl. Manages scheduling, politeness (per-domain rate limiting), and priority (important pages are crawled first).

(3) Fetcher — downloads web pages via HTTP. Respects robots.txt rules and handles redirects, timeouts, and retries. Runs on multiple distributed workers.

(4) Content parser — extracts text content, metadata (title, description), and outbound links from the downloaded HTML.

(5) URL deduplication — checks whether a URL has already been crawled or is already in the frontier. Uses a Bloom filter for memory-efficient existence checks.

(6) Content deduplication — detects duplicate or near-duplicate content across different URLs (mirrors, syndicated content) using content fingerprinting (SimHash, MinHash).

(7) Storage — stores crawled pages for indexing: object storage (S3) for raw HTML, a database for metadata and URL state.

Workflow: seed URLs -> frontier -> fetcher -> parser -> extract links -> deduplicate -> add new URLs to frontier -> repeat.
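This workflow can be sketched in a few lines of Python. A minimal single-threaded sketch, not production code: the fetch, parse, and store callables and the seen set are hypothetical injected dependencies, and a plain deque stands in for the priority frontier.

```python
from collections import deque

def crawl(seed_urls, fetch, parse, seen, store, max_pages=1000):
    """Single-threaded crawl loop sketch.

    fetch(url)  -> HTML string, or None on failure/disallowed (hypothetical)
    parse(html) -> iterable of outbound links (hypothetical)
    store(url, html) persists the page (hypothetical)
    seen is a set of already-queued URLs (a Bloom filter at scale).
    """
    frontier = deque(seed_urls)          # stand-in for the priority frontier
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()
        html = fetch(url)                # download, respecting robots.txt
        if html is None:
            continue                     # fetch failed or path disallowed
        store(url, html)                 # persist raw HTML + metadata
        for link in parse(html):         # extract outbound links
            if link not in seen:         # URL dedup before enqueueing
                seen.add(link)
                frontier.append(link)
        pages += 1
    return pages
```

In a real crawler each stage runs as a separate distributed component connected by queues; the loop above only shows the data flow between them.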

URL Frontier and Politeness

The URL frontier is the heart of the crawler — it determines what to crawl next and at what rate.

Politeness: a crawler must not overwhelm a website. robots.txt specifies which paths the crawler can access and the crawl-delay (minimum time between requests to the same domain). Even without a crawl-delay, a responsible crawler limits requests to 1 per second per domain (or fewer for small sites).

Implementation: the frontier maintains a per-domain queue. A scheduler ensures that at most one request per domain is in flight at a time, with a minimum interval between requests to the same domain. Different domains can be crawled concurrently — the politeness constraint is per-domain, not global.

Priority: not all URLs are equally important. Priority factors: PageRank (more authoritative pages first), freshness (recently updated pages should be recrawled sooner), depth (pages closer to the root are often more important), and content type (prefer HTML over PDFs or images). The frontier uses a multi-queue priority structure: high-priority URLs are dequeued first. Within the same priority, round-robin across domains to ensure breadth.
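The per-domain politeness scheduler can be sketched with a queue per domain plus a min-heap keyed by each domain's next allowed crawl time. A simplified sketch under stated assumptions: a single fixed delay for all domains (a real frontier reads Crawl-delay from robots.txt per domain) and no priority tiers.

```python
import heapq
from collections import deque

class PoliteFrontier:
    """Per-domain URL queues with a min-heap of (next_allowed_time, domain).

    Sketch only: one fixed delay for every domain, no priority tiers,
    and a drained domain forgets its next_allowed_time.
    """
    def __init__(self, delay=1.0):
        self.delay = delay
        self.queues = {}   # domain -> deque of URLs
        self.ready = []    # heap of (next_allowed_time, domain)

    def add(self, domain, url):
        if domain not in self.queues:
            self.queues[domain] = deque()
            heapq.heappush(self.ready, (0.0, domain))  # crawlable now
        self.queues[domain].append(url)

    def next_url(self, now):
        """Return (url, wait_seconds) for the next crawlable domain,
        or None if the frontier is empty. Caller sleeps wait_seconds
        before fetching, preserving the per-domain interval."""
        while self.ready:
            t, domain = heapq.heappop(self.ready)
            q = self.queues.get(domain)
            if not q:
                self.queues.pop(domain, None)   # defensive: drained domain
                continue
            url = q.popleft()
            if q:  # reschedule the domain after the politeness delay
                heapq.heappush(self.ready, (max(t, now) + self.delay, domain))
            else:
                del self.queues[domain]
            return url, max(0.0, t - now)
        return None
```

Because each domain appears in the heap at most once, at most one of its URLs is handed out per delay window, while different domains interleave freely — exactly the per-domain (not global) constraint described above.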

Distributed Crawling

A single machine cannot crawl the entire web (trillions of pages). Distribute across hundreds or thousands of workers.

URL partitioning: assign each domain to a specific worker (hash(domain) % num_workers). This ensures politeness is maintained (only one worker crawls each domain) and simplifies per-domain rate limiting. The frontier is partitioned accordingly — each worker has its own frontier partition containing only its assigned domains.

Coordination: a central coordinator (or a distributed queue like Kafka) distributes seed URLs and rebalances when workers fail. Workers periodically report their progress (URLs crawled, queue depth) to the coordinator.

Fault tolerance: if a worker crashes, its assigned domains are reassigned to other workers. The frontier state (which URLs are pending) must be durable — store it in a database or persistent queue, not just in memory.

Crawled URL state: store the last crawl timestamp and HTTP status for each URL. On re-crawl, use If-Modified-Since headers to avoid re-downloading unchanged pages (the server returns 304 Not Modified).
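The hash(domain) % num_workers assignment has one subtlety worth showing in code: Python's built-in hash() is salted per process, so it is not stable across machines or restarts. A sketch using a cryptographic digest for a stable, deterministic assignment:

```python
import hashlib

def worker_for_domain(domain: str, num_workers: int) -> int:
    """Stable domain -> worker assignment.

    Uses SHA-1 rather than Python's hash(), which is randomized per
    process and would send the same domain to different workers on
    different machines. Lowercasing keeps EXAMPLE.com and example.com
    on the same worker.
    """
    digest = hashlib.sha1(domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```

Note that plain modulo reshuffles almost every domain when num_workers changes; production systems typically layer consistent hashing (or rendezvous hashing) on top so that adding or removing a worker moves only a small fraction of domains.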

Deduplication Strategies

URL deduplication: the same page can be reached via multiple URLs (http vs https, www vs non-www, trailing slashes, URL parameters). Normalize URLs before deduplication: lowercase the scheme and host, remove default ports, remove the fragment (#section), sort query parameters, and remove tracking parameters (utm_source, fbclid). After normalization, check the Bloom filter: if the URL is "probably present," skip it. If "definitely not present," add it to the frontier.

Content deduplication: different URLs may serve the same content (mirrors, syndicated articles, URL parameters that do not change content). Compute a content fingerprint (hash of the main body text, ignoring navigation and ads). SimHash detects near-duplicates (pages that are 90%+ similar). If the fingerprint matches an already-crawled page, skip or deprioritize.

robots.txt parsing: fetch and cache robots.txt for each domain (TTL 24 hours). Parse User-agent rules, Disallow paths, Allow overrides, Crawl-delay, and Sitemap directives. Respect all rules — violating robots.txt can get your crawler blocked and damages the reputation of your organization.
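The normalization steps listed above can be sketched with the standard library's urllib.parse. A simplified sketch: the tracking-parameter blocklist is illustrative, and trailing slashes are only normalized for the root path, since stripping them elsewhere can change which resource a URL names.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative blocklist; real crawlers maintain a much longer list.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    """Canonicalize a URL before the dedup check: lowercase scheme and
    host, drop default ports and the #fragment, strip tracking
    parameters, and sort the remaining query parameters."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    port = parts.port
    # Keep the port only when it is not the scheme's default.
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if k not in TRACKING_PARAMS))
    path = parts.path or "/"  # "example.com" and "example.com/" are the same page
    return urlunsplit((scheme, host, path, query, ""))  # "" drops the fragment
```

Only the normalized form is checked against and inserted into the Bloom filter, so all spelling variants of a URL collapse to one frontier entry.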

Recrawl Strategy and Freshness

The web changes constantly. Pages are updated, created, and deleted. A crawler must recrawl pages to keep the index fresh.

Recrawl scheduling: (1) Fixed interval — recrawl every N days. Simple but wasteful (static pages are recrawled unnecessarily, rapidly changing pages go stale between crawls). (2) Adaptive — estimate each page's change frequency from historical data. Pages that change daily are recrawled daily; pages that have not changed in a year are recrawled monthly. This focuses crawl resources on pages that actually change. (3) Sitemap-driven — many websites provide XML sitemaps with lastmod timestamps. The crawler checks the sitemap periodically and recrawls only pages with updated timestamps. Efficient, but depends on accurate sitemaps. (4) Change detection signals — RSS feeds, social media activity, and WebSub (formerly PubSubHubbub) provide push notifications when content changes. Subscribe to these signals for important sites.

Budget allocation: with a crawl budget of 1 billion pages per day, allocate more budget to important domains (news sites, popular e-commerce) and less to low-value long-tail sites. Monitor index freshness metrics: the 95th percentile age of indexed pages should be under 7 days for a production search engine.
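The adaptive strategy is often implemented as a simple multiplicative adjustment of each page's recrawl interval. A minimal sketch with assumed tuning constants (halve on change, back off 1.5x otherwise, bounds of 1 and 30 days); production crawlers instead fit a change-rate model, such as a Poisson process, to the full crawl history.

```python
def next_recrawl_interval(prev_interval_days: float, changed: bool,
                          min_days: float = 1.0,
                          max_days: float = 30.0) -> float:
    """Adjust a page's recrawl interval after each crawl.

    changed=True  -> the content fingerprint differed from last time:
                     crawl more often (halve the interval).
    changed=False -> the page was unchanged (e.g. 304 Not Modified):
                     back off by 1.5x. Clamped to [min_days, max_days].
    The 2x / 1.5x / bound constants are illustrative, not canonical.
    """
    interval = prev_interval_days / 2 if changed else prev_interval_days * 1.5
    return min(max(interval, min_days), max_days)
```

Frequently changing pages converge toward the 1-day floor and static pages drift toward the 30-day ceiling, which is exactly the resource shift toward pages that actually change.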
