System Design: Web Crawler — Distributed Crawling, URL Frontier, Politeness, Deduplication, robots.txt, Sitemap

A web crawler systematically browses the internet to download web pages for indexing, archiving, or data extraction. Google, Bing, and the Internet Archive operate crawlers that process billions of pages. Designing a web crawler tests your understanding of distributed systems, politeness constraints, deduplication, and priority scheduling. This guide covers the architecture of a production-scale web crawler for system design interviews.

High-Level Architecture

Components:

(1) Seed URLs — the starting point: a list of known, high-quality URLs (popular websites, sitemaps, previously crawled URLs).

(2) URL Frontier — a priority queue of URLs to crawl. Manages scheduling, politeness (per-domain rate limiting), and priority (important pages are crawled first).

(3) Fetcher — downloads web pages via HTTP. Respects robots.txt rules and handles redirects, timeouts, and retries. Runs on multiple distributed workers.

(4) Content parser — extracts text content, metadata (title, description), and outbound links from the downloaded HTML.

(5) URL deduplication — checks whether a URL has already been crawled or is already in the frontier. Uses a Bloom filter for memory-efficient existence checks.

(6) Content deduplication — detects duplicate or near-duplicate content across different URLs (mirrors, syndicated content) using content fingerprinting (SimHash, MinHash).

(7) Storage — stores crawled pages for indexing: object storage (S3) for raw HTML, a database for metadata and URL state.

Workflow: seed URLs -> frontier -> fetcher -> parser -> extract links -> deduplicate -> add new URLs to frontier -> repeat.
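This workflow can be sketched in a few lines of Python. A minimal single-threaded sketch, not production code: the fetch, parse, and store callables and the seen set are hypothetical injected dependencies, and a plain deque stands in for the priority frontier.

```python
from collections import deque

def crawl(seed_urls, fetch, parse, seen, store, max_pages=1000):
    """Single-threaded crawl loop sketch.

    fetch(url)  -> HTML string, or None on failure/disallowed (hypothetical)
    parse(html) -> iterable of outbound links (hypothetical)
    store(url, html) persists the page (hypothetical)
    seen is a set of already-queued URLs (a Bloom filter at scale).
    """
    frontier = deque(seed_urls)          # stand-in for the priority frontier
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()
        html = fetch(url)                # download, respecting robots.txt
        if html is None:
            continue                     # fetch failed or path disallowed
        store(url, html)                 # persist raw HTML + metadata
        for link in parse(html):         # extract outbound links
            if link not in seen:         # URL dedup before enqueueing
                seen.add(link)
                frontier.append(link)
        pages += 1
    return pages
```

In a real crawler each stage runs as a separate distributed component connected by queues; the loop above only shows the data flow between them.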

URL Frontier and Politeness

The URL frontier is the heart of the crawler — it determines what to crawl next and at what rate.

Politeness: a crawler must not overwhelm a website. robots.txt specifies which paths the crawler can access and the crawl-delay (minimum time between requests to the same domain). Even without a crawl-delay, a responsible crawler limits requests to 1 per second per domain (or fewer for small sites).

Implementation: the frontier maintains a per-domain queue. A scheduler ensures that at most one request per domain is in flight at a time, with a minimum interval between requests to the same domain. Different domains can be crawled concurrently — the politeness constraint is per-domain, not global.

Priority: not all URLs are equally important. Priority factors: PageRank (more authoritative pages first), freshness (recently updated pages should be recrawled sooner), depth (pages closer to the root are often more important), and content type (prefer HTML over PDFs or images). The frontier uses a multi-queue priority structure: high-priority URLs are dequeued first. Within the same priority, round-robin across domains to ensure breadth.
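The per-domain politeness scheduler can be sketched with a queue per domain plus a min-heap keyed by each domain's next allowed crawl time. A simplified sketch under stated assumptions: a single fixed delay for all domains (a real frontier reads Crawl-delay from robots.txt per domain) and no priority tiers.

```python
import heapq
from collections import deque

class PoliteFrontier:
    """Per-domain URL queues with a min-heap of (next_allowed_time, domain).

    Sketch only: one fixed delay for every domain, no priority tiers,
    and a drained domain forgets its next_allowed_time.
    """
    def __init__(self, delay=1.0):
        self.delay = delay
        self.queues = {}   # domain -> deque of URLs
        self.ready = []    # heap of (next_allowed_time, domain)

    def add(self, domain, url):
        if domain not in self.queues:
            self.queues[domain] = deque()
            heapq.heappush(self.ready, (0.0, domain))  # crawlable now
        self.queues[domain].append(url)

    def next_url(self, now):
        """Return (url, wait_seconds) for the next crawlable domain,
        or None if the frontier is empty. Caller sleeps wait_seconds
        before fetching, preserving the per-domain interval."""
        while self.ready:
            t, domain = heapq.heappop(self.ready)
            q = self.queues.get(domain)
            if not q:
                self.queues.pop(domain, None)   # defensive: drained domain
                continue
            url = q.popleft()
            if q:  # reschedule the domain after the politeness delay
                heapq.heappush(self.ready, (max(t, now) + self.delay, domain))
            else:
                del self.queues[domain]
            return url, max(0.0, t - now)
        return None
```

Because each domain appears in the heap at most once, at most one of its URLs is handed out per delay window, while different domains interleave freely — exactly the per-domain (not global) constraint described above.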

Distributed Crawling

A single machine cannot crawl the entire web (trillions of pages). Distribute across hundreds or thousands of workers.

URL partitioning: assign each domain to a specific worker (hash(domain) % num_workers). This ensures politeness is maintained (only one worker crawls each domain) and simplifies per-domain rate limiting. The frontier is partitioned accordingly — each worker has its own frontier partition containing only its assigned domains.

Coordination: a central coordinator (or a distributed queue like Kafka) distributes seed URLs and rebalances when workers fail. Workers periodically report their progress (URLs crawled, queue depth) to the coordinator.

Fault tolerance: if a worker crashes, its assigned domains are reassigned to other workers. The frontier state (which URLs are pending) must be durable — store it in a database or persistent queue, not just in memory.

Crawled URL state: store the last crawl timestamp and HTTP status for each URL. On re-crawl, use If-Modified-Since headers to avoid re-downloading unchanged pages (the server returns 304 Not Modified).
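The hash(domain) % num_workers assignment has one subtlety worth showing in code: Python's built-in hash() is salted per process, so it is not stable across machines or restarts. A sketch using a cryptographic digest for a stable, deterministic assignment:

```python
import hashlib

def worker_for_domain(domain: str, num_workers: int) -> int:
    """Stable domain -> worker assignment.

    Uses SHA-1 rather than Python's hash(), which is randomized per
    process and would send the same domain to different workers on
    different machines. Lowercasing keeps EXAMPLE.com and example.com
    on the same worker.
    """
    digest = hashlib.sha1(domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```

Note that plain modulo reshuffles almost every domain when num_workers changes; production systems typically layer consistent hashing (or rendezvous hashing) on top so that adding or removing a worker moves only a small fraction of domains.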

Deduplication Strategies

URL deduplication: the same page can be reached via multiple URLs (http vs https, www vs non-www, trailing slashes, URL parameters). Normalize URLs before deduplication: lowercase the scheme and host, remove default ports, remove the fragment (#section), sort query parameters, and remove tracking parameters (utm_source, fbclid). After normalization, check the Bloom filter: if the URL is "probably present," skip it. If "definitely not present," add it to the frontier.

Content deduplication: different URLs may serve the same content (mirrors, syndicated articles, URL parameters that do not change content). Compute a content fingerprint (hash of the main body text, ignoring navigation and ads). SimHash detects near-duplicates (pages that are 90%+ similar). If the fingerprint matches an already-crawled page, skip or deprioritize.

robots.txt parsing: fetch and cache robots.txt for each domain (TTL 24 hours). Parse User-agent rules, Disallow paths, Allow overrides, Crawl-delay, and Sitemap directives. Respect all rules — violating robots.txt can get your crawler blocked and damages the reputation of your organization.
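The normalization steps listed above can be sketched with the standard library's urllib.parse. A simplified sketch: the tracking-parameter blocklist is illustrative, and trailing slashes are only normalized for the root path, since stripping them elsewhere can change which resource a URL names.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative blocklist; real crawlers maintain a much longer list.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    """Canonicalize a URL before the dedup check: lowercase scheme and
    host, drop default ports and the #fragment, strip tracking
    parameters, and sort the remaining query parameters."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    port = parts.port
    # Keep the port only when it is not the scheme's default.
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if k not in TRACKING_PARAMS))
    path = parts.path or "/"  # "example.com" and "example.com/" are the same page
    return urlunsplit((scheme, host, path, query, ""))  # "" drops the fragment
```

Only the normalized form is checked against and inserted into the Bloom filter, so all spelling variants of a URL collapse to one frontier entry.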

Recrawl Strategy and Freshness

The web changes constantly. Pages are updated, created, and deleted. A crawler must recrawl pages to keep the index fresh.

Recrawl scheduling: (1) Fixed interval — recrawl every N days. Simple but wasteful (static pages are recrawled unnecessarily, rapidly changing pages go stale between crawls). (2) Adaptive — estimate each page's change frequency from historical data. Pages that change daily are recrawled daily; pages that have not changed in a year are recrawled monthly. This focuses crawl resources on pages that actually change. (3) Sitemap-driven — many websites provide XML sitemaps with lastmod timestamps. The crawler checks the sitemap periodically and recrawls only pages with updated timestamps. Efficient, but depends on accurate sitemaps. (4) Change detection signals — RSS feeds, social media activity, and WebSub (formerly PubSubHubbub) provide push notifications when content changes. Subscribe to these signals for important sites.

Budget allocation: with a crawl budget of 1 billion pages per day, allocate more budget to important domains (news sites, popular e-commerce) and less to low-value long-tail sites. Monitor index freshness metrics: the 95th percentile age of indexed pages should be under 7 days for a production search engine.
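The adaptive strategy is often implemented as a simple multiplicative adjustment of each page's recrawl interval. A minimal sketch with assumed tuning constants (halve on change, back off 1.5x otherwise, bounds of 1 and 30 days); production crawlers instead fit a change-rate model, such as a Poisson process, to the full crawl history.

```python
def next_recrawl_interval(prev_interval_days: float, changed: bool,
                          min_days: float = 1.0,
                          max_days: float = 30.0) -> float:
    """Adjust a page's recrawl interval after each crawl.

    changed=True  -> the content fingerprint differed from last time:
                     crawl more often (halve the interval).
    changed=False -> the page was unchanged (e.g. 304 Not Modified):
                     back off by 1.5x. Clamped to [min_days, max_days].
    The 2x / 1.5x / bound constants are illustrative, not canonical.
    """
    interval = prev_interval_days / 2 if changed else prev_interval_days * 1.5
    return min(max(interval, min_days), max_days)
```

Frequently changing pages converge toward the 1-day floor and static pages drift toward the 30-day ceiling, which is exactly the resource shift toward pages that actually change.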
