What Is a Web Crawler?
A web crawler (spider) systematically browses the web, downloading pages and extracting links to discover new pages. Google uses Googlebot to crawl and index the web. A production crawler must handle billions of pages, respect website politeness rules, avoid duplicate content, and prioritize important pages. This is a classic system design question testing distributed systems knowledge, queue design, and politeness policy.
High-Level Architecture
- URL Frontier: priority queue of URLs to crawl, organized by priority and domain
- Fetcher: downloads web pages from URLs in the frontier
- Parser: extracts links and content from downloaded HTML
- Deduplicator: filters already-seen URLs and duplicate content
- URL Filter: removes URLs matching exclusion patterns (robots.txt, file extensions)
- Content Store: stores raw HTML and extracted text for the search index
URL Frontier Design
The URL Frontier is the most critical component. It must: (1) Prioritize high-value URLs (PageRank-like signal, freshness, update frequency). (2) Enforce politeness — do not send too many requests to one domain too fast. (3) Support massive scale — billions of URLs in the queue.
Two-queue architecture: a priority queue (front queue) orders URLs by priority score. A politeness queue (back queue) has one sub-queue per domain; each domain is rate-limited to at most one request per second (a 1-second minimum gap between requests). The crawler picks the domain whose last-fetch time is oldest, pops the top URL from that domain's queue, and fetches it. This ensures no single domain is hammered while throughput stays high across many domains.
# Simplified URL scheduler (politeness side; priority scoring omitted)
import time
from collections import deque
from urllib.parse import urlparse

class URLFrontier:
    def __init__(self):
        self.domain_queues = {}     # domain -> deque of URLs (back queues)
        self.last_fetch = {}        # domain -> timestamp of last request
        self.min_crawl_delay = 1.0  # seconds between requests per domain

    def add_url(self, url):
        domain = urlparse(url).netloc
        self.domain_queues.setdefault(domain, deque()).append(url)

    def next_url(self):
        now = time.time()
        # Return a URL from any domain whose crawl delay has elapsed.
        for domain, queue in self.domain_queues.items():
            if queue and now - self.last_fetch.get(domain, 0) >= self.min_crawl_delay:
                self.last_fetch[domain] = now
                return queue.popleft()
        return None  # every pending domain is still inside its delay window
robots.txt and Politeness
Every crawler must respect robots.txt, a standard file at the domain root that specifies which paths are off-limits to crawlers. Fetch and cache robots.txt per domain (typically with a 24-hour TTL) before crawling any pages from that domain, and re-fetch it periodically. Respect Crawl-delay directives (minimum seconds between requests) and honor Disallow rules — crawling disallowed paths can get your IP blocked or violate a site's terms of service. A crawler that ignores robots.txt will quickly find its IPs blocked by site operators and CDN providers.
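Python's standard library ships a robots.txt parser that understands Disallow and Crawl-delay. A minimal sketch — the rules, domain, and user-agent name below are illustrative, not from any real site:

```python
from urllib import robotparser

# Illustrative robots.txt rules; in production these come from the
# domain's /robots.txt, fetched once per domain and cached with a TTL.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

print(rp.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(rp.crawl_delay("MyCrawler"))  # 2 -- wait at least 2s between requests
```

In a real crawler, `RobotFileParser.set_url(...)` plus `read()` fetches the live file; `parse()` is shown here so the example is self-contained.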
Deduplication
Two types of deduplication: (1) URL deduplication: avoid crawling the same URL twice. Use a distributed hash set (Redis SET) — store the SHA-256 hash of the normalized URL. Before enqueuing a URL, check if its hash is in the set. If yes, discard it. URL normalization is critical: http://example.com, http://example.com/, and http://example.com/index.html may all be the same page. (2) Content deduplication: some sites serve the same content at multiple URLs. Compute a SimHash of the page content — SimHash is a locality-sensitive hash where similar documents have similar hash values. If a new page has SimHash within a small Hamming distance of an already-indexed page, it is a near-duplicate and can be skipped or merged.
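A minimal sketch of URL normalization plus hash-set dedup, using a local Python set as a stand-in for a shared Redis SET. The specific normalization rules (default-port stripping, index.html collapsing, query-parameter sorting) are illustrative choices; production normalizers handle many more cases:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    if netloc.endswith(":80"):          # drop the default HTTP port
        netloc = netloc[:-3]
    path = parts.path or "/"
    if path.endswith("/index.html"):    # collapse directory-index URLs
        path = path[: -len("index.html")]
    query = urlencode(sorted(parse_qsl(parts.query)))  # canonical param order
    return urlunsplit((parts.scheme.lower(), netloc, path, query, ""))

seen = set()  # stand-in for a distributed hash set (e.g. a Redis SET)

def should_crawl(url):
    digest = hashlib.sha256(normalize(url).encode()).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

# The equivalent spellings from the text normalize to the same URL:
print(normalize("http://example.com") == normalize("http://example.com/index.html"))  # True
```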
Handling JavaScript-Rendered Pages
Many modern sites use JavaScript frameworks (React, Angular, Vue) that render content client-side, so a basic HTTP fetch returns only the initial HTML skeleton with no actual content. Solutions: (1) Headless browser rendering: run Chrome in headless mode (via Puppeteer or Selenium) to execute JavaScript and capture the fully rendered DOM. This is expensive — roughly 10x slower and 10x more compute than a raw HTTP fetch — so reserve it for high-priority pages. (2) Server-side rendering (SSR): many frameworks can pre-render on the server; detect SSR support and use it. (3) Structured-data fallback: if the page provides JSON-LD or OpenGraph metadata, extract structured content without rendering at all. Google has described a similar two-wave approach: raw HTML is indexed first, and JavaScript-heavy pages are rendered later by a separate rendering cluster.
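The structured-data fallback can be sketched with the standard library alone: scan the raw HTML for application/ld+json script blocks and parse them, no rendering required. The sample page below is an illustrative assumption:

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects <script type="application/ld+json"> blocks from raw HTML."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True

    def handle_data(self, data):
        if self.in_jsonld:
            self.buf.append(data)  # buffer until the closing tag

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.blocks.append(json.loads("".join(self.buf)))
            self.buf = []
            self.in_jsonld = False

# Illustrative page: the body is an empty skeleton, but the JSON-LD
# block still carries extractable content.
html = ('<html><head><script type="application/ld+json">'
        '{"@type": "Article", "headline": "Hello"}'
        '</script></head><body></body></html>')
parser = JSONLDExtractor()
parser.feed(html)
print(parser.blocks[0]["headline"])  # Hello
```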
Scale Estimation
Web scale: 5 billion pages, recrawled every 30 days on average. Pages per second: 5B / (30 * 86400) = ~1930 pages/second. Average page size: 500KB. Bandwidth: 1930 * 500KB = ~1GB/s of download. Storage for raw HTML: 5B * 500KB = 2.5 petabytes. With 100 crawler machines each doing 20 RPS: 2000 RPS total. Each machine needs 1Gbps+ network and SSDs for local URL-queue caching. A DNS resolver pool is also needed: ~2000 lookups/second against public resolvers adds latency and risks throttling, so use a local DNS cache or a dedicated resolver cluster.
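The arithmetic above, computed explicitly (decimal units, matching the figures in the text):

```python
# Back-of-envelope scale estimation for the crawler.
PAGES = 5e9          # pages in the crawl set
RECRAWL_DAYS = 30    # average recrawl interval
AVG_PAGE_KB = 500    # average page size, decimal KB (1000 bytes)

pages_per_sec = PAGES / (RECRAWL_DAYS * 86400)
bandwidth_gb_s = pages_per_sec * AVG_PAGE_KB * 1000 / 1e9
storage_pb = PAGES * AVG_PAGE_KB * 1000 / 1e15

print(round(pages_per_sec))      # 1929 pages/second
print(round(bandwidth_gb_s, 2))  # 0.96 GB/s, i.e. roughly 1 GB/s
print(storage_pb)                # 2.5 PB of raw HTML
```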
Interview Tips
- robots.txt and politeness are often overlooked — mention them early to stand out
- The URL Frontier two-queue design (priority + per-domain queues) is the key insight
- URL deduplication: Bloom filter for memory efficiency (probabilistic, fast) or Redis SET for exact matching
- Content deduplication: SimHash — interviewers rarely hear this term and it impresses
- JavaScript rendering: mention headless Chrome and the cost tradeoff
Frequently Asked Questions
How does a web crawler handle politeness and avoid overloading websites?
Politeness is enforced at two levels: (1) robots.txt compliance: fetch and cache the robots.txt file from each domain before crawling any pages. Respect Disallow directives (blocked paths) and Crawl-delay (minimum seconds between requests). Cache robots.txt for 24 hours and re-fetch periodically. (2) Per-domain rate limiting: use a back-queue architecture where each domain has its own URL queue. The scheduler selects domains round-robin, only dispatching a new request to a domain after the configured crawl delay has elapsed (typically 1 second minimum). This prevents hammering a single server regardless of how many URLs from that domain are in the frontier. Large-scale crawlers like Googlebot also identify themselves via User-Agent header and maintain published crawl policies, so website operators can configure crawl rates.
How do you avoid crawling the same page twice in a web crawler?
URL deduplication uses a hash set: normalize the URL (lowercase, remove default ports, resolve relative paths, sort query parameters), compute its SHA-256 hash, and check if it is already in a Redis SET before enqueuing. If present, discard the URL. At web scale (trillions of URLs), a Bloom filter provides memory-efficient probabilistic deduplication: a 10-billion-element Bloom filter at 1% false positive rate uses about 12GB of memory vs 400GB for an exact hash set. False positives (occasionally skipping a URL that was not actually crawled) are acceptable — the page will be picked up in a future crawl cycle. Content deduplication catches pages that have different URLs but identical content: compute SimHash of the page text and compare Hamming distance to known hashes. Near-duplicates within 3 bits are filtered.
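The Bloom filter sizing above can be checked with the standard formula: m = -n * ln(p) / (ln 2)^2 bits and k = (m/n) * ln 2 hash functions. This sketch reproduces the ~12 GB figure (about 11.2 GiB in binary units):

```python
import math

def bloom_parameters(n, p):
    """Optimal bit count m and hash count k for n elements at false-positive rate p."""
    m_bits = -n * math.log(p) / math.log(2) ** 2
    k = m_bits / n * math.log(2)
    return m_bits, k

m_bits, k = bloom_parameters(10e9, 0.01)  # 10 billion URLs, 1% false positives
print(round(m_bits / 8 / 1024**3, 1), "GiB")  # 11.2 GiB -- the ~12 GB figure
print(round(k), "hash functions")             # 7
```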
How does a web crawler prioritize which pages to crawl first?
URL priority is a function of several signals: (1) PageRank or link count: pages with more inbound links from authoritative sites are crawled first. (2) Content freshness: pages that change frequently (news sites, stock prices) should be recrawled more often than static documentation pages. Use the Last-Modified header and ETag to detect changes without re-downloading the full page. (3) Domain authority: crawl all pages from high-authority domains (Wikipedia, major news sites) before deep-crawling low-authority blogs. (4) Discovery order: newly discovered URLs from high-priority pages inherit some priority. The URL Frontier implements this as a two-level queue: a priority queue orders URLs by score, which feeds into per-domain politeness queues that enforce rate limits. Scores are updated continuously as new link data and freshness signals are collected.
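As an illustration, a scheduler might combine these signals into a single score. The weights (0.5/0.3/0.2) and signal transforms below are illustrative assumptions, not a published formula:

```python
# Hypothetical priority score over the three signals discussed above.
def priority_score(inbound_links, days_since_change, domain_authority):
    link_signal = min(inbound_links / 1000, 1.0)  # cap link influence at 1000 links
    freshness = 1.0 / (1.0 + days_since_change)   # recently changed pages score higher
    return 0.5 * link_signal + 0.3 * freshness + 0.2 * domain_authority

# A popular, frequently updated page on an authoritative domain should
# outrank a stale, rarely linked page.
hot = priority_score(inbound_links=5000, days_since_change=1, domain_authority=0.9)
stale = priority_score(inbound_links=10, days_since_change=365, domain_authority=0.2)
print(hot > stale)  # True
```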