Web Crawler: Core Architecture
A web crawler fetches web pages at scale, extracts links, and feeds them back into the system to crawl recursively. The design challenges are politeness (not overwhelming target servers), deduplication (not crawling the same URL twice), and distribution (scaling across many worker nodes without stepping on each other).
URL Frontier
The URL frontier is the queue of URLs to crawl. A naive FIFO queue is insufficient — it sends bursts of requests to the same host, violating politeness, and ignores page priority.
- Priority queue: Score URLs by freshness (time since last crawl) and authority (PageRank or inbound link count). Higher-priority URLs are crawled first.
- Per-host queues: Partition URLs by hostname. Each host has its own FIFO sub-queue. A scheduler pulls from host queues with a minimum inter-request delay enforced between pulls from the same host.
# Redis sorted set for priority frontier
ZADD frontier {priority_score} {url}
# Per-host queue
RPUSH host_queue:{hostname} {url}
# Last crawl time per host
SET last_crawl:{hostname} {timestamp}
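The scheduler behavior described above (FIFO within a host, minimum delay between pulls from the same host) can be sketched in plain Python. This in-memory version stands in for the Redis structures; the class name and `min_delay` default are illustrative:

```python
import time
from collections import defaultdict, deque

class HostAwareFrontier:
    """In-memory sketch of a politeness-aware frontier: URLs are FIFO
    within each host, and a minimum inter-request delay is enforced
    between pulls from the same host."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.host_queues = defaultdict(deque)  # hostname -> FIFO of URLs
        self.next_allowed = {}                 # hostname -> earliest next pull time

    def add(self, hostname, url):
        self.host_queues[hostname].append(url)
        self.next_allowed.setdefault(hostname, 0.0)

    def pop(self, now=None):
        """Return (hostname, url) from an eligible host, or None if every
        host is still inside its politeness window."""
        now = time.monotonic() if now is None else now
        for host, queue in self.host_queues.items():
            if queue and self.next_allowed[host] <= now:
                self.next_allowed[host] = now + self.min_delay
                return host, queue.popleft()
        return None
```

A production scheduler would also fold in the priority scores when choosing among eligible hosts, rather than scanning in insertion order.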
Robots.txt Fetching and Caching
Before crawling any path on a domain, fetch and parse robots.txt. Cache the parsed ruleset per domain with a TTL of 24 hours. Check each candidate URL against the ruleset before enqueuing.
GET https://example.com/robots.txt
Cache-Control: max-age=86400
# Parsed ruleset stored in Redis:
SET robots:{domain} {serialized_rules} EX 86400
Honor the Crawl-delay directive. If robots.txt specifies Crawl-delay: 10, wait at least 10 seconds between requests to that domain regardless of internal scheduling pressure.
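Python's standard `urllib.robotparser` handles both the allow/disallow rules and the Crawl-delay directive. A minimal sketch, with a hypothetical robots.txt body and user-agent string:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; in production this is the fetched file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Check candidate URLs before enqueuing, and read the crawl delay.
allowed = rules.can_fetch("examplebot", "https://example.com/public/page")
blocked = rules.can_fetch("examplebot", "https://example.com/private/page")
delay = rules.crawl_delay("examplebot")  # 10 for this ruleset
```

The serialized ruleset cached in Redis would be re-parsed into an object like this on each worker, rather than re-fetched per URL.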
Distributed Coordination with Consistent Hashing
In a distributed crawler with N worker nodes, multiple workers must not crawl the same domain simultaneously — this violates politeness and wastes resources. Use consistent hashing to assign each domain to exactly one worker node.
worker_id = hash(domain) % num_workers  # naive modulo assignment: changing num_workers remaps almost every domain
When a worker extracts a link, it routes the URL to the responsible worker via an internal message queue (e.g., Kafka partitioned by domain hash). The responsible worker owns all URLs for that domain and enforces its own crawl-delay. When workers are added or removed, consistent hashing with virtual nodes minimizes reassignment.
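A hash ring with virtual nodes can be sketched as follows; the worker names, vnode count, and use of MD5 are illustrative choices, not part of the design above:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of a consistent hash ring with virtual nodes for
    domain -> worker assignment."""

    def __init__(self, workers, vnodes=100):
        # Each worker contributes `vnodes` points on the ring.
        self.ring = sorted(
            (self._hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, domain):
        """Map a domain to the first worker token clockwise on the ring."""
        idx = bisect.bisect(self.points, self._hash(domain)) % len(self.ring)
        return self.ring[idx][1]
```

Routing an extracted link then amounts to publishing it to the queue or Kafka partition owned by `ring.owner(domain)`.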
Duplicate URL Detection with Bloom Filters
Storing all seen URLs in a database is expensive — billions of URLs require significant storage and lookup latency. A Bloom filter provides probabilistic membership testing with O(1) lookups and compact memory usage.
- False positives (the filter says seen for a URL that was never crawled): acceptable, since only a small fraction of new URLs is wrongly skipped. The rate is controlled by filter size and hash count.
- False negatives: impossible by design. If the filter says unseen, the URL has definitely not been crawled.
A 1 billion URL Bloom filter with 1% false positive rate requires approximately 1.2 GB of memory — fits comfortably in a single Redis instance with the Bloom module.
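The 1.2 GB figure follows from the standard Bloom filter sizing formulas; a quick check in Python (the function name is illustrative):

```python
import math

def bloom_parameters(n_items, fp_rate):
    """Optimal bit count and hash count for a Bloom filter, from the
    standard formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2."""
    m_bits = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
    k_hashes = round(m_bits / n_items * math.log(2))
    return m_bits, k_hashes

bits, hashes = bloom_parameters(1_000_000_000, 0.01)  # ~1.2 GB, 7 hash functions
```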
Content Deduplication
Different URLs may serve identical or near-identical content (e.g., paginated pages, tracking params). Two strategies:
- Exact deduplication: SHA-256 hash of the response body. Store hash → canonical URL mapping. If a fetched page's hash exists, mark it as a duplicate and skip indexing.
- Near-duplicate detection (SimHash): Compute a 64-bit fingerprint of page content. Pages with Hamming distance ≤ 3 bits are near-duplicates. SimHash enables detecting pages that differ only in ads, timestamps, or sidebars.
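A minimal SimHash sketch, using MD5 as the per-token hash (an illustrative choice; any well-mixed 64-bit hash works):

```python
import hashlib

def simhash(tokens, bits=64):
    """64-bit SimHash: each token's hash casts a +/-1 vote per bit
    position; the fingerprint keeps the bits with a positive total."""
    votes = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits; near-duplicates land at distance <= 3."""
    return bin(a ^ b).count("1")
```

A production version would weight tokens (e.g., TF-weighted shingles) rather than voting uniformly, and would index fingerprints to avoid pairwise comparisons.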
CREATE TABLE content_hashes (
sha256 CHAR(64) PRIMARY KEY,
canonical_url TEXT,
first_seen TIMESTAMP
)
HTML Parsing and URL Normalization
After fetching a page, extract all <a href>, <link>, and <script src> URLs. Normalize before enqueuing:
- Resolve relative URLs to absolute using the base URL of the fetched page.
- Lowercase the scheme and host.
- Remove default ports (port 80 for HTTP, 443 for HTTPS).
- Remove URL fragments (#section); they refer to the same resource.
- Strip tracking parameters (utm_source, fbclid) using a configurable blocklist.
- Canonicalize trailing slashes consistently.
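These rules map directly onto the standard library's `urllib.parse`; a sketch, with an illustrative blocklist (a real crawler loads it from configuration):

```python
from urllib.parse import urljoin, urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative blocklist; real crawlers load this from configuration.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}

def normalize_url(base_url, href):
    """Resolve and normalize an extracted link."""
    parts = urlsplit(urljoin(base_url, href))   # resolve relative URLs
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    # Keep the port only when it is not the scheme default
    if parts.port and (scheme, parts.port) not in {("http", 80), ("https", 443)}:
        netloc = f"{host}:{parts.port}"
    # Strip tracking parameters from the query string
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS])
    path = parts.path or "/"                    # canonical form for empty paths
    return urlunsplit((scheme, netloc, path, query, ""))  # "" drops the fragment
```

Trailing-slash canonicalization beyond the empty-path case is site-specific and omitted here.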
DNS Pre-Resolution Pool
DNS resolution adds 20–120 ms of latency per unique domain. A pre-resolution pool resolves DNS for queued domains ahead of crawl time, caches results while respecting record TTLs, and reuses connections. Maintain a pool of resolved hostname→IP mappings in memory, refreshed before TTL expiry.
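The caching half of this can be sketched with an injectable resolver, so the real lookup (e.g., `socket.getaddrinfo`) can be swapped in later; the class name and TTL default are illustrative:

```python
import time

class DNSCache:
    """TTL-respecting DNS cache sketch. `resolver` is a callable
    hostname -> ip so real lookups can be injected."""

    def __init__(self, resolver, ttl=300.0):
        self.resolver = resolver
        self.ttl = ttl
        self.cache = {}  # hostname -> (ip, expires_at)

    def resolve(self, hostname, now=None):
        now = time.monotonic() if now is None else now
        entry = self.cache.get(hostname)
        if entry and entry[1] > now:
            return entry[0]              # cache hit within TTL
        ip = self.resolver(hostname)     # miss or expired: re-resolve
        self.cache[hostname] = (ip, now + self.ttl)
        return ip

    def prefetch(self, hostnames, now=None):
        """Warm the cache for queued domains ahead of crawl time."""
        for host in hostnames:
            self.resolve(host, now=now)
```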
Politeness Budget
Beyond crawl-delay, enforce a per-domain request rate cap: maximum N requests per second (typically 1–5 for polite crawling, higher for domains that grant permission). Implement with a token bucket per domain in Redis using a Lua script for atomic decrement.
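The token-bucket arithmetic can be sketched locally in Python; in the Redis design the same refill-and-decrement would run inside a Lua script so the read-modify-write is atomic across workers:

```python
import time

class TokenBucket:
    """Per-domain rate cap sketch: refills `rate` tokens/sec up to
    `capacity`; each request consumes one token."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = 0.0

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```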
Frontier Persistence
Redis sorted sets store the hot frontier (URLs ready to crawl soon). PostgreSQL stores the cold frontier — URLs scheduled for future recrawl, crawl history, and priority metadata.
CREATE TABLE crawl_queue (
url TEXT PRIMARY KEY,
domain TEXT,
priority FLOAT,
next_crawl TIMESTAMP,
last_crawled TIMESTAMP,
crawl_count INT
)
A loader job periodically promotes URLs from PostgreSQL to Redis as their next_crawl time approaches.
Trade-offs and Failure Modes
- Crawler traps: Infinite URL spaces (calendars, session IDs in URLs). Mitigate with max URL depth, URL pattern normalization, and domain-level crawl caps.
- Bloom filter growth: Bloom filters cannot delete entries. Use a rotating double-filter scheme: when the current filter hits 80% capacity, start a new one and retire the old after the current crawl cycle completes.
- Worker node failure: When a worker dies, its domain assignments must be redistributed. Consistent hashing makes this automatic for new work. In-progress crawls need a lease/heartbeat mechanism to detect and reassign stalled work.
- robots.txt changes: A site may tighten robots.txt after initial caching. Short TTLs (6–24 hours) balance performance with compliance; always re-check before re-crawling a domain after a long gap.
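The rotating double-filter scheme described above can be sketched end to end; the tiny integer-bitmap Bloom filter here is illustrative only (a real deployment uses a properly sized bit array, e.g., RedisBloom):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter over a Python int bitmap (illustrative only)."""

    def __init__(self, m_bits=1 << 20, k=7):
        self.m, self.k, self.bits, self.count = m_bits, k, 0, 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos
        self.count += 1

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

class RotatingBloom:
    """Double-filter rotation: insert into `current`, check both filters;
    when `current` hits its capacity threshold (e.g., 80% of the sized
    capacity), retire `previous` and start a fresh filter."""

    def __init__(self, capacity, **filter_kwargs):
        self.capacity = capacity
        self.filter_kwargs = filter_kwargs
        self.current = BloomFilter(**filter_kwargs)
        self.previous = None

    def add(self, url):
        if self.current.count >= self.capacity:
            self.previous = self.current
            self.current = BloomFilter(**self.filter_kwargs)
        self.current.add(url)

    def seen(self, url):
        return url in self.current or (
            self.previous is not None and url in self.previous)
```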