Low Level Design: Web Crawl Scheduler

What Is a Web Crawl Scheduler?

A Web Crawl Scheduler manages the frontier of URLs to be fetched, decides the order and priority of crawls, enforces politeness constraints, and hands work off to downloader workers. It is the brain of any large-scale crawler. Without a good scheduler, a crawler either hammers a single host into the ground, re-crawls stale pages too aggressively, or misses newly discovered URLs entirely.

Data Model / Schema

-- URL frontier
CREATE TABLE frontier (
    url_id        BIGINT PRIMARY KEY AUTO_INCREMENT,
    url           TEXT NOT NULL,
    url_hash      CHAR(64) UNIQUE NOT NULL,  -- SHA-256 of the normalized URL
    domain        VARCHAR(255) NOT NULL,
    priority      FLOAT DEFAULT 0.5,
    status        ENUM('pending', 'in_flight', 'done', 'failed') NOT NULL DEFAULT 'pending',
    retry_count   SMALLINT DEFAULT 0,
    next_fetch_at TIMESTAMP,
    last_fetched  TIMESTAMP,
    discovered_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_dispatch (domain, status, next_fetch_at)  -- the scheduler's hot path
);

-- Per-domain politeness state
CREATE TABLE domain_policy (
    domain          VARCHAR(255) PRIMARY KEY,
    crawl_delay_ms  INT DEFAULT 1000,
    robots_txt      TEXT,
    robots_fetched  TIMESTAMP,
    last_request_at TIMESTAMP,
    blocked_until   TIMESTAMP
);

-- Crawl job assignments
CREATE TABLE crawl_jobs (
    job_id      BIGINT PRIMARY KEY AUTO_INCREMENT,
    url_id      BIGINT REFERENCES frontier(url_id),
    worker_id   VARCHAR(64),
    assigned_at TIMESTAMP,
    deadline    TIMESTAMP
);

Core Algorithm and Workflow

The scheduler operates as a continuous loop:

  1. Seed and Discover: Seed URLs are inserted into the frontier. As downloaders return fetched pages, extracted links are deduplicated via url_hash and inserted with a computed priority score (based on PageRank estimate, freshness, and depth).
  2. Priority Queue: A heap or sorted set (e.g., a Redis ZSET scored by next_fetch_at plus a priority offset) surfaces the next URL to crawl per domain. The scheduler always picks the highest-priority URL whose domain is not currently throttled.
  3. Politeness Enforcement: Before dispatching a URL, the scheduler checks domain_policy.blocked_until. If the domain is throttled, the URL is re-queued with a delay. robots.txt rules are cached per domain and refreshed every 24 hours.
  4. Assignment: A selected URL transitions to in_flight and a crawl_jobs record is inserted with a deadline. If the worker does not respond before the deadline, the scheduler re-enqueues the URL.
  5. Completion: On success, the URL is marked done, last_fetched is updated, and a re-crawl is scheduled at now + crawl_interval based on change frequency estimation.
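
The dispatch logic in steps 2-4 can be sketched with an in-memory heap standing in for the Redis sorted set. The Scheduler class, its field names, and the one-second default delay are illustrative, not part of the schema above:

```python
import heapq
import time


class Scheduler:
    """Minimal in-memory sketch of the dispatch loop (steps 2-4)."""

    def __init__(self):
        self.heap = []            # entries: (next_fetch_at, -priority, url, domain)
        self.blocked_until = {}   # domain -> earliest time of the next request
        self.crawl_delay = {}     # domain -> politeness delay in seconds

    def enqueue(self, url, domain, priority, next_fetch_at):
        heapq.heappush(self.heap, (next_fetch_at, -priority, url, domain))

    def next_url(self, now=None):
        """Pop the highest-priority due URL whose domain is not throttled."""
        if now is None:
            now = time.time()
        deferred = []
        picked = None
        while self.heap:
            item = heapq.heappop(self.heap)
            next_fetch_at, neg_prio, url, domain = item
            if next_fetch_at > now:
                deferred.append(item)       # nothing else is due yet
                break
            if self.blocked_until.get(domain, 0) > now:
                deferred.append(item)       # domain throttled, re-queue
                continue
            # Reserve the domain for its politeness window and dispatch.
            self.blocked_until[domain] = now + self.crawl_delay.get(domain, 1.0)
            picked = url
            break
        for item in deferred:               # put skipped entries back
            heapq.heappush(self.heap, item)
        return picked
```

In production, the heap would be the hot-tier ZSET and blocked_until would be persisted in domain_policy.last_request_at, but the selection logic is the same: skip throttled domains, reserve the politeness window atomically with the dispatch.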

Failure Handling

  • Worker timeout: The scheduler scans for crawl_jobs past their deadline and resets the URL to pending. A dead-letter queue captures URLs that time out repeatedly.
  • HTTP errors: 4xx responses (except 429) set status to failed. 429 and 5xx responses trigger exponential backoff via next_fetch_at = now + 2^retry_count * base_delay.
  • Robots.txt fetch failure: If robots.txt cannot be fetched, the scheduler applies a conservative policy (no crawl) until it succeeds, avoiding inadvertent ToS violations.
  • Duplicate suppression: URL normalization (lowercasing scheme/host, stripping fragments, canonical query-param ordering) before hashing prevents near-duplicate URLs from flooding the frontier.
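
The normalization and backoff rules above can be sketched as follows, assuming SHA-256 for url_hash; the base delay and cap in backoff_delay are illustrative values:

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL before hashing: lowercase scheme and host,
    strip the fragment, and sort query parameters."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", query, ""))  # "" drops the fragment


def url_hash(url):
    """SHA-256 hex digest of the normalized URL (fits CHAR(64))."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


def backoff_delay(retry_count, base_delay=1.0, cap=3600.0):
    """Exponential backoff for 429/5xx: 2^retry_count * base_delay, capped."""
    return min(cap, (2 ** retry_count) * base_delay)
```

Because normalization runs before hashing, "https://Example.com/?b=2&a=1#top" and "https://example.com/?a=1&b=2" collapse to one frontier row, which is exactly what the UNIQUE constraint on url_hash enforces.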

Scalability Considerations

  • Partitioning by domain: Assign each domain to a consistent scheduler shard (consistent hashing on domain). This keeps politeness state local to one shard and avoids distributed locking.
  • Tiered frontier: Maintain a hot tier (Redis ZSET) for the next 5-minute fetch window and a cold tier (database) for everything else. Promote URLs from cold to hot as their next_fetch_at approaches.
  • Adaptive crawl rate: Monitor server response times and error rates per domain. Back off crawl rate automatically when a domain shows signs of overload.
  • Horizontal scaling: Scheduler nodes are stateless except for their domain partition. Adding nodes rebalances partitions via consistent hashing with minimal URL reassignment.
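
The domain-to-shard assignment in the first and last bullets can be sketched with a simple hash ring; the shard names and virtual-node count here are illustrative:

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping domains to scheduler shards.
    Each shard owns several virtual nodes so load spreads evenly and
    adding a shard only reassigns a small slice of domains."""

    def __init__(self, shards, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        # md5 used only for placement, not security; any stable hash works.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, domain):
        """Walk clockwise to the first virtual node at or after the hash."""
        idx = bisect.bisect(self.keys, self._hash(domain)) % len(self.keys)
        return self.ring[idx][1]
```

Since a domain always hashes to the same point on the ring, all of its politeness state lives on one shard, and growing the cluster from N to N+1 shards moves only roughly 1/(N+1) of the domains.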

Summary

A Web Crawl Scheduler must balance competing concerns: maximize crawl throughput, respect politeness constraints, prioritize high-value URLs, and recover gracefully from failures. Partitioning by domain, using a tiered priority queue, and applying exponential backoff with robots.txt enforcement cover the major design requirements. At interview scale, the key insight is that politeness and deduplication are first-class concerns, not afterthoughts.

