What Is a Web Crawl Scheduler?
A Web Crawl Scheduler manages the frontier of URLs to be fetched, decides the order and priority of crawls, enforces politeness constraints, and hands work off to downloader workers. It is the brain of any large-scale crawler. Without a good scheduler, a crawler either hammers a single host into the ground, wastes bandwidth re-crawling unchanged pages while fresh content goes stale, or misses newly discovered URLs entirely.
Data Model / Schema
-- URL frontier
CREATE TABLE frontier (
url_id BIGINT PRIMARY KEY AUTO_INCREMENT,
url TEXT NOT NULL,
url_hash CHAR(64) UNIQUE NOT NULL,
domain VARCHAR(255) NOT NULL,
priority FLOAT DEFAULT 0.5,
status ENUM('pending', 'in_flight', 'done', 'failed') DEFAULT 'pending',
retry_count SMALLINT DEFAULT 0,
next_fetch_at TIMESTAMP,
last_fetched TIMESTAMP,
discovered_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Per-domain politeness state
CREATE TABLE domain_policy (
domain VARCHAR(255) PRIMARY KEY,
crawl_delay_ms INT DEFAULT 1000,
robots_txt TEXT,
robots_fetched TIMESTAMP,
last_request_at TIMESTAMP,
blocked_until TIMESTAMP
);
-- Crawl job assignments
CREATE TABLE crawl_jobs (
job_id BIGINT PRIMARY KEY AUTO_INCREMENT,
url_id BIGINT REFERENCES frontier(url_id),
worker_id VARCHAR(64),
assigned_at TIMESTAMP,
deadline TIMESTAMP
);
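The core scheduler operation against this schema is an atomic "claim": select the highest-priority due URL and flip it to in_flight in one transaction so two workers never grab the same row. A minimal sketch, using an in-memory SQLite database and a simplified version of the frontier table above for illustration (production systems would use the full schema plus row-locking features like SELECT ... FOR UPDATE SKIP LOCKED):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE frontier (
  url_id        INTEGER PRIMARY KEY,
  url           TEXT NOT NULL,
  domain        TEXT NOT NULL,
  priority      REAL DEFAULT 0.5,
  status        TEXT DEFAULT 'pending',
  next_fetch_at REAL
);
""")

now = time.time()
conn.executemany(
    "INSERT INTO frontier (url, domain, priority, next_fetch_at) VALUES (?, ?, ?, ?)",
    [("https://a.example/1", "a.example", 0.9, now - 5),   # due, high priority
     ("https://b.example/1", "b.example", 0.4, now - 5),   # due, lower priority
     ("https://c.example/1", "c.example", 0.7, now + 3600)],  # not yet due
)

def claim_next(conn, now):
    """Pick the highest-priority due URL and mark it in_flight in one transaction."""
    with conn:  # implicit BEGIN ... COMMIT
        row = conn.execute(
            "SELECT url_id, url FROM frontier "
            "WHERE status = 'pending' AND next_fetch_at <= ? "
            "ORDER BY priority DESC LIMIT 1", (now,)
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE frontier SET status = 'in_flight' WHERE url_id = ?",
                     (row[0],))
        return row[1]

print(claim_next(conn, time.time()))  # → https://a.example/1
```

Each call hands out the next due URL in priority order; URLs whose next_fetch_at lies in the future stay untouched until their window opens.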
Core Algorithm and Workflow
The scheduler operates as a continuous loop:
- Seed and Discover: Seed URLs are inserted into the frontier. As downloaders return fetched pages, extracted links are deduplicated via url_hash and inserted with a computed priority score (based on PageRank estimate, freshness, and depth).
- Priority Queue: A heap or sorted set (Redis ZSET keyed by next_fetch_at + priority_offset) surfaces the next URL to crawl per domain. The scheduler always picks the highest-priority URL whose domain is not currently throttled.
- Politeness Enforcement: Before dispatching a URL, the scheduler checks domain_policy.blocked_until. If the domain is throttled, the URL is re-queued with a delay. robots.txt rules are cached per domain and refreshed every 24 hours.
- Assignment: A selected URL transitions to in_flight and a crawl_jobs record is inserted with a deadline. If the worker does not respond before the deadline, the scheduler re-enqueues the URL.
- Completion: On success, the URL is marked done, last_fetched is updated, and a re-crawl is scheduled at now + crawl_interval based on change frequency estimation.
Failure Handling
- Worker timeout: The scheduler scans for crawl_jobs past their deadline and resets the URL to pending. A dead-letter queue captures URLs that time out repeatedly.
- HTTP errors: 4xx responses (except 429) set status to failed. 429 and 5xx responses trigger exponential backoff via next_fetch_at = now + 2^retry_count * base_delay.
- Robots.txt fetch failure: If robots.txt cannot be fetched, the scheduler applies a conservative policy (no crawl) until it succeeds, avoiding inadvertent ToS violations.
- Duplicate suppression: URL normalization (lowercasing scheme/host, stripping fragments, canonical query-param ordering) before hashing prevents near-duplicate URLs from flooding the frontier.
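The backoff formula and normalization rules above are small enough to show directly. A sketch, assuming the backoff is capped (the cap value is an assumption, not from the text) and using Python's standard urllib.parse for canonicalization:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def backoff_delay(retry_count, base_delay=1.0, cap=3600.0):
    """Exponential backoff per the formula above: 2^retry_count * base_delay, capped."""
    return min((2 ** retry_count) * base_delay, cap)

def normalize_url(url):
    """Canonicalize before hashing: lowercase scheme/host, strip the fragment,
    and sort query parameters into a canonical order."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))

def url_hash(url):
    """64-char hex digest matching the CHAR(64) url_hash column in the schema."""
    return hashlib.sha256(normalize_url(url).encode()).hexdigest()
```

Because hashing happens after normalization, `https://Example.com/a?b=2&a=1#top` and `https://example.com/a?a=1&b=2` collapse to the same frontier entry.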
Scalability Considerations
- Partitioning by domain: Assign each domain to a consistent scheduler shard (consistent hashing on domain). This keeps politeness state local to one shard and avoids distributed locking.
- Tiered frontier: Maintain a hot tier (Redis ZSET) for the next 5-minute fetch window and a cold tier (database) for everything else. Promote URLs from cold to hot as their next_fetch_at approaches.
- Adaptive crawl rate: Monitor server response times and error rates per domain. Back off the crawl rate automatically when a domain shows signs of overload.
- Horizontal scaling: Scheduler nodes are stateless except for their domain partition. Adding nodes rebalances partitions via consistent hashing with minimal URL reassignment.
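The consistent-hashing scheme from the partitioning and horizontal-scaling points can be sketched with a simple hash ring. Node names and the virtual-node count are illustrative assumptions, not from the text:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps each domain to exactly one scheduler shard, so per-domain politeness
    state lives on a single node; adding a node reassigns only a fraction of domains."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash_position, node); vnodes smooths the spread
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node, vnodes=64):
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def shard_for(self, domain):
        # First ring position clockwise from the domain's hash owns the domain.
        positions = [h for h, _ in self._ring]
        idx = bisect.bisect(positions, self._hash(domain)) % len(self._ring)
        return self._ring[idx][1]
```

Because only the ring segments adjacent to a new node's positions change owners, adding a fourth shard to a three-shard ring moves roughly a quarter of the domains rather than rehashing all of them.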
Summary
A Web Crawl Scheduler must balance competing concerns: maximize crawl throughput, respect politeness constraints, prioritize high-value URLs, and recover gracefully from failures. Partitioning by domain, using a tiered priority queue, and applying exponential backoff with robots.txt enforcement covers the major design requirements. At interview scale, the key insight is that politeness and deduplication are first-class concerns, not afterthoughts.
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is a crawl scheduler and what problem does it solve?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A crawl scheduler determines which URLs to fetch, in what order, and how frequently. It solves the problem of efficiently revisiting billions of pages across the web while respecting robots.txt rules, server rate limits, and crawl politeness constraints. It balances freshness (recrawling changed content quickly) against breadth (discovering new URLs)."
      }
    },
    {
      "@type": "Question",
      "name": "How do you prioritize which URLs to crawl first in a crawl scheduler design?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "URL prioritization is typically based on signals such as PageRank or link popularity, change frequency estimated from past crawls, domain authority, content freshness decay, and explicit priority overrides for important domains. A priority queue (often a heap or tiered bucket queue) holds pending URLs, and schedulers assign higher-priority URLs to available crawler workers first."
      }
    },
    {
      "@type": "Question",
      "name": "How do you avoid overloading a single domain when designing a crawl scheduler?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Crawl politeness is enforced by grouping URLs by hostname or IP and applying per-domain rate limits. A common pattern is to maintain a per-domain queue and enforce a minimum delay (e.g., 1–5 seconds) between successive requests to the same host. Token bucket or leaky bucket algorithms are used to smooth request rates. Robots.txt crawl-delay directives are also respected."
      }
    },
    {
      "@type": "Question",
      "name": "How does a distributed crawl scheduler handle deduplication of URLs?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Deduplication prevents the same URL from being enqueued and fetched multiple times. Common approaches include a distributed hash set (e.g., Redis SET or a sharded key-value store) that tracks seen URLs, or a Bloom filter for memory-efficient approximate deduplication. URL canonicalization (normalizing query params, stripping fragments, resolving redirects) is applied before the deduplication check to reduce near-duplicate entries."
      }
    }
  ]
}
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering