URL Frontier
Priority queue of URLs to crawl, ordered by estimated PageRank plus freshness (time since last crawl).
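A minimal in-memory sketch of the frontier as a min-heap. The class name, the 0.001 staleness weight, and the scoring formula are illustrative assumptions, not a fixed design; a production frontier would be persistent and sharded.

```python
import heapq
import itertools

class URLFrontier:
    """Priority queue of URLs; higher PageRank and staler pages pop first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps equal scores FIFO

    def push(self, url, pagerank_estimate, staleness_s=0.0):
        # Negate the score because heapq is a min-heap: the most
        # valuable/stale URL should come out first.
        score = -(pagerank_estimate + 0.001 * staleness_s)  # weight is an assumption
        heapq.heappush(self._heap, (score, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```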
CrawlJob Table
CrawlJob (
id,
url,
domain,
priority,
status: pending/crawling/done/failed,
last_crawled_at,
next_crawl_at,
http_status,
content_hash
)
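The table above can be mirrored as an application-side record. A sketch using a Python dataclass; field types and defaults are assumptions inferred from the schema, not a prescribed storage layout.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class CrawlStatus(str, Enum):
    PENDING = "pending"
    CRAWLING = "crawling"
    DONE = "done"
    FAILED = "failed"

@dataclass
class CrawlJob:
    id: int
    url: str
    domain: str
    priority: float
    status: CrawlStatus = CrawlStatus.PENDING
    last_crawled_at: Optional[float] = None  # epoch seconds
    next_crawl_at: Optional[float] = None
    http_status: Optional[int] = None
    content_hash: Optional[str] = None       # e.g. SimHash, hex-encoded
```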
Crawl Flow
Dequeue URL → check robots.txt (cached per domain) → fetch URL → parse HTML → extract links → deduplicate links via seen-URL bloom filter → enqueue new URLs → store content.
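Two steps of the flow above, link extraction and bloom-filter dedup, sketched with the standard library. The bloom-filter size and hash count are illustrative assumptions; false positives (skipping a never-seen URL) are possible, false negatives are not.

```python
import hashlib
from html.parser import HTMLParser

class BloomFilter:
    """Seen-URL set: a bit array probed by k independent hash positions."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_new_links(html, seen):
    """Parse links from a page and return only those not yet seen."""
    parser = LinkExtractor()
    parser.feed(html)
    fresh = [u for u in parser.links if u not in seen]
    for u in fresh:
        seen.add(u)
    return fresh
```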
Politeness
A per-domain rate limiter enforces the crawl delay: respect the Crawl-delay directive from robots.txt, defaulting to 1s between requests to the same domain.
robots.txt cache: Redis, 24h TTL per domain.
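The politeness rule above can be sketched as a per-domain rate limiter. The `now` parameter exists so the timing logic can be exercised without real sleeps; the class name and lease bookkeeping are assumptions (the robots.txt cache itself would live in Redis, not shown here).

```python
import time

class DomainRateLimiter:
    """Enforces a per-domain crawl delay (Crawl-delay, defaulting to 1s)."""

    def __init__(self, default_delay_s=1.0):
        self.default_delay_s = default_delay_s
        self._next_allowed = {}  # domain -> earliest allowed fetch time
        self._delays = {}        # domain -> Crawl-delay parsed from robots.txt

    def set_crawl_delay(self, domain, delay_s):
        self._delays[domain] = delay_s

    def wait(self, domain, now=None):
        """Block until the domain may be fetched; returns seconds waited."""
        now = time.monotonic() if now is None else now
        delay = self._delays.get(domain, self.default_delay_s)
        waited = max(0.0, self._next_allowed.get(domain, 0.0) - now)
        if waited:
            time.sleep(waited)
        self._next_allowed[domain] = now + waited + delay
        return waited
```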
Content Deduplication
SimHash of page content → near-duplicate detection.
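A minimal SimHash over word features, assuming a 64-bit fingerprint and a small Hamming-distance threshold for "near-duplicate"; the threshold of 3 and the word-level tokenization are illustrative choices.

```python
import hashlib

def simhash(text, bits=64):
    """Near-duplicate pages yield fingerprints with small Hamming distance."""
    v = [0] * bits
    for word in text.split():
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    out = 0
    for i in range(bits):
        if v[i] > 0:
            out |= 1 << i
    return out

def hamming(a, b):
    return bin(a ^ b).count("1")

def near_duplicate(a, b, threshold=3):
    return hamming(simhash(a), simhash(b)) <= threshold
```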
Distributed Architecture
Multiple crawler workers coordinate via shared URL frontier in Redis; each worker claims URLs by domain to respect politeness.
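The per-domain claim can be modeled as a lease with an expiry, mimicking Redis `SET NX EX` semantics. This sketch uses an in-memory dict as a stand-in for the shared Redis store; the 30-second lease and class name are assumptions.

```python
import time

class DomainClaims:
    """Lease registry: only one worker crawls a given domain at a time."""

    def __init__(self, lease_s=30.0):
        self.lease_s = lease_s
        self._store = {}  # domain -> (owner worker id, lease expiry)

    def claim(self, worker_id, domain, now=None):
        """Try to take (or renew) the domain lease; True on success."""
        now = time.monotonic() if now is None else now
        owner, expiry = self._store.get(domain, (None, 0.0))
        if owner is None or expiry <= now or owner == worker_id:
            self._store[domain] = (worker_id, now + self.lease_s)
            return True
        return False  # another worker currently holds the domain

    def release(self, worker_id, domain):
        if self._store.get(domain, (None, 0.0))[0] == worker_id:
            del self._store[domain]
```

The lease expiry matters: if a worker crashes mid-crawl, its domains become claimable again once the lease lapses instead of being stuck forever.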
Re-crawl Scheduling
Static pages every 7d; frequently updated pages (e.g. news) every 1h. Intervals are tuned per page by detecting how often its content actually changes between crawls.
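One common way to tune the interval is an AIMD-style multiplicative rule, sketched below under the assumption of the 1h/7d bounds above; the halve/double factors are illustrative.

```python
HOUR = 3600
WEEK = 7 * 24 * HOUR

def next_interval(prev_interval_s, changed, min_s=HOUR, max_s=WEEK):
    """Re-crawl twice as often after a detected content change,
    half as often after an unchanged fetch, clamped to [1h, 7d]."""
    if changed:
        return max(min_s, prev_interval_s // 2)
    return min(max_s, prev_interval_s * 2)
```

Over repeated crawls this converges toward roughly hourly fetches for news-like pages and weekly fetches for static ones.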
DNS Cache
Cache DNS lookups per domain (5min TTL) to avoid repeated lookups.
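A TTL cache for resolutions, assuming the 5-minute TTL above. The resolver is injectable (defaulting to `socket.gethostbyname`) so the cache logic can be tested without network access; the class name is an assumption.

```python
import socket
import time

class DNSCache:
    """Per-domain DNS cache with a TTL to avoid repeated lookups."""

    def __init__(self, ttl_s=300.0, resolver=socket.gethostbyname):
        self.ttl_s = ttl_s
        self.resolver = resolver
        self._cache = {}  # domain -> (ip, expiry time)

    def resolve(self, domain, now=None):
        now = time.monotonic() if now is None else now
        hit = self._cache.get(domain)
        if hit and hit[1] > now:
            return hit[0]          # fresh cache hit
        ip = self.resolver(domain)  # miss or expired: look up and re-cache
        self._cache[domain] = (ip, now + self.ttl_s)
        return ip
```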