URL Frontier
A priority queue of URLs waiting to be crawled, ordered by a score that combines each page's estimated PageRank with its freshness.
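One way to picture the frontier is a heap keyed on a combined score. The sketch below is illustrative only: the 0.7/0.3 weights, the one-week staleness cap, and the names priority_score and UrlFrontier are assumptions, not part of the design above, and PageRank is assumed to be normalized to the range 0 to 1.

```python
import heapq
import itertools

WEEK_SECONDS = 7 * 24 * 3600  # assumed cap: a page is "fully stale" after 7 days


def priority_score(pagerank, age_seconds, rank_weight=0.7, freshness_weight=0.3):
    """Return a score where higher means more urgent (weights are assumptions)."""
    staleness = min(age_seconds / WEEK_SECONDS, 1.0)
    return rank_weight * pagerank + freshness_weight * staleness


class UrlFrontier:
    """In-memory priority queue; a real frontier would live in a shared store."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # keeps insertion order for equal scores

    def push(self, url, pagerank, age_seconds):
        score = priority_score(pagerank, age_seconds)
        # heapq is a min-heap, so negate the score to pop the highest first.
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

A high-PageRank page that has not been crawled for a week pops before a low-PageRank page crawled a moment ago.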
CrawlJob Table
CrawlJob (
id,
url,
domain,
priority,
status: pending/crawling/done/failed,
last_crawled_at,
next_crawl_at,
http_status,
content_hash
)
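The record above could be modeled in application code roughly as follows. The field types are assumptions (the table does not specify them), and new_job is a hypothetical helper that derives the domain from the URL.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional
from urllib.parse import urlsplit


class CrawlStatus(str, Enum):
    PENDING = "pending"
    CRAWLING = "crawling"
    DONE = "done"
    FAILED = "failed"


@dataclass
class CrawlJob:
    id: int
    url: str
    domain: str
    priority: float
    status: CrawlStatus = CrawlStatus.PENDING
    last_crawled_at: Optional[datetime] = None  # None until the first crawl
    next_crawl_at: Optional[datetime] = None
    http_status: Optional[int] = None
    content_hash: Optional[str] = None


def new_job(job_id, url, priority):
    """Hypothetical helper: build a pending job, deriving the domain from the URL."""
    domain = urlsplit(url).hostname or ""  # hostname is returned lowercased
    return CrawlJob(id=job_id, url=url, domain=domain, priority=priority)
```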
Crawl Flow
Dequeue URL → check robots.txt (cached per domain) → fetch URL → parse HTML → extract links → deduplicate links against a Bloom filter of seen URLs → enqueue new URLs → store content.
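The flow above can be sketched as a single loop. This is a simplified, single-process illustration: fetch and allowed are caller-supplied stand-ins for the HTTP fetcher and the robots.txt check, a plain FIFO queue replaces the priority frontier, and the Bloom filter sizing (2^20 bits, 5 hashes) is an arbitrary assumption. Because a Bloom filter can return false positives, a small fraction of unseen URLs may be skipped.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, allowed, max_pages=100):
    """fetch(url) returns HTML or None; allowed(url) models the robots.txt check."""
    seen = BloomFilter()
    seen.add(seed)
    queue = deque([seed])
    store = {}
    while queue and len(store) < max_pages:
        url = queue.popleft()              # dequeue
        if not allowed(url):               # robots.txt check
            continue
        html = fetch(url)                  # fetch
        if html is None:
            continue
        store[url] = html                  # store content
        parser = LinkExtractor()
        parser.feed(html)                  # parse and extract links
        for href in parser.links:
            link, _ = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)             # deduplicate, then enqueue
                queue.append(link)
    return store
```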
Politeness
Each domain has its own crawl delay, enforced by a per-domain rate limiter: honor the Crawl-delay directive in robots.txt when present, otherwise default to 1s.
robots.txt cache: Redis, 24h TTL per domain.
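A minimal sketch of the politeness check, assuming an in-process limiter (a multi-worker deployment would keep this state in the shared store). DomainRateLimiter and crawl_delay_from_robots are hypothetical names; the robots.txt parsing uses Python's standard urllib.robotparser, and the clock is injectable so the behavior can be tested without sleeping.

```python
import time
from urllib.robotparser import RobotFileParser


def crawl_delay_from_robots(robots_txt, user_agent, default=1.0):
    """Return the Crawl-delay for user_agent, or the default (1s) if absent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


class DomainRateLimiter:
    def __init__(self, default_delay=1.0, clock=time.monotonic):
        self.default_delay = default_delay
        self.clock = clock
        self._delays = {}        # domain -> crawl delay in seconds
        self._next_allowed = {}  # domain -> earliest time of the next fetch

    def set_crawl_delay(self, domain, delay):
        self._delays[domain] = delay

    def wait_time(self, domain):
        """Seconds until the domain may be fetched again (0.0 if ready)."""
        next_ok = self._next_allowed.get(domain, float("-inf"))
        return max(0.0, next_ok - self.clock())

    def try_acquire(self, domain):
        """Return True and reserve the slot if the domain may be fetched now."""
        now = self.clock()
        if now < self._next_allowed.get(domain, float("-inf")):
            return False
        delay = self._delays.get(domain, self.default_delay)
        self._next_allowed[domain] = now + delay
        return True
```

A worker that gets False from try_acquire can push the URL back onto the frontier and pick a URL from another domain instead of sleeping.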
Content Deduplication
Compute a SimHash fingerprint of each page's content; pages whose fingerprints differ in only a few bits are treated as near-duplicates.
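A compact illustration of SimHash follows. It is a sketch under simplifying assumptions: features are lowercase word tokens with unit weight (production systems often use weighted shingles), the per-token hash is the first 8 bytes of MD5, and the 3-bit threshold for 64-bit fingerprints is a commonly cited choice rather than something this design specifies.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Return a SimHash fingerprint (bits must be <= 64 with this token hash)."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, threshold=3):
    return hamming_distance(simhash(text_a), simhash(text_b)) <= threshold
```

Because every token votes independently, the fingerprint ignores word order and case, and a small edit to a long page flips only a few bits. An exact-match check can still use content_hash from the CrawlJob record.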
Distributed Architecture
Multiple crawler workers coordinate through a shared URL frontier in Redis. Each worker claims URLs by domain, so that a given domain is fetched by one worker at a time and its crawl delay is respected.
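Domain claiming can be modeled as a lease with an expiry. The class below is an in-memory stand-in, not a Redis client: it mimics the semantics of a Redis SET with the NX and EX options (set only if absent, expire after a TTL). DomainClaims, the 30s lease, and the renewal behavior are assumptions made for illustration.

```python
import time


class DomainClaims:
    """In-memory stand-in for a shared lease store keyed by domain."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._owners = {}  # domain -> (worker_id, expires_at)

    def claim(self, domain, worker_id, ttl=30.0):
        """Claim or renew the domain; fail if another worker holds a live lease."""
        now = self._clock()
        owner = self._owners.get(domain)
        if owner is not None and owner[1] > now and owner[0] != worker_id:
            return False
        self._owners[domain] = (worker_id, now + ttl)
        return True

    def release(self, domain, worker_id):
        """Release the lease, but only if this worker still owns it."""
        owner = self._owners.get(domain)
        if owner is not None and owner[0] == worker_id:
            del self._owners[domain]
```

The expiry matters: if a worker crashes mid-crawl, its domains become claimable again once the lease lapses instead of staying locked forever.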
Re-crawl Scheduling
Re-crawl static pages every 7d and frequently updated pages (such as news) every 1h; each page's interval is chosen from its detected change frequency.
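One simple way to detect change frequency is to compare content_hash between crawls and adjust the interval accordingly. The halve-on-change, double-on-no-change rule below is an assumed heuristic, not something the design prescribes; only the 1h and 7d bounds come from the text above.

```python
from datetime import datetime, timedelta

MIN_INTERVAL = timedelta(hours=1)  # fastest re-crawl (news-like pages)
MAX_INTERVAL = timedelta(days=7)   # slowest re-crawl (static pages)


def next_interval(current, changed):
    """Halve the interval if the page changed, double it otherwise, then clamp."""
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


def schedule_next(last_crawled_at, current_interval, old_hash, new_hash):
    """Return (next_crawl_at, new_interval) based on whether content_hash changed."""
    interval = next_interval(current_interval, old_hash != new_hash)
    return last_crawled_at + interval, interval
```

A page that changes on every visit converges to the 1h floor, while a page that never changes drifts up to the 7d ceiling.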
DNS Cache
Cache DNS results per domain with a 5min TTL, so that consecutive fetches from the same domain do not each trigger a resolver lookup.
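A TTL cache in front of the resolver is enough to illustrate the idea. This sketch assumes a per-process cache wrapping socket.gethostbyname; DnsCache is a hypothetical name, and the resolver and clock are injectable so the cache can be exercised without network access.

```python
import socket
import time


class DnsCache:
    def __init__(self, ttl=300.0, resolver=socket.gethostbyname, clock=time.monotonic):
        self._ttl = ttl            # 5 minutes, per the design above
        self._resolver = resolver
        self._clock = clock
        self._entries = {}         # host -> (address, expires_at)

    def resolve(self, host):
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]        # cache hit: entry has not expired yet
        address = self._resolver(host)
        self._entries[host] = (address, now + self._ttl)
        return address
```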