Low Level Design: Web Crawler

URL Frontier

A priority queue of URLs waiting to be crawled. Each URL's priority combines its estimated PageRank with a freshness score.
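A minimal in-memory sketch of such a frontier, built on Python's heapq. The class name Frontier, the weighted-sum scoring, and the rank_weight parameter are assumptions for illustration; the design above only says that priority combines a PageRank estimate with freshness.

```python
import heapq
import itertools


class Frontier:
    """In-memory URL frontier: highest combined score is crawled first."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so equal scores pop in insertion order.
        self._counter = itertools.count()

    def push(self, url, rank_estimate, freshness, rank_weight=0.7):
        # Assumed scoring: weighted sum of the two signals, both in [0, 1].
        score = rank_weight * rank_estimate + (1 - rank_weight) * freshness
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
frontier.push("https://example.com/archive", rank_estimate=0.1, freshness=0.1)
frontier.push("https://example.com/", rank_estimate=0.9, freshness=0.5)
first = frontier.pop()
```

Here first is the home page, because its combined score is higher. A production frontier would live in shared storage rather than process memory.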

CrawlJob Table

CrawlJob (
  id,
  url,
  domain,
  priority,
  status: pending/crawling/done/failed,
  last_crawled_at,
  next_crawl_at,
  http_status,
  content_hash
)
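
One way to mirror this table in application code is a small dataclass. The field types and defaults below are assumptions, since the schema above lists only column names and the four status values.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class CrawlStatus(Enum):
    PENDING = "pending"
    CRAWLING = "crawling"
    DONE = "done"
    FAILED = "failed"


@dataclass
class CrawlJob:
    id: int
    url: str
    domain: str
    priority: float = 0.0                       # assumed numeric score
    status: CrawlStatus = CrawlStatus.PENDING
    last_crawled_at: Optional[datetime] = None  # unset until first crawl
    next_crawl_at: Optional[datetime] = None
    http_status: Optional[int] = None
    content_hash: Optional[str] = None


job = CrawlJob(id=1, url="https://example.com/", domain="example.com")
```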

Crawl Flow

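Dequeue a URL → check robots.txt (cached per domain) → fetch the page → parse the HTML → extract links → drop links already present in the seen-URL Bloom filter → enqueue the new URLs → store the content. A Bloom filter can report false positives but never false negatives, so a small fraction of genuinely new URLs may be skipped, but no URL is enqueued twice.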
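
The link-deduplication step can be sketched with a small hand-rolled Bloom filter. The sizes (num_bits, num_hashes), the double-hashing scheme over SHA-256, and the helper name filter_new_links are illustrative assumptions, not part of the design above.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def filter_new_links(links, seen):
    """Return links not yet in the filter, marking each as seen."""
    fresh = []
    for url in links:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


seen = BloomFilter()
extracted = ["https://a.example/", "https://b.example/", "https://a.example/"]
to_enqueue = filter_new_links(extracted, seen)
```

In practice URLs would be normalized (scheme, host case, fragments) before the membership check so that trivially different spellings map to the same entry.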

Politeness

Enforce a per-domain crawl delay with a per-domain rate limiter: honor the Crawl-delay directive in robots.txt when present, and default to 1s otherwise.
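
A single-process sketch of such a limiter. The class and method names are hypothetical, and the injectable clock exists only to make the behavior easy to test; a distributed crawler would keep the next-allowed timestamps in shared storage instead of a local dict.

```python
import time


class DomainRateLimiter:
    """Allows at most one fetch per domain per crawl delay."""

    def __init__(self, default_delay=1.0, clock=time.monotonic):
        self.default_delay = default_delay
        self.clock = clock
        self.delays = {}        # domain -> delay taken from Crawl-delay
        self.next_allowed = {}  # domain -> earliest time of the next fetch

    def set_delay(self, domain, delay):
        self.delays[domain] = delay

    def try_acquire(self, domain):
        """Return True and reserve the slot if the domain may be fetched now."""
        now = self.clock()
        allowed_at = self.next_allowed.get(domain)
        if allowed_at is not None and now < allowed_at:
            return False
        delay = self.delays.get(domain, self.default_delay)
        self.next_allowed[domain] = now + delay
        return True


fake_now = [100.0]
limiter = DomainRateLimiter(clock=lambda: fake_now[0])
first = limiter.try_acquire("example.com")   # allowed
second = limiter.try_acquire("example.com")  # too soon, rejected
other = limiter.try_acquire("example.org")   # different domain, allowed
fake_now[0] += 1.0
third = limiter.try_acquire("example.com")   # default 1s delay has elapsed
```

A worker that gets False would typically put the URL back and pick one from a different domain rather than sleep.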

robots.txt cache: stored in Redis with a 24h TTL per domain.
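
The parsing and caching logic can be sketched with the standard library's urllib.robotparser. To stay self-contained, this sketch uses an in-process dict in place of Redis; RobotsCache, crawl_delay_for, and the fetch_robots callable are hypothetical names, and the user agent "mybot" is a placeholder.

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TTL_SECONDS = 24 * 60 * 60  # 24h, as in the design above


class RobotsCache:
    """Caches one parsed robots.txt per domain until its TTL expires."""

    def __init__(self, fetch_robots, clock=time.time):
        # fetch_robots: assumed callable mapping a domain to robots.txt text.
        self.fetch_robots = fetch_robots
        self.clock = clock
        self._entries = {}  # domain -> (expires_at, parser)

    def get(self, domain):
        now = self.clock()
        entry = self._entries.get(domain)
        if entry is None or entry[0] <= now:
            parser = RobotFileParser()
            parser.parse(self.fetch_robots(domain).splitlines())
            entry = (now + ROBOTS_TTL_SECONDS, parser)
            self._entries[domain] = entry
        return entry[1]


def crawl_delay_for(parser, user_agent, default=1.0):
    """Crawl-delay for this agent, or the 1s default when none is declared."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


fetches = []

def fake_fetch(domain):
    fetches.append(domain)
    return "User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n"

cache = RobotsCache(fake_fetch, clock=lambda: 0.0)
rules = cache.get("example.com")
allowed = rules.can_fetch("mybot", "https://example.com/public")
blocked = rules.can_fetch("mybot", "https://example.com/private/page")
delay = crawl_delay_for(rules, "mybot")
cache.get("example.com")  # second call is served from the cache
```

With Redis, the raw robots.txt text would be stored under a per-domain key with a 24h expiry and parsed on read.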

Content Deduplication

Compute a SimHash fingerprint of each page's content. Pages whose fingerprints differ in only a few bits are treated as near-duplicates.
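
A compact sketch of SimHash over word tokens. The tokenizer, the use of MD5 as the per-token hash, the 64-bit width, and the Hamming threshold of 3 are all assumptions chosen for illustration; the design above does not specify them.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Return a SimHash fingerprint: similar texts give nearby fingerprints."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, threshold=3):
    return hamming_distance(simhash(text_a), simhash(text_b)) <= threshold


page = "Breaking news: the crawler design review is scheduled for Monday."
same = is_near_duplicate(page, page)
```

Weighting tokens by frequency or using word shingles instead of single words are common refinements that this sketch omits.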

Distributed Architecture

Multiple crawler workers coordinate through a shared URL frontier in Redis. Each worker claims URLs by domain, so requests to one domain are not issued by several workers at once and the per-domain politeness delay holds.
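
One possible way to assign domains to workers is stable hash partitioning, sketched below. This is an assumption: the design above says only that workers claim URLs by domain, and the function name worker_for_url is hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def worker_for_url(url, num_workers):
    """Map a URL to a worker index so that one domain always hits one worker."""
    domain = (urlsplit(url).hostname or "").lower()
    # Use a stable digest; Python's built-in hash() varies between processes.
    digest = hashlib.sha1(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


a = worker_for_url("https://example.com/page/1", num_workers=8)
b = worker_for_url("https://EXAMPLE.com/page/2?x=1", num_workers=8)
```

Both URLs land on the same worker because only the lowercased host is hashed. Plain modulo partitioning reshuffles most domains when num_workers changes; consistent hashing is the usual remedy if workers are added or removed often.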

Re-crawl Scheduling

Re-crawl static pages every 7d and frequently updated pages (e.g. news) every 1h. The interval for each page is chosen by change-frequency detection.
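
A sketch of one simple change-frequency heuristic: halve the interval when the content hash changed since the last crawl and double it when it did not, clamped to the 1h and 7d bounds above. The halve/double rule and the function names are assumptions; the design does not say how change frequency is detected.

```python
from datetime import datetime, timedelta

MIN_INTERVAL = timedelta(hours=1)  # frequently updated pages
MAX_INTERVAL = timedelta(days=7)   # static pages


def next_interval(current, changed):
    """Shrink the interval after a change, grow it otherwise, within bounds."""
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


def schedule_next_crawl(now, current, old_hash, new_hash):
    """Return (next_crawl_at, new_interval) from a content_hash comparison."""
    interval = next_interval(current, changed=(old_hash != new_hash))
    return now + interval, interval


now = datetime(2026, 1, 1, 12, 0)
next_at, interval = schedule_next_crawl(
    now, timedelta(hours=8), old_hash="abc", new_hash="def"
)
```

Here the page changed, so its interval drops from 8h to 4h and next_at is 16:00 on the same day.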

DNS Cache

Cache DNS results per domain with a 5min TTL so that repeated fetches from the same domain do not trigger repeated lookups.
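
A minimal in-process sketch of this cache. The class name DnsCache is hypothetical; the resolver defaults to socket.gethostbyname and, like the clock, is injectable so the example runs without network access.

```python
import socket
import time


class DnsCache:
    """Caches hostname -> IP lookups for a fixed TTL (5 minutes by default)."""

    def __init__(self, ttl=300.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl = ttl
        self.resolver = resolver
        self.clock = clock
        self._entries = {}  # host -> (expires_at, ip)

    def resolve(self, host):
        now = self.clock()
        entry = self._entries.get(host)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, skip the resolver
        ip = self.resolver(host)
        self._entries[host] = (now + self.ttl, ip)
        return ip


lookups = []

def fake_resolver(host):
    # Stand-in for a real resolver; 192.0.2.1 is a documentation address.
    lookups.append(host)
    return "192.0.2.1"

fake_now = [0.0]
dns = DnsCache(resolver=fake_resolver, clock=lambda: fake_now[0])
dns.resolve("example.com")
dns.resolve("example.com")   # served from the cache
fake_now[0] += 301.0
dns.resolve("example.com")   # TTL expired, resolver is called again
```

After these three calls the resolver has been invoked twice.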
