What is a URL frontier in a web crawler?

A URL frontier is a priority queue of URLs waiting to be crawled. It prioritizes URLs based on factors like estimated PageRank and content freshness to ensure high-value pages are crawled first.

How does a web crawler respect robots.txt?

The crawler fetches and parses robots.txt for each domain before crawling any pages on it. It caches the parsed result in Redis with a 24-hour TTL per domain to avoid repeated fetches, and honors the Crawl-delay directive, falling back to a default of 1 second between requests to the same domain when the directive is absent.

How does content deduplication work in a distributed crawler?

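Duplicate URLs are filtered with a Bloom filter over seen URLs before enqueuing. Because a Bloom filter can return false positives but never false negatives, a small fraction of new URLs may be skipped, but a URL already recorded as seen is never enqueued again. Duplicate content is detected with a SimHash of the page body, which catches near-duplicates even when pages differ slightly.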

How do you schedule re-crawls in a web crawler?

Re-crawl frequency is based on content change rate. Static pages are re-crawled every 7 days, while frequently updated content like news pages is re-crawled every hour. The crawler tracks change frequency over time to tune intervals dynamically.

Low Level Design: Web Crawler

URL Frontier

Priority queue of URLs to crawl, prioritized by PageRank estimate + freshness.
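
As a rough illustration of the frontier described above, here is a minimal in-memory sketch. The class name UrlFrontier, the linear scoring formula, and the 0.7/0.3 weights are assumptions made for the example; the design only says that priority combines a PageRank estimate with freshness.

```python
import heapq
import itertools


class UrlFrontier:
    """In-memory priority queue of URLs; higher score is popped first."""

    def __init__(self, rank_weight=0.7, freshness_weight=0.3):
        # The weights are illustrative assumptions, not values from the design.
        self.rank_weight = rank_weight
        self.freshness_weight = freshness_weight
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, url, rank_estimate, freshness):
        score = self.rank_weight * rank_estimate + self.freshness_weight * freshness
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        _neg_score, _seq, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

A production frontier would live in a shared store rather than in process memory, but the ordering logic stays the same.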

CrawlJob Table

CrawlJob (
  id,
  url,
  domain,
  priority,
  status: pending/crawling/done/failed,
  last_crawled_at,
  next_crawl_at,
  http_status,
  content_hash
)

Crawl Flow

Dequeue URL → check robots.txt (cached per domain) → fetch URL → parse HTML → extract links → deduplicate links via seen-URL Bloom filter → enqueue new URLs → store content.
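
The link-extraction and deduplication steps of this flow could look roughly like the sketch below. The BloomFilter class, its sizing (about one million bits, five hash functions), the double-hashing scheme, and the helper new_links are all assumptions for illustration; a real deployment would size the filter for the expected URL count and keep it in shared storage.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class BloomFilter:
    """Fixed-size Bloom filter; may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def new_links(html, base_url, seen):
    """Return links not yet recorded in `seen`, marking them as seen."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    fresh = []
    for link in parser.links:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh
```

The URLs returned by new_links are the ones that would be enqueued into the frontier.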

Politeness

Per-domain crawl delay (respect Crawl-delay in robots.txt, default 1s) enforced via per-domain rate limiter.

robots.txt cache: Redis, 24h TTL per domain.
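
A minimal sketch of the politeness checks, using Python's standard urllib.robotparser. The helper names parse_robots and crawl_delay_for and the DomainRateLimiter class are hypothetical, and the limiter here keeps its state in process memory; the Redis cache with a 24h TTL described above is not shown.

```python
import time
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY_SECONDS = 1.0  # fallback when robots.txt has no Crawl-delay


def parse_robots(robots_txt):
    """Parse the text of an already-fetched robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_delay_for(parser, user_agent):
    """Return the Crawl-delay for this agent, or the 1s default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


class DomainRateLimiter:
    """Tracks, per domain, the earliest time the next request may be sent."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._next_allowed = {}

    def wait_time(self, domain, delay):
        """Reserve the next slot for `domain`; return seconds to wait first."""
        now = self._clock()
        slot = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = slot + delay
        return slot - now
```

A worker would call parser.can_fetch(user_agent, url) before fetching, then sleep for the value returned by wait_time.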

Content Deduplication

SimHash of page content → near-duplicate detection.
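
To make the idea concrete, here is a small SimHash sketch. The 64-bit fingerprint size, word-level tokens, the BLAKE2b token hash, and the distance threshold of 3 are assumptions chosen for the example; real systems often use weighted features or shingles instead of single words.

```python
import hashlib
import re

FINGERPRINT_BITS = 64


def token_hash(token):
    """Stable 64-bit hash of a token."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text):
    """Compute a 64-bit SimHash fingerprint over the words of `text`."""
    weights = [0] * FINGERPRINT_BITS
    for token in re.findall(r"\w+", text.lower()):
        h = token_hash(token)
        for bit in range(FINGERPRINT_BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=3):
    # The threshold is an illustrative assumption and needs tuning on real data.
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

Pages whose fingerprints differ in only a few bits are treated as near-duplicates, so small edits such as a changed timestamp do not produce a second stored copy.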

Distributed Architecture

Multiple crawler workers coordinate via a shared URL frontier in Redis. Each worker claims URLs by domain, so that only one worker fetches from a given domain at a time and the per-domain crawl delay stays enforceable.
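
One simple way to claim URLs by domain is to hash the domain to a worker index, sketched below. This static hash partitioning is an assumption for illustration and is not necessarily how the claiming is done in this design; the function names domain_of and owner_of are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def domain_of(url):
    """Lower-cased hostname of a URL, or an empty string if it has none."""
    return (urlsplit(url).hostname or "").lower()


def owner_of(url, num_workers):
    """Map a URL to a worker index so that one domain always has one owner."""
    digest = hashlib.sha1(domain_of(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```

Because every URL of a domain maps to the same worker, that worker alone can enforce the domain's crawl delay. The trade-off is that changing the worker count reassigns most domains, which consistent hashing would soften.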

Re-crawl Scheduling

Static pages every 7d, frequently updated pages (news) every 1h; the interval for each page is chosen from its detected change frequency.
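
Change-frequency detection can be approximated by comparing content_hash between crawls. The sketch below halves the interval when the content changed and doubles it when it did not, clamped to the 1h and 7d bounds above. The halve/double policy and the function names are assumptions for illustration.

```python
from datetime import datetime, timedelta

MIN_INTERVAL = timedelta(hours=1)  # fastest re-crawl (news-like pages)
MAX_INTERVAL = timedelta(days=7)   # slowest re-crawl (static pages)


def next_interval(current, changed):
    """Shrink the interval after a change, grow it otherwise, within bounds."""
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


def schedule_next(last_crawled_at, current_interval, old_hash, new_hash):
    """Return (next_crawl_at, new_interval) for a CrawlJob row."""
    interval = next_interval(current_interval, changed=(old_hash != new_hash))
    return last_crawled_at + interval, interval
```

The returned next_crawl_at would be written back to the CrawlJob row so the scheduler can pick the job up when it comes due.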

DNS Cache

Cache DNS lookups per domain (5 min TTL) so that consecutive fetches from the same domain do not each trigger a new lookup.
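
A TTL cache of this kind can be sketched in a few lines. The DnsCache class and its injectable resolver and clock parameters are assumptions made to keep the example testable; by default it resolves with socket.gethostbyname.

```python
import socket
import time


class DnsCache:
    """Caches hostname lookups for a fixed TTL (300 seconds = 5 minutes)."""

    def __init__(self, ttl_seconds=300.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolver = resolver
        self._clock = clock
        self._entries = {}  # host -> (address, expires_at)

    def resolve(self, host):
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and cached[1] > now:
            return cached[0]  # still fresh, skip the lookup
        address = self._resolver(host)
        self._entries[host] = (address, now + self._ttl)
        return address
```

A fixed 5-minute TTL ignores the TTL carried in the DNS record itself, which is a deliberate simplification here.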