Low Level Design: Link Checker Service

Data Model

CheckJob Table

CheckJob (
  id,
  target_domain,
  status: queued/running/completed,
  started_at,
  completed_at,
  total_links,
  broken_count
)

LinkResult Table

LinkResult (
  id,
  job_id,
  source_url,
  target_url,
  http_status,
  redirect_chain JSONB,
  latency_ms,
  error_type: timeout/dns/ssl/404/500/redirect_loop,
  checked_at
)

Crawl Flow

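Start from the seed URL → extract all anchor href attributes and all src attributes → resolve relative URLs to absolute → queue each URL for an HTTP HEAD request (prefer HEAD over GET to save bandwidth; fall back to GET when a server rejects HEAD, e.g. with 405 Method Not Allowed).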
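
The extraction and URL-resolution steps above can be sketched with the Python standard library. This is a minimal illustration, not a definitive implementation: the names LinkExtractor and extract_links are hypothetical, and two choices are assumptions that the design does not state: fragments (#...) are stripped, and non-HTTP schemes such as mailto: are skipped.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags and src values from any tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if (tag == "a" and name == "href") or name == "src":
                # Resolve relative URLs against the page they were found on.
                absolute = urljoin(self.base_url, value)
                # Assumption: fragments do not change the resource to check.
                absolute, _fragment = urldefrag(absolute)
                # Assumption: only http(s) links are queued for checking.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return absolute http(s) URLs found in the given HTML, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Each returned URL would then be queued for a HEAD request, with base_url stored as the source_url of the resulting LinkResult row.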

Redirect Handling

Follow up to 10 redirects per link → detect loops via a set of already-seen URLs → record each hop in redirect_chain.
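
A minimal sketch of the redirect logic, assuming redirects are followed manually so that every hop can be inspected. The fetch_head callable is hypothetical: it stands in for a single HEAD request that does not follow redirects itself and returns (status, location). The "too_many_redirects" value is also an assumption, since the error_type list in the data model has no entry for exceeding the hop limit.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 10  # limit stated in the design
REDIRECT_CODES = (301, 302, 303, 307, 308)


def follow_redirects(url, fetch_head, max_redirects=MAX_REDIRECTS):
    """Return (last_status, redirect_chain, error_type).

    fetch_head(url) is a hypothetical callable returning (status, location)
    for one HEAD request without automatic redirect following.
    """
    seen = {url}
    chain = []
    current = url
    while True:
        status, location = fetch_head(current)
        if status not in REDIRECT_CODES or not location:
            return status, chain, None
        # Location may be relative, so resolve it against the current URL.
        next_url = urljoin(current, location)
        chain.append({"from": current, "to": next_url, "status": status})
        if next_url in seen:
            return status, chain, "redirect_loop"
        if len(chain) > max_redirects:
            # Assumed label; not part of the error_type list in the data model.
            return status, chain, "too_many_redirects"
        seen.add(next_url)
        current = next_url
```

The returned chain is a list of plain dictionaries, so it can be serialized directly into the redirect_chain JSONB column.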

Check external links (verify that they respond) but do not crawl external domains: pages outside target_domain are never parsed for further links.
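
The scope rule can be expressed as a small predicate: every discovered link is checked, but only pages that pass this test are parsed for more links. The function name should_crawl is hypothetical, and treating subdomains of target_domain as internal is an assumption that the design does not settle.

```python
from urllib.parse import urlsplit


def should_crawl(url, target_domain):
    """True if the URL's host is target_domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    target = target_domain.lower()
    # Assumption: blog.example.com counts as part of example.com.
    return host == target or host.endswith("." + target)
```

The leading dot in the suffix comparison keeps look-alike hosts such as notexample.com out of scope.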

Concurrency

Each job runs N concurrent workers, subject to a per-domain rate limit of at most 5 requests per second (rps).
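
One way to enforce the per-domain limit across worker threads is to hand out evenly spaced time slots per host. This is a sketch under assumptions: the class name DomainRateLimiter is hypothetical, workers are assumed to be threads in one process, and fixed spacing (one request every 1/5 s) is a simplification that is stricter than a token bucket allowing short bursts.

```python
import threading
import time
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Spaces requests to each host at least 1/max_rps seconds apart."""

    def __init__(self, max_rps=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_rps
        self.clock = clock
        self.sleep = sleep
        self.lock = threading.Lock()
        self.next_slot = {}  # host -> earliest time the next request may start

    def acquire(self, url):
        """Block until this URL's host may be contacted; return seconds waited."""
        host = urlsplit(url).hostname or ""
        with self.lock:
            now = self.clock()
            slot = max(now, self.next_slot.get(host, now))
            self.next_slot[host] = slot + self.min_interval
        # Sleep outside the lock so workers targeting other hosts are not held up.
        wait = slot - now
        if wait > 0:
            self.sleep(wait)
        return wait
```

Each worker would call limiter.acquire(url) immediately before sending its request. The clock and sleep parameters exist only so the limiter can be exercised without real delays.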

Retry Policy

Retry a failed check 3 times, with a 1 s delay between attempts, before marking the link as broken.
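
The retry policy can be sketched as a small wrapper. Two assumptions are made here: "3x" is read as one initial attempt plus three retries (four attempts in total), and check is a hypothetical callable that returns (ok, detail) for a single attempt.

```python
import time


def check_with_retry(check, url, retries=3, delay_s=1.0, sleep=time.sleep):
    """Run check(url) until it succeeds or all retries are used.

    Returns (ok, detail, attempts), where detail comes from the last attempt.
    """
    detail = None
    for attempt in range(retries + 1):
        ok, detail = check(url)
        if ok:
            return True, detail, attempt + 1
        if attempt < retries:
            # Fixed delay between attempts, as stated in the policy.
            sleep(delay_s)
    # Every attempt failed: the caller marks the link as broken.
    return False, detail, retries + 1
```

A link is marked as broken only when the first element of the result is False; the detail from the last attempt can populate error_type.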

Report Generation

Group broken links by error_type and HTTP status, sorted by frequency (most common first).
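
The grouping can be illustrated in memory with collections.Counter. This sketch assumes LinkResult rows are available as dictionaries keyed by the column names in the data model, and that a row with no error_type is a healthy link; in practice the same result could come from a GROUP BY query on the LinkResult table.

```python
from collections import Counter


def summarize(results):
    """Count broken links per (error_type, http_status), most frequent first."""
    counts = Counter(
        (r["error_type"], r["http_status"])
        for r in results
        if r["error_type"]  # assumption: empty error_type means the link is fine
    )
    return counts.most_common()
```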

Scheduling

Run full check weekly; watch mode re-checks known-broken links daily.

Notifications

Email a report with the broken link count and a CSV export of the broken links.
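
A sketch of building the CSV export and the broken link count for the report. The column selection in FIELDS is an assumption drawn from the LinkResult table, and the function name broken_links_csv is hypothetical; sending the email itself is left out.

```python
import csv
import io

# Assumed subset of LinkResult columns to include in the export.
FIELDS = ["source_url", "target_url", "http_status", "error_type", "latency_ms"]


def broken_links_csv(results):
    """Return (broken_count, csv_text) for rows that have an error_type."""
    buf = io.StringIO()
    # extrasaction="ignore" drops columns that are not listed in FIELDS.
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    broken = [r for r in results if r.get("error_type")]
    writer.writerows(broken)
    return len(broken), buf.getvalue()
```

The count goes into the email body (and into CheckJob.broken_count), while the CSV text is the export that accompanies the report.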

Ignore List

Exclude known third-party links that are slow but not broken.
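
An ignore list can be as simple as a set of URL patterns tested before a link is queued. Shell-style wildcards are an assumption here (the design does not say how entries are expressed), and the name is_ignored and the example patterns are hypothetical.

```python
from fnmatch import fnmatchcase


def is_ignored(url, patterns):
    """True if the URL matches any shell-style pattern in the ignore list."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


# Hypothetical ignore list for slow but healthy third-party hosts.
IGNORE_PATTERNS = [
    "https://slow.example.com/*",
    "*://*.partner.example/*",
]
```

Links for which is_ignored returns True would be skipped before the HEAD request, so they never appear in the report.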
