Data Model
CheckJob Table
CheckJob (
id,
target_domain,
status: queued/running/completed,
started_at,
completed_at,
total_links,
broken_count
)
LinkResult Table
LinkResult (
id,
job_id,
source_url,
target_url,
http_status,
redirect_chain JSONB,
latency_ms,
error_type: timeout/dns/ssl/404/500/redirect_loop,
checked_at
)
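As a rough illustration, the two tables above could be mirrored in application code as follows. The field names come from the schema; the Python types, defaults, and the enum class names are assumptions, since the schema does not specify column types.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"


class ErrorType(str, Enum):
    TIMEOUT = "timeout"
    DNS = "dns"
    SSL = "ssl"
    NOT_FOUND = "404"
    SERVER_ERROR = "500"
    REDIRECT_LOOP = "redirect_loop"


@dataclass
class CheckJob:
    id: int
    target_domain: str
    status: JobStatus = JobStatus.QUEUED
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    total_links: int = 0
    broken_count: int = 0


@dataclass
class LinkResult:
    id: int
    job_id: int
    source_url: str
    target_url: str
    http_status: Optional[int] = None  # assumed None when no response arrived
    redirect_chain: List[str] = field(default_factory=list)  # JSONB column
    latency_ms: Optional[int] = None
    error_type: Optional[ErrorType] = None  # assumed None for a healthy link
    checked_at: Optional[datetime] = None
```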
Crawl Flow
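Start from the seed URL → extract every anchor href and every src attribute (images, scripts, iframes) → resolve relative URLs to absolute → queue each URL for an HTTP HEAD request. Prefer HEAD over GET to save bandwidth; some servers reject HEAD (for example with 405 Method Not Allowed), so falling back to GET in that case avoids false positives.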
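The extraction and resolution steps can be sketched with the Python standard library alone. This is a minimal sketch, not a full crawler: the names LinkExtractor and extract_links are hypothetical, and dropping fragments and non-HTTP schemes (mailto:, javascript:) is an assumption the flow above does not state.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects anchor href values and src values from any tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if (tag == "a" and name == "href") or name == "src":
                # Resolve relative URLs against the page they appeared on.
                absolute = urljoin(self.base_url, value)
                # Assumption: the fragment is irrelevant for a link check.
                absolute, _fragment = urldefrag(absolute)
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return absolute http(s) URLs in document order, without duplicates."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))
```

Each URL returned by extract_links would then be queued for a HEAD request.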
Redirect Handling
Follow up to 10 redirects per link; detect loops by keeping a set of the URLs already seen in the chain.
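A minimal sketch of the redirect rule, assuming a hypothetical fetch(url) callable that returns a (status, location) pair without following redirects itself. The "too_many_redirects" label is also an assumption: the error_type list in the data model only names redirect_loop.

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def follow_redirects(url, fetch, max_redirects=10):
    """Return (final_status, redirect_chain, error_or_None).

    fetch is a hypothetical callable: fetch(url) -> (status, location).
    """
    chain = [url]
    seen = {url}
    while True:
        status, location = fetch(chain[-1])
        if status not in REDIRECT_STATUSES or not location:
            return status, chain, None
        # Location may be relative, so resolve it against the current URL.
        next_url = urljoin(chain[-1], location)
        if next_url in seen:
            return status, chain + [next_url], "redirect_loop"
        if len(chain) - 1 >= max_redirects:
            return status, chain, "too_many_redirects"
        seen.add(next_url)
        chain.append(next_url)
```

The returned chain is what would be stored in the redirect_chain column.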
External Link Handling
Check external links but do not crawl external domains.
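One way to express this rule in code is to decide per URL whether it is checked and whether it is crawled further. The function names are hypothetical, and treating subdomains of the target domain as internal is an assumption.

```python
from urllib.parse import urlsplit


def is_internal(url, target_domain):
    """True if url belongs to target_domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    target = target_domain.lower()
    return host == target or host.endswith("." + target)


def plan(url, target_domain):
    """Every link is checked; only internal pages are crawled for more links."""
    return {"check": True, "crawl": is_internal(url, target_domain)}
```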
Concurrency
Run N workers per job and enforce a per-domain rate limit (at most 5 requests per second).
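The per-domain limit could be enforced with a small shared limiter that hands each worker the delay it must wait before sending. This is a sketch using fixed spacing between requests (one slot every 1/5 s), which is one of several ways to cap a rate; the class name and the injectable clock are assumptions made for testability.

```python
import threading
import time
from collections import defaultdict
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Spaces requests to the same domain at least 1/max_rps seconds apart."""

    def __init__(self, max_rps=5.0, clock=time.monotonic):
        self.min_interval = 1.0 / max_rps
        self.clock = clock
        self.lock = threading.Lock()
        # domain -> earliest time at which the next request may be sent
        self.next_slot = defaultdict(float)

    def reserve(self, url):
        """Reserve the next slot for the URL's domain; return seconds to wait."""
        domain = urlsplit(url).hostname or ""
        with self.lock:
            now = self.clock()
            slot = max(now, self.next_slot[domain])
            self.next_slot[domain] = slot + self.min_interval
        return slot - now
```

A worker would call time.sleep(limiter.reserve(url)) before issuing its request, so workers hitting different domains never block each other.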
Retry Policy
Retry up to 3 times with a 1 s delay between attempts before marking a link as broken.
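A sketch of the retry policy, reading "3x" as one initial attempt plus up to three retries (an interpretation, since the policy does not say whether the first attempt counts). The check callable is hypothetical and is assumed to return an error_type string, or None when the link is healthy.

```python
import time


def check_with_retry(check, url, retries=3, delay_s=1.0, sleep=time.sleep):
    """Return None if any attempt succeeds, else the last error_type.

    check is a hypothetical callable: check(url) -> error_type or None.
    """
    error = check(url)
    attempts = 0
    while error is not None and attempts < retries:
        sleep(delay_s)  # fixed delay between attempts
        attempts += 1
        error = check(url)
    return error
```

The link is marked broken only if the returned value is not None.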
Report Generation
Group broken links by error_type and HTTP status, sorted by frequency (most common first).
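The grouping can be sketched with collections.Counter. The summarize function is hypothetical, and it assumes each result is a dict with error_type and http_status keys, where error_type is None for a healthy link.

```python
from collections import Counter


def summarize(results):
    """Count broken links per (error_type, http_status), most frequent first."""
    counts = Counter(
        (r["error_type"], r["http_status"])
        for r in results
        if r["error_type"] is not None
    )
    return counts.most_common()
```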
Scheduling
Run full check weekly; watch mode re-checks known-broken links daily.
Notifications
Email a report with the broken link count and a CSV export.
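The CSV export could be produced with the standard csv module. The column selection and the broken_links_csv name are assumptions; the section above does not say which fields the export contains.

```python
import csv
import io

# Assumed export columns, taken from the LinkResult fields.
FIELDS = ["source_url", "target_url", "http_status", "error_type", "checked_at"]


def broken_links_csv(results):
    """Return CSV text with one row per result that has an error_type."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in results:
        if row.get("error_type"):
            writer.writerow(row)
    return buf.getvalue()
```

The returned text can be attached to the email, and the number of data rows gives the broken link count for the message body.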
Ignore List
Exclude known third-party links that are slow but not broken.
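A simple form of the ignore list matches each link's hostname against glob patterns. The patterns below are made-up placeholders, and matching on hostname rather than on the full URL is an assumption.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical entries; real ones would come from job configuration.
IGNORE_PATTERNS = ["*.slow-partner.example", "status.vendor.example"]


def is_ignored(url, patterns=IGNORE_PATTERNS):
    """True if the URL's hostname matches any ignore pattern."""
    host = (urlsplit(url).hostname or "").lower()
    return any(fnmatchcase(host, pattern) for pattern in patterns)
```

Ignored links would be skipped before they reach the request queue, so they neither count as broken nor consume the rate limit.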