Low Level Design: Link Checker Service

Data Model

CheckJob Table

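-- PostgreSQL sketch. Column types and constraints are assumptions; the design specifies only names and status values.
CREATE TABLE CheckJob (
  id            BIGSERIAL PRIMARY KEY,
  target_domain TEXT NOT NULL,
  status        TEXT NOT NULL CHECK (status IN ('queued', 'running', 'completed')),
  started_at    TIMESTAMPTZ,
  completed_at  TIMESTAMPTZ,
  total_links   INTEGER NOT NULL DEFAULT 0,
  broken_count  INTEGER NOT NULL DEFAULT 0
);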

LinkResult Table

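-- PostgreSQL sketch. Column types and constraints are assumptions; the design specifies only names and error_type values.
CREATE TABLE LinkResult (
  id             BIGSERIAL PRIMARY KEY,
  job_id         BIGINT NOT NULL REFERENCES CheckJob (id),
  source_url     TEXT NOT NULL,
  target_url     TEXT NOT NULL,
  http_status    INTEGER,      -- NULL when no HTTP response was received
  redirect_chain JSONB,
  latency_ms     INTEGER,
  error_type     TEXT CHECK (error_type IN ('timeout', 'dns', 'ssl', '404', '500', 'redirect_loop')),  -- NULL when the link is healthy
  checked_at     TIMESTAMPTZ NOT NULL
);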

Crawl Flow

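Start from the seed URL → extract anchor href attributes and src attributes (images, scripts, iframes) → resolve relative URLs to absolute → queue each URL for an HTTP HEAD request. HEAD is preferred over GET to save bandwidth; fall back to GET when a server rejects HEAD (for example with 405 Method Not Allowed).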
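The extraction and resolution steps can be sketched with the Python standard library. This is a minimal illustration, not the service's actual parser: the names `LinkExtractor` and `extract_links` are hypothetical, and dropping fragments and non-HTTP schemes (such as mailto:) is an assumption the design does not state.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects anchor href values and src values from any tag."""

    def __init__(self):
        super().__init__()
        self.raw_links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if (tag == "a" and name == "href") or name == "src":
                self.raw_links.append(value)


def extract_links(page_url, html):
    """Return absolute http(s) URLs in document order, without duplicates."""
    parser = LinkExtractor()
    parser.feed(html)
    seen, result = set(), []
    for raw in parser.raw_links:
        # Resolve against the page URL, then drop any #fragment (assumption).
        absolute, _fragment = urldefrag(urljoin(page_url, raw))
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

Each returned URL would then be queued for a HEAD request by the workers.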

Redirect Handling

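Follow up to 10 redirects per link, recording each hop in redirect_chain. Detect loops by keeping a set of the URLs already seen in the chain: a redirect to a URL in that set marks the link as redirect_loop.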
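A minimal sketch of manual redirect following with a seen-URL set. The `fetch` callable is hypothetical: it stands in for whatever HTTP client issues the HEAD request and is assumed to return `(status, location)`. The `too_many_redirects` value is also an assumption, since the design does not say how a chain longer than 10 hops is recorded.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 10  # limit stated in the design
REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def follow_redirects(url, fetch, max_redirects=MAX_REDIRECTS):
    """Return (final_status, redirect_chain, error_type).

    fetch(url) is a hypothetical callable returning (status, location),
    where location is the Location header value or None.
    """
    seen = {url}
    chain = []
    current = url
    while True:
        status, location = fetch(current)
        if status not in REDIRECT_STATUSES or not location:
            return status, chain, None
        next_url = urljoin(current, location)  # Location may be relative
        hop = {"from": current, "to": next_url, "status": status}
        if next_url in seen:
            chain.append(hop)
            return status, chain, "redirect_loop"
        if len(chain) >= max_redirects:
            # Assumed error value; not part of the original error_type list.
            return status, chain, "too_many_redirects"
        chain.append(hop)
        seen.add(next_url)
        current = next_url
```

The returned chain is a list of plain dictionaries, so it can be stored directly in the redirect_chain JSONB column.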

External Link Handling

Check external links for their status, but do not crawl external domains: pages on other domains are never parsed for further links.
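One way to express this rule is a predicate that decides whether a checked URL should also be crawled. The function name `is_internal` is hypothetical, and treating subdomains of target_domain as internal is an assumption; the design does not specify it.

```python
from urllib.parse import urlsplit


def is_internal(url, target_domain):
    """True if url belongs to target_domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    domain = target_domain.lower()
    return host == domain or host.endswith("." + domain)


# Every link is checked; only internal pages are parsed for more links.
# if is_internal(url, job_domain): enqueue_for_crawl(url)  # hypothetical helper
```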

Concurrency

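Each job runs N concurrent workers. Requests are rate limited per domain to at most 5 requests per second (rps), regardless of how many workers are active.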
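The sketch below shows one possible shape for this: a thread pool of N workers sharing a per-domain limiter. `DomainRateLimiter` and `run_checks` are hypothetical names. The limiter spaces requests evenly (at least 1/5 s apart per domain), which is stricter than a burst-tolerant 5 rps budget; that choice is an assumption.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Spaces requests to each domain at least 1/max_rps seconds apart."""

    def __init__(self, max_rps=5.0, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_rps
        self._clock = clock
        self._sleep = sleep
        self._next_slot = {}  # domain -> earliest start time of the next request
        self._lock = threading.Lock()

    def acquire(self, url):
        """Block until a request to url's domain is allowed."""
        domain = urlsplit(url).hostname or ""
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot.get(domain, now))
            self._next_slot[domain] = slot + self._interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock so other domains proceed


def run_checks(urls, check, limiter, workers=8):
    """Run check(url) for every URL on `workers` threads, honoring the limiter."""
    def task(url):
        limiter.acquire(url)
        return check(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, urls))
```

Because the limiter is keyed by hostname, a slow or heavily linked domain does not hold back checks against other domains.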

Retry Policy

Retry a failing link up to 3 times, with a 1-second delay between attempts, before marking it as broken. This filters out transient errors.
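A minimal retry wrapper under one reading of this policy: one initial attempt plus up to 3 retries, so at most 4 requests per link. That reading, the function name `check_with_retry`, and the `check_once` callable returning `(ok, detail)` are all assumptions.

```python
import time


def check_with_retry(url, check_once, retries=3, delay_s=1.0, sleep=time.sleep):
    """Return (ok, detail, attempts).

    check_once(url) is a hypothetical callable returning (ok, detail).
    The link counts as broken only if every attempt fails.
    """
    attempts = 0
    while True:
        attempts += 1
        ok, detail = check_once(url)
        if ok or attempts > retries:
            return ok, detail, attempts
        sleep(delay_s)  # fixed 1 s delay, as stated in the design
```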

Report Generation

Group broken links by error_type and HTTP status, sorted by frequency with the most common group first.
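The grouping can be sketched in a few lines with `collections.Counter`. The function name `summarize` is hypothetical, and rows are assumed to be dictionaries keyed by the LinkResult column names, with error_type set to None for healthy links.

```python
from collections import Counter


def summarize(results):
    """Return [((error_type, http_status), count), ...], most frequent first."""
    counts = Counter(
        (row["error_type"], row["http_status"])
        for row in results
        if row["error_type"] is not None  # skip healthy links
    )
    return counts.most_common()
```

In a deployed service the same grouping could be done in SQL with GROUP BY on the LinkResult table; the in-memory version is shown only to make the ordering explicit.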

Scheduling

Run full check weekly; watch mode re-checks known-broken links daily.
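A small sketch of how a scheduler tick might choose between the two modes. The function name `next_run_kind` and the idea of storing the last run timestamps are assumptions; the design only fixes the weekly and daily intervals.

```python
from datetime import datetime, timedelta

FULL_INTERVAL = timedelta(weeks=1)   # full check, weekly
WATCH_INTERVAL = timedelta(days=1)   # re-check known-broken links, daily


def next_run_kind(now, last_full, last_watch):
    """Return "full", "watch", or None if nothing is due yet."""
    if last_full is None or now - last_full >= FULL_INTERVAL:
        return "full"
    # A full check also covers the known-broken links, so it resets the watch timer.
    last_any = max(last_full, last_watch) if last_watch else last_full
    if now - last_any >= WATCH_INTERVAL:
        return "watch"
    return None
```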

Notifications

Email a report with the broken link count and a CSV export of the broken links.
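The sketch below builds such a message with the standard `csv` and `email` modules, without sending it. Attaching the CSV to the email is an assumption, as are the function name `build_report_email`, the chosen CSV columns, and the `broken_links.csv` filename.

```python
import csv
import io
from email.message import EmailMessage

CSV_FIELDS = ["source_url", "target_url", "http_status", "error_type", "checked_at"]


def build_report_email(job, broken_rows, sender, recipient):
    """Return an EmailMessage with a summary body and a CSV attachment."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=CSV_FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(broken_rows)

    msg = EmailMessage()
    msg["Subject"] = (
        f"Link check for {job['target_domain']}: {job['broken_count']} broken links"
    )
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Checked {job['total_links']} links, {job['broken_count']} broken. "
        "Details are in the attached CSV."
    )
    # Passing a str creates a text/csv part; the filename marks it as an attachment.
    msg.add_attachment(buf.getvalue(), subtype="csv", filename="broken_links.csv")
    return msg
```

The message could then be handed to `smtplib.SMTP.send_message` or to whatever mail service the deployment uses.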

Ignore List

Exclude known third-party links that are slow but not broken, so that they are not reported as false positives.
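An ignore list could be as simple as a set of hostname patterns consulted before a link is queued. The pattern syntax (shell-style wildcards on the hostname), the example hostnames, and the name `is_ignored` are all assumptions for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical entries; real values would come from configuration.
IGNORE_PATTERNS = ["*.slow-cdn.example", "status.partner.example"]


def is_ignored(url, patterns=IGNORE_PATTERNS):
    """True if the URL's hostname matches any ignore pattern."""
    host = (urlsplit(url).hostname or "").lower()
    return any(fnmatchcase(host, pattern) for pattern in patterns)
```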

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Why use HTTP HEAD instead of GET for link checking?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "HEAD requests return only response headers without the response body, saving significant bandwidth when checking large numbers of links. The status code and redirect behavior are normally the same as for GET, so HEAD is usually sufficient for determining whether a link is broken. Servers that reject HEAD can be re-checked with GET."
      }
    },
    {
      "@type": "Question",
      "name": "How do you detect redirect loops in a link checker?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Maintain a seen-URL set per redirect chain. If a redirect destination is already in the set, a loop is detected and the link is marked with error_type=redirect_loop. Redirect following is also capped at 10 hops to handle excessively long chains."
      }
    },
    {
      "@type": "Question",
      "name": "How should a link checker handle external domains?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "External links are checked with HTTP HEAD requests to verify they resolve, but the checker does not crawl or follow links on external domains. An ignore list can be configured to skip known slow but valid third-party links to reduce false positives."
      }
    },
    {
      "@type": "Question",
      "name": "What retry and scheduling strategy works best for link checking?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Retry each failing link 3 times with a 1-second delay before marking it as broken to filter transient errors. Run a full site check weekly. Use a watch mode that re-checks previously identified broken links daily so issues are tracked and reported promptly."
      }
    }
  ]
}