Data Model
CheckJob Table
CheckJob (
id,
target_domain,
status: queued/running/completed,
started_at,
completed_at,
total_links,
broken_count
)
LinkResult Table
LinkResult (
id,
job_id,
source_url,
target_url,
http_status,
redirect_chain JSONB,
latency_ms,
error_type: timeout/dns/ssl/404/500/redirect_loop,
checked_at
)
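The two tables above can be sketched as concrete DDL. This is a minimal illustration using Python's built-in sqlite3 module, not a definitive schema: the column types, the CHECK constraints, the index, and the snake_case table names are assumptions, and redirect_chain is stored as JSON text because SQLite has no JSONB type (PostgreSQL would use JSONB as in the original).

```python
import json
import sqlite3

# Assumed column types and constraints; the source only lists column names.
SCHEMA = """
CREATE TABLE check_job (
    id            INTEGER PRIMARY KEY,
    target_domain TEXT NOT NULL,
    status        TEXT NOT NULL DEFAULT 'queued'
                  CHECK (status IN ('queued', 'running', 'completed')),
    started_at    TEXT,
    completed_at  TEXT,
    total_links   INTEGER NOT NULL DEFAULT 0,
    broken_count  INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE link_result (
    id             INTEGER PRIMARY KEY,
    job_id         INTEGER NOT NULL REFERENCES check_job(id),
    source_url     TEXT NOT NULL,
    target_url     TEXT NOT NULL,
    http_status    INTEGER,
    redirect_chain TEXT,  -- JSON text here; JSONB in PostgreSQL
    latency_ms     INTEGER,
    error_type     TEXT CHECK (error_type IN
                   ('timeout', 'dns', 'ssl', '404', '500', 'redirect_loop')),
    checked_at     TEXT NOT NULL
);
CREATE INDEX idx_link_result_job ON link_result(job_id);
"""


def open_db(path=":memory:"):
    """Open a database and create both tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


# Usage: create a job, then record one broken link for it.
conn = open_db()
job_id = conn.execute(
    "INSERT INTO check_job (target_domain) VALUES (?)", ("example.com",)
).lastrowid
conn.execute(
    "INSERT INTO link_result (job_id, source_url, target_url, http_status,"
    " redirect_chain, latency_ms, error_type, checked_at)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, datetime('now'))",
    (job_id, "https://example.com/", "https://example.com/old", 404,
     json.dumps([]), 120, "404"),
)
```

A NULL error_type passes the CHECK constraint, so healthy links can be stored in the same table.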
Crawl Flow
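Start from the seed URL, extract every anchor href and every src attribute, resolve relative URLs to absolute ones, and queue each URL for an HTTP HEAD request. HEAD is preferred over GET because it skips the response body and saves bandwidth. Some servers reject HEAD (for example with 405 Method Not Allowed), so fall back to GET before treating such a response as a broken link.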
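The extraction and URL-resolution steps could look roughly like the sketch below, which uses only the standard library. The class and function names are illustrative, and two behaviors are assumptions not stated above: fragments are stripped so the same page is not queued twice, and non-HTTP schemes such as mailto: are skipped. A `<base>` tag in the page is not handled.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from anchor href and any src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            wanted = (tag == "a" and name == "href") or name == "src"
            if not wanted:
                continue
            # Resolve against the page URL, then drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(self.base_url, value.strip()))
            # Assumption: only http(s) targets are worth checking.
            if absolute.startswith(("http://", "https://")):
                self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = (
    '<a href="/about#team">About</a>'
    '<img src="img/logo.png">'
    '<a href="mailto:x@example.com">Mail</a>'
)
links = extract_links(page, "https://example.com/docs/index.html")
```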
Redirect Handling
Follow up to 10 redirects per link, and detect loops by keeping a set of the URLs already seen in the chain.
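A minimal sketch of this logic follows. To keep it testable without a network, it takes a hypothetical `fetch(url)` callable that returns `(status, location)`; that callable, the function name, and the shape of the chain entries are assumptions. The source does not say which error_type applies when the 10-redirect cap is hit, so this sketch reuses redirect_loop for that case.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 10
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def follow_redirects(url, fetch, max_redirects=MAX_REDIRECTS):
    """Return (final_status, chain, error_type).

    fetch(url) is a hypothetical callable returning (status, location).
    """
    seen = {url}
    chain = []
    current = url
    while True:
        status, location = fetch(current)
        if status not in REDIRECT_STATUSES or not location:
            return status, chain, None
        if len(chain) >= max_redirects:
            # Assumption: exceeding the cap is reported as redirect_loop.
            return status, chain, "redirect_loop"
        # Location may be relative, so resolve it against the current URL.
        next_url = urljoin(current, location)
        chain.append({"from": current, "to": next_url, "status": status})
        if next_url in seen:
            return status, chain, "redirect_loop"
        seen.add(next_url)
        current = next_url


# Usage with canned responses: /a redirects to /b, which redirects back to /a.
responses = {
    "https://e.example/a": (301, "/b"),
    "https://e.example/b": (302, "/a"),
    "https://e.example/ok": (200, None),
}
status, chain, error = follow_redirects("https://e.example/a", responses.get)
```

The returned chain is a list of plain dictionaries, so it can be serialized directly into the redirect_chain column.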
External Link Handling
Check external links but do not crawl external domains.
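One way to express this rule is a small predicate that decides whether a URL belongs to the job's target domain. The function names are illustrative, and treating subdomains as internal is an assumption; a stricter checker might compare hosts exactly.

```python
from urllib.parse import urlsplit


def is_internal(url, target_domain):
    """True if url is on target_domain or one of its subdomains (assumed)."""
    host = (urlsplit(url).hostname or "").lower()
    target = target_domain.lower()
    return host == target or host.endswith("." + target)


def plan_for(url, target_domain):
    """Every link is checked; only internal pages are crawled further."""
    return {"check": True, "crawl": is_internal(url, target_domain)}
```

Matching on `"." + target` rather than a bare suffix keeps hosts like notexample.com from being treated as internal.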
Concurrency
Each job runs N workers, with a per-domain rate limit of at most 5 requests per second.
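The sketch below shows one possible shape for this: a thread pool for the workers and a limiter that spaces out requests to the same host. The class and function names, the evenly spaced slot strategy, and the injectable `clock` and `sleep` parameters (there for testing) are all assumptions; a token bucket would be an equally reasonable choice.

```python
import threading
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Space requests to each host at least 1/max_rps seconds apart."""

    def __init__(self, max_rps=5.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / max_rps
        self.clock = clock
        self.sleep = sleep
        self.next_free = defaultdict(float)  # host -> earliest free slot
        self.lock = threading.Lock()

    def acquire(self, url):
        """Block until this URL's host has a free slot; return the wait."""
        host = urlsplit(url).hostname or ""
        with self.lock:
            now = self.clock()
            slot = max(now, self.next_free[host])
            self.next_free[host] = slot + self.interval
        delay = slot - now
        if delay > 0:
            self.sleep(delay)  # sleep outside the lock so other hosts proceed
        return delay


def run_job(urls, check, workers=8, limiter=None):
    """Run check(url) for every URL using a pool of N worker threads."""
    limiter = limiter or DomainRateLimiter()

    def task(url):
        limiter.acquire(url)
        return check(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, urls))
```

Because slots are reserved under the lock but the sleep happens outside it, a slow host delays only the workers that are waiting on that host.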
Retry Policy
Retry a failing link up to 3 times, waiting 1 second between attempts, before marking it as broken. This filters out transient errors.
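A retry wrapper along these lines would implement the policy. It assumes a hypothetical `check_once(url)` callable that returns `None` on success or an error_type string on failure, and it reads the policy as three retries after the first attempt (four attempts in total), which is one interpretation of the wording.

```python
import time


def check_with_retry(url, check_once, retries=3, delay_s=1.0, sleep=time.sleep):
    """Return None if any attempt succeeds, else the last error_type.

    check_once(url) is a hypothetical callable: None on success,
    an error_type string (e.g. "timeout") on failure.
    """
    last_error = None
    for attempt in range(retries + 1):
        if attempt > 0:
            sleep(delay_s)  # fixed 1 s pause between attempts
        error = check_once(url)
        if error is None:
            return None
        last_error = error
    return last_error


# Usage: a flaky target that times out twice, then succeeds.
outcomes = iter(["timeout", "timeout", None])
pauses = []
result = check_with_retry(
    "https://example.com/flaky", lambda url: next(outcomes), sleep=pauses.append
)
```

A fixed delay matches the policy as written; exponential backoff is a common alternative when the target is rate limiting the checker.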
Report Generation
Group results by error_type and HTTP status, sorted by frequency.
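The grouping can be sketched with `collections.Counter`. The function name and the dictionary shape of each result row are assumptions, and the sketch orders groups from most to least frequent.

```python
from collections import Counter


def summarize(results):
    """Count broken links per (error_type, http_status), most frequent first."""
    counts = Counter(
        (r["error_type"], r["http_status"])
        for r in results
        if r["error_type"]  # rows without an error_type are healthy links
    )
    return counts.most_common()


rows = [
    {"error_type": "404", "http_status": 404},
    {"error_type": "timeout", "http_status": None},
    {"error_type": "404", "http_status": 404},
    {"error_type": None, "http_status": 200},
]
summary = summarize(rows)
```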
Scheduling
Run a full check weekly; a watch mode re-checks known-broken links daily.
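A scheduler tick could decide which run is due with a function like the one below. The function names and the rule that a full run also resets the daily watch timer are assumptions; in practice this decision might live in cron or a job queue instead.

```python
from datetime import datetime, timedelta

FULL_INTERVAL = timedelta(weeks=1)
WATCH_INTERVAL = timedelta(days=1)


def due_mode(now, last_full, last_watch):
    """Return "full", "watch", or None depending on what is due at `now`."""
    if last_full is None or now - last_full >= FULL_INTERVAL:
        return "full"
    # Assumption: a full run already re-checks broken links, so the
    # watch timer counts from whichever run happened most recently.
    last_recheck = max(t for t in (last_full, last_watch) if t is not None)
    if now - last_recheck >= WATCH_INTERVAL:
        return "watch"
    return None


def watch_targets(previous_results):
    """URLs that watch mode re-checks: those recorded with an error_type."""
    return sorted({r["target_url"] for r in previous_results if r["error_type"]})


now = datetime(2024, 1, 10, 0, 0)
mode = due_mode(now, last_full=datetime(2024, 1, 8), last_watch=None)
```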
Notifications
Email a report with the broken-link count and a CSV export.
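As an illustration, the report email can be assembled with the standard library's `csv` and `email.message` modules. The function name, the chosen CSV columns, the subject wording, and the attachment filename are assumptions; the sketch only builds the message and leaves delivery to the caller.

```python
import csv
import io
from email.message import EmailMessage

# Assumed subset of LinkResult columns to include in the export.
FIELDS = ["source_url", "target_url", "http_status", "error_type", "checked_at"]


def build_report_email(job, broken, sender, recipient):
    """Build an email with the broken-link count and a CSV attachment."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(broken)

    msg = EmailMessage()
    msg["Subject"] = f"Link check for {job['target_domain']}: {len(broken)} broken"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"{len(broken)} of {job['total_links']} links are broken. "
        "Details are attached as CSV."
    )
    msg.add_attachment(
        buf.getvalue().encode("utf-8"),
        maintype="text",
        subtype="csv",
        filename="broken_links.csv",
    )
    return msg


job = {"target_domain": "example.com", "total_links": 40}
broken = [{
    "source_url": "https://example.com/",
    "target_url": "https://example.com/old",
    "http_status": 404,
    "error_type": "404",
    "checked_at": "2024-01-10T00:00:00Z",
}]
message = build_report_email(job, broken, "checker@example.com", "owner@example.com")
```

The resulting message can be handed to `smtplib.SMTP(...).send_message(message)` or to whatever mail service the deployment uses.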
Ignore List
Exclude known third-party links that are slow but not broken.
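An ignore list could be as simple as a set of hostname patterns consulted before a link is queued. The pattern syntax (shell-style wildcards), the example hostnames, and the function name are all hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical patterns; a real deployment would load these from config.
IGNORE_PATTERNS = ["*.slow-partner.example", "status.vendor.example"]


def is_ignored(url, patterns=IGNORE_PATTERNS):
    """True if the URL's host matches any ignore pattern."""
    host = (urlsplit(url).hostname or "").lower()
    return any(fnmatchcase(host, pattern) for pattern in patterns)
```

Matching on the hostname rather than the full URL keeps the list short, at the cost of ignoring every path on a listed host.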
FAQ
Why use HTTP HEAD instead of GET for link checking?
HEAD requests return only the response headers, not the response body, which saves significant bandwidth when checking large numbers of links. A well-behaved server returns the same status code and redirect behavior for HEAD as for GET, so HEAD is usually sufficient to determine whether a link is broken. Servers that reject HEAD need a GET fallback.
How do you detect redirect loops in a link checker?
Maintain a seen-URL set per redirect chain. If a redirect destination is already in the set, a loop is detected and the link is marked with error_type=redirect_loop. Redirect following is also capped at 10 hops to handle excessively long chains.
How should a link checker handle external domains?
External links are checked with HTTP HEAD requests to verify that they resolve, but the checker does not crawl or follow links on external domains. An ignore list can be configured to skip third-party links that are known to be slow but valid, which reduces false positives.
What retry and scheduling strategy works best for link checking?
Retry each failing link 3 times with a 1-second delay before marking it as broken, to filter out transient errors. Run a full site check weekly. Use a watch mode that re-checks previously identified broken links daily so that issues are tracked and reported promptly.