Error Tracking Service Low-Level Design
An error tracking service answers two questions: “What broke?” and “Has it broken before?” It does this by capturing exceptions from running applications, grouping recurring errors into a single issue, counting occurrences, and alerting without overwhelming on-call with duplicate notifications.
Error Capture
An SDK integrated into the application wraps the global unhandled exception handler and exposes a manual capture API. On each exception it records:
- Exception class, message, and stack trace
- user_id, session_id — for impact analysis
- environment — production, staging, dev
- release version — links errors to deploys
- request context — URL, HTTP method, headers (sanitized)
- tags — arbitrary key/value pairs set by the application
Capture is asynchronous — the SDK enqueues the event to a background thread and returns immediately so the request path is not blocked.
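A minimal sketch of the non-blocking capture path, assuming a bounded in-memory queue drained by one daemon thread. The class name ErrorCaptureClient, the send callable, and the drop-when-full policy are illustrative assumptions, not part of any particular SDK.

```python
import queue
import threading
import traceback


class ErrorCaptureClient:
    """Hypothetical SDK client: capture() enqueues, a worker thread ships."""

    def __init__(self, send, max_queue=1000):
        self._send = send  # assumed callable that delivers one event
        self._queue = queue.Queue(maxsize=max_queue)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def capture(self, exc, **context):
        """Build the event and enqueue it without blocking the caller."""
        event = {
            "exception_class": type(exc).__name__,
            "message": str(exc),
            "stack_trace": "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
            "context": context,
        }
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False  # assumption: drop the event rather than block

    def _run(self):
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                pass  # a delivery failure must not crash the host application
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every queued event has been handed to send()."""
        self._queue.join()
```

Dropping events when the queue is full trades completeness for a guarantee that capture never stalls the request path; an SDK could instead count the drops and report them.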
Ingestion Pipeline
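The SDK ships events to an ingestion API which writes them to Kafka. A consumer reads from Kafka and performs: fingerprint computation → group lookup → occurrence increment → storage. Because Kafka delivers at-least-once, processing must tolerate duplicates. A redelivered event always maps to the same fingerprint, so it never creates a second issue. On its own, though, it still increments the counter; to make processing fully idempotent the consumer also has to deduplicate, for example on a unique event ID assigned by the SDK.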
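One common way to keep a redelivered event from being counted twice is to deduplicate on an event ID before incrementing. The sketch below assumes each event carries an event_id and a precomputed fingerprint (both field names are hypothetical) and uses in-memory structures where a real consumer would more likely use a unique constraint or a TTL cache.

```python
class OccurrenceCounter:
    """Hypothetical consumer state: counts each event_id at most once."""

    def __init__(self):
        self.seen_event_ids = set()
        self.counts = {}  # fingerprint -> occurrence count

    def process(self, event):
        """Return True if the event was counted, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return False  # redelivery: skip the increment entirely
        self.seen_event_ids.add(event_id)
        fingerprint = event["fingerprint"]
        self.counts[fingerprint] = self.counts.get(fingerprint, 0) + 1
        return True
```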
Stack Trace Fingerprinting
Fingerprinting is the core of grouping. The algorithm normalizes the stack trace to remove noise before hashing:
- Strip line numbers and memory addresses from each frame
- Extract {module, function} pairs from each frame
- Apply a blocklist to remove framework internals from the top of the stack
- SHA-256 hash the resulting normalized frame list
Two exceptions from the same code path produce the same fingerprint even if they occurred at different line numbers after a minor refactor. This is the desired behavior — it keeps related errors grouped together.
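The normalization steps can be sketched roughly as follows. This assumes frames arrive as dictionaries ordered from the top of the stack downward, with module, function, and lineno keys; the blocklist prefixes and the frame layout are illustrative assumptions.

```python
import hashlib
import re
from itertools import dropwhile

# Hypothetical blocklist of framework module prefixes.
FRAMEWORK_PREFIXES = ("django.", "flask.", "werkzeug.")
_HEX_ADDRESS = re.compile(r"0x[0-9a-fA-F]+")


def fingerprint(frames):
    """Hash the normalized (module, function) pairs of a stack trace."""
    # Drop framework frames from the top of the stack only.
    app_frames = dropwhile(
        lambda f: f["module"].startswith(FRAMEWORK_PREFIXES), frames
    )
    # Line numbers are ignored by never reading 'lineno'; memory
    # addresses embedded in function names are replaced by a placeholder.
    normalized = [
        (f["module"], _HEX_ADDRESS.sub("0x?", f["function"]))
        for f in app_frames
    ]
    payload = "\n".join(f"{module}:{function}" for module, function in normalized)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because only module and function names feed the hash, moving code within a function leaves the fingerprint unchanged, while renaming the function or module produces a new issue.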
Issue Grouping
The consumer looks up the fingerprint in the issues table:
- Existing issue: increment occurrence_count, update last_seen, store a sampled copy of the full occurrence
- New fingerprint: create an issue with status UNRESOLVED, store the first full occurrence
The issue record holds a summary of the first occurrence plus aggregated stats. Individual occurrences are sampled: store every occurrence for the first 1000 of an issue, then keep roughly 10% of the ones that follow.
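A sketch of the lookup-or-create step together with the sampling policy, reading it as "keep the first 1000 occurrences in full, then keep about 10%". The Issue dataclass, the in-memory issues dictionary, and the injectable rng parameter are assumptions made so the example is self-contained and testable.

```python
import random
from dataclasses import dataclass

FULL_CAPTURE_LIMIT = 1000  # occurrences stored in full per issue
SAMPLE_RATE = 0.10         # fraction kept after the limit


@dataclass
class Issue:
    fingerprint: str
    status: str = "UNRESOLVED"
    occurrence_count: int = 0
    first_seen: float = 0.0
    last_seen: float = 0.0


def record_occurrence(issues, stored, event, rng=random.random):
    """Update (or create) the issue for an event and maybe store the payload."""
    fingerprint = event["fingerprint"]
    issue = issues.get(fingerprint)
    if issue is None:
        # New fingerprint: create the issue as UNRESOLVED.
        issue = Issue(fingerprint=fingerprint, first_seen=event["timestamp"])
        issues[fingerprint] = issue
    issue.occurrence_count += 1
    issue.last_seen = event["timestamp"]
    # The count is always exact; only the stored payloads are sampled.
    if issue.occurrence_count <= FULL_CAPTURE_LIMIT or rng() < SAMPLE_RATE:
        stored.append(event)
    return issue
```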
Alert Deduplication
Naive alerting fires on every occurrence, causing alert fatigue. Deduplication rules:
- Alert on new issue creation (first ever occurrence of this fingerprint)
- Alert on regression — issue was RESOLVED, then a new occurrence arrived
- No alert for subsequent occurrences of a known UNRESOLVED issue
- Optional: alert on spike — occurrence rate exceeds N/minute threshold
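These rules reduce to a small decision function. The sketch below is one possible encoding; the function name, its parameters, and the returned reason strings are hypothetical, and the early return for IGNORED issues follows the lifecycle description rather than the rules listed here.

```python
def alert_reason(is_new_issue, previous_status, rate_per_minute=0.0,
                 spike_threshold=None):
    """Return why an occurrence should alert, or None to stay silent."""
    if is_new_issue:
        return "new_issue"      # first occurrence of this fingerprint
    if previous_status == "IGNORED":
        return None             # silenced issues never alert
    if previous_status == "RESOLVED":
        return "regression"     # resolved issue came back
    if spike_threshold is not None and rate_per_minute > spike_threshold:
        return "spike"          # optional rate-based rule
    return None                 # known UNRESOLVED issue: no alert
```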
Occurrence Storage
Two storage layers serve different query patterns:
- Issues table (PostgreSQL): issue_id, fingerprint, title, first_seen, last_seen, occurrence_count, status, release_introduced, release_fixed
- Occurrences table / blob store: full event payload with stack trace, context, and user info, stored in S3 or ClickHouse to absorb high write volume
Source Map Integration
Minified JavaScript stack traces are unreadable. On each production deploy, upload source maps keyed by {release_version, chunk_filename}. When a minified stack trace arrives, the consumer looks up the source map for the event's release and chunk and uses it to translate each frame back to the original file, line, and column. Source maps are stored in S3 and fetched by the consumer at processing time.
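The frame translation can be illustrated with a simplified lookup. The sketch assumes the source map's mappings have already been decoded into a sorted list of tuples (generated line, generated column, source file, original line, original column); the key layout in source_map_key is likewise an assumption. A production consumer would normally rely on an existing source map library for the decoding.

```python
import bisect


def source_map_key(release_version, chunk_filename):
    """Hypothetical object-store key for a release's source map."""
    return f"sourcemaps/{release_version}/{chunk_filename}.map"


def resolve_frame(mappings, line, column):
    """Map a minified (line, column) to (source, orig_line, orig_col).

    Picks the last mapping on the same generated line whose column is
    at or before the requested column; returns None if there is none.
    """
    keys = [(m[0], m[1]) for m in mappings]
    index = bisect.bisect_right(keys, (line, column)) - 1
    if index < 0 or mappings[index][0] != line:
        return None
    _, _, source, orig_line, orig_col = mappings[index]
    return (source, orig_line, orig_col)
```

Rebuilding the keys list on every call keeps the sketch short; a consumer resolving many frames against one map would build it once and cache it alongside the fetched map.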
Issue Lifecycle
- UNRESOLVED — active, alerts fire on creation
- RESOLVED — manually resolved by engineer, or auto-resolved after 7 days with no occurrences
- REGRESSED — a new occurrence arrived after RESOLVED; triggers re-alert
- IGNORED — silenced permanently or until next release
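The lifecycle reads naturally as a small state machine. In the sketch below the trigger names (resolve, ignore, unignore, occurrence) are hypothetical, and letting REGRESSED issues auto-resolve on the same 7-day rule is an assumption the text does not spell out.

```python
from datetime import datetime, timedelta

AUTO_RESOLVE_AFTER = timedelta(days=7)

# (current status, trigger) -> next status; trigger names are illustrative.
TRANSITIONS = {
    ("UNRESOLVED", "resolve"): "RESOLVED",
    ("UNRESOLVED", "ignore"): "IGNORED",
    ("REGRESSED", "resolve"): "RESOLVED",
    ("REGRESSED", "ignore"): "IGNORED",
    ("RESOLVED", "occurrence"): "REGRESSED",
    ("IGNORED", "unignore"): "UNRESOLVED",
}


def transition(status, trigger):
    """Return the next status; unknown combinations leave it unchanged."""
    return TRANSITIONS.get((status, trigger), status)


def should_auto_resolve(status, last_seen, now):
    """True when an active issue has been quiet for the auto-resolve window."""
    active = status in ("UNRESOLVED", "REGRESSED")
    return active and now - last_seen >= AUTO_RESOLVE_AFTER
```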
Alert Channels and Release Tracking
Deliver alerts to Slack, email, or PagerDuty based on configurable rules (e.g., ERROR rate > 100/min → PagerDuty). Each issue records which release first introduced it and, if resolved, which release fixed it, enabling regression detection across deploys and correlating error spikes with deployment events.
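Configurable routing of this kind might be modeled as a list of rules evaluated against the current error rate. The Rule fields and the route function below are hypothetical; only the channel names and the 100/min threshold come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    channel: str                       # e.g. "slack", "email", "pagerduty"
    min_rate_per_minute: float = 0.0   # fire only above this rate
    environments: tuple = ("production",)


def route(rules, rate_per_minute, environment):
    """Return the channels whose rules match the current rate and environment."""
    return [
        rule.channel
        for rule in rules
        if environment in rule.environments
        and rate_per_minute > rule.min_rate_per_minute
    ]


RULES = [
    Rule(channel="slack"),
    Rule(channel="pagerduty", min_rate_per_minute=100),
]
```

With these rules, a production rate of 150/min reaches both Slack and PagerDuty, while 20/min reaches Slack only.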