Error Tracking Service Low-Level Design: Exception Grouping, Fingerprinting, and Alert Deduplication

An error tracking service answers two questions: “What broke?” and “Has it broken before?” It does this by capturing exceptions from running applications, grouping recurring errors into a single issue, counting occurrences, and alerting without overwhelming on-call with duplicate notifications.

Error Capture

An SDK integrated into the application wraps the global unhandled exception handler and exposes a manual capture API. On each exception it records:

  • Exception class, message, and stack trace
  • user_id, session_id — for impact analysis
  • environment — production, staging, dev
  • release version — links errors to deploys
  • request context — URL, HTTP method, headers (sanitized)
  • tags — arbitrary key/value pairs set by the application

Capture is asynchronous — the SDK enqueues the event to a background thread and returns immediately so the request path is not blocked.
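
The enqueue-and-return behavior can be sketched as follows. This is a minimal illustration, not a real SDK: the class name ErrorClient, the transport callable, and the bounded queue size are all assumptions made for the example.

```python
import queue
import threading
import traceback


class ErrorClient:
    """Hypothetical SDK client: capture() enqueues, a daemon thread ships events."""

    def __init__(self, transport, max_queue=1000):
        self._transport = transport              # assumed: callable taking one event dict
        self._queue = queue.Queue(maxsize=max_queue)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def capture(self, exc, **context):
        """Build the event and enqueue it without blocking; False means it was dropped."""
        event = {
            "exception_class": type(exc).__name__,
            "message": str(exc),
            "stack_trace": traceback.format_exception(type(exc), exc, exc.__traceback__),
            **context,                           # environment, release, tags, ...
        }
        try:
            self._queue.put_nowait(event)        # never block the request path
            return True
        except queue.Full:
            return False                         # drop rather than stall the caller

    def _run(self):
        while True:
            event = self._queue.get()
            try:
                self._transport(event)
            except Exception:
                pass                             # a failing transport must not kill the worker
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every queued event has been handed to the transport."""
        self._queue.join()
```

Dropping events when the queue is full is one possible policy; it trades completeness for a guarantee that capture never stalls the application.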

Ingestion Pipeline

The SDK ships events to an ingestion API, which writes them to Kafka. A consumer reads from Kafka and performs: fingerprint computation → group lookup → occurrence increment → storage. Kafka delivers at least once, so processing must be idempotent: a redelivered event must neither create a new issue nor increment the counter a second time. One common way to achieve this is to deduplicate on a unique event ID assigned by the SDK.
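
A minimal sketch of redelivery-safe processing, assuming every event carries a unique event_id field (a hypothetical name) and that the fingerprint has already been computed. The in-memory set stands in for whatever durable store with a uniqueness guarantee a real consumer would use.

```python
class IdempotentConsumer:
    """Applies each event at most once, keyed by an assumed unique event_id."""

    def __init__(self):
        self.seen_event_ids = set()   # stand-in for a durable store with a unique key
        self.occurrence_counts = {}   # fingerprint -> occurrence_count

    def handle(self, event):
        """Return True if the event was applied, False if it was a redelivery."""
        if event["event_id"] in self.seen_event_ids:
            return False              # duplicate: no new issue, no extra increment
        self.seen_event_ids.add(event["event_id"])
        fp = event["fingerprint"]
        self.occurrence_counts[fp] = self.occurrence_counts.get(fp, 0) + 1
        return True
```

With this check in place, delivering the same event twice leaves the counter at 1, while two distinct events with the same fingerprint raise it to 2.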

Stack Trace Fingerprinting

Fingerprinting is the core of grouping. The algorithm normalizes the stack trace to remove noise before hashing:

  • Strip line numbers and memory addresses from each frame
  • Extract {module, function} pairs from each frame
  • Apply a blocklist to remove framework internals from the top of the stack
  • SHA-256 hash the resulting normalized frame list

Two exceptions from the same code path produce the same fingerprint even if they occurred at different line numbers after a minor refactor. This is the desired behavior — it keeps related errors grouped together.
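
The normalization steps above can be sketched as a single function. The frame layout (dicts with module, function, and lineno, ordered so that index 0 is the top of the stack) and the blocklist prefixes are assumptions chosen for illustration.

```python
import hashlib
import re

# Hypothetical blocklist: module prefixes treated as framework internals.
FRAMEWORK_BLOCKLIST = ("django.", "flask.", "werkzeug.")
_HEX_ADDR = re.compile(r"0x[0-9a-fA-F]+")


def fingerprint(frames):
    """Hash the normalized {module, function} list of a stack trace.

    frames: list of dicts with 'module', 'function', 'lineno';
    index 0 is assumed to be the top of the stack.
    """
    normalized = []
    for frame in frames:
        # Memory addresses can leak into names such as "<lambda at 0x7f3a>".
        function = _HEX_ADDR.sub("", frame["function"])
        normalized.append((frame["module"], function))   # lineno is dropped on purpose

    # Remove framework internals from the top of the stack only.
    while normalized and normalized[0][0].startswith(FRAMEWORK_BLOCKLIST):
        normalized.pop(0)

    payload = "\n".join(f"{module}:{function}" for module, function in normalized)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because line numbers never reach the hash, two traces through shop.cart:add_item at lines 42 and 57 yield the same fingerprint, while a trace through a different function does not.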

Issue Grouping

The consumer looks up the fingerprint in the issues table:

  • Existing issue: increment occurrence_count, update last_seen, store a sampled copy of the full occurrence
  • New fingerprint: create an issue with status UNRESOLVED, store the first full occurrence

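The issue record holds a summary of the first occurrence plus aggregated stats. Individual occurrences are sampled: store every occurrence for the first 1,000 per issue, then keep roughly 10% of later ones. The counter still increments on every occurrence, so occurrence_count stays exact even when a payload is not stored.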
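
The lookup-or-create step and the sampling policy can be sketched together. This is an in-memory illustration under stated assumptions: the Issue dataclass, the dict keyed by fingerprint, and the injectable rng parameter are hypothetical, and a real consumer would do the same thing against the issues table.

```python
import random
from dataclasses import dataclass, field

FULL_CAPTURE_LIMIT = 1000   # keep every occurrence up to this count
SAMPLE_RATE = 0.10          # afterwards keep roughly 10%


@dataclass
class Issue:
    fingerprint: str
    status: str = "UNRESOLVED"
    occurrence_count: int = 0
    first_seen: float = 0.0
    last_seen: float = 0.0
    stored_occurrences: list = field(default_factory=list)


def record_occurrence(issues, fp, event, now, rng=random.random):
    """Upsert the issue for fingerprint fp and return (issue, is_new)."""
    issue = issues.get(fp)
    is_new = issue is None
    if is_new:
        issue = issues[fp] = Issue(fingerprint=fp, first_seen=now)

    # The counter is exact even when the payload is sampled away.
    issue.occurrence_count += 1
    issue.last_seen = now

    if issue.occurrence_count <= FULL_CAPTURE_LIMIT or rng() < SAMPLE_RATE:
        issue.stored_occurrences.append(event)
    return issue, is_new
```

Passing rng as a parameter is only a convenience for testing; it lets the sampling decision be made deterministic.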

Alert Deduplication

Naive alerting fires on every occurrence, causing alert fatigue. Deduplication rules:

  • Alert on new issue creation (first ever occurrence of this fingerprint)
  • Alert on regression — issue was RESOLVED, then a new occurrence arrived
  • No alert for subsequent occurrences of a known UNRESOLVED issue
  • Optional: alert on spike — occurrence rate exceeds N/minute threshold
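
A minimal sketch of these rules, assuming the consumer knows whether the issue was just created and what its status was before the occurrence arrived. The function and class names are hypothetical. The spike detector uses a 60-second sliding window and fires once per threshold crossing, which is one reasonable reading of the optional rule rather than the only one.

```python
from collections import deque


def alert_reason(is_new, previous_status):
    """Return why an occurrence should alert, or None to stay silent."""
    if is_new:
        return "new_issue"
    if previous_status == "RESOLVED":
        return "regression"
    return None   # known UNRESOLVED (or IGNORED) issue: no alert


class SpikeDetector:
    """Fires once when the rate over the last 60 seconds exceeds the threshold."""

    def __init__(self, threshold_per_minute):
        self.threshold = threshold_per_minute
        self._timestamps = deque()
        self._firing = False

    def observe(self, now):
        """Record one occurrence at time `now` (seconds); return True on a new spike."""
        self._timestamps.append(now)
        # Drop occurrences that fell out of the 60-second window.
        while self._timestamps[0] <= now - 60:
            self._timestamps.popleft()
        over = len(self._timestamps) > self.threshold
        fire = over and not self._firing   # alert only on the crossing itself
        self._firing = over
        return fire
```

Latching on the crossing keeps the spike rule from reintroducing the per-occurrence noise that deduplication is meant to remove.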

Occurrence Storage

Two storage layers serve different query patterns:

  • Issues table (PostgreSQL): issue_id, fingerprint, title, first_seen, last_seen, occurrence_count, status, release_introduced, release_fixed
  • Occurrences table / blob store: full event payload with stack trace, context, and user info, stored in S3 or ClickHouse to absorb high write volume
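
The issues table and its create-or-increment write can be sketched as a single upsert. To keep the example runnable without a database server it uses SQLite from the Python standard library as a stand-in for PostgreSQL; the column types and constraints are assumptions, and driver placeholder syntax will differ in a real PostgreSQL client.

```python
import sqlite3

# Columns follow the list above; types and constraints are assumed.
SCHEMA = """
CREATE TABLE issues (
    issue_id           INTEGER PRIMARY KEY,
    fingerprint        TEXT NOT NULL UNIQUE,
    title              TEXT NOT NULL,
    first_seen         TEXT NOT NULL,
    last_seen          TEXT NOT NULL,
    occurrence_count   INTEGER NOT NULL DEFAULT 1,
    status             TEXT NOT NULL DEFAULT 'UNRESOLVED',
    release_introduced TEXT,
    release_fixed      TEXT
)
"""

# Insert a new issue, or bump the counter and last_seen of the existing one.
UPSERT = """
INSERT INTO issues (fingerprint, title, first_seen, last_seen, release_introduced)
VALUES (:fp, :title, :ts, :ts, :release)
ON CONFLICT (fingerprint) DO UPDATE SET
    occurrence_count = issues.occurrence_count + 1,
    last_seen = excluded.last_seen
"""


def upsert_issue(conn, fp, title, ts, release):
    """Create the issue on first sight of fp; otherwise increment it atomically."""
    conn.execute(UPSERT, {"fp": fp, "title": title, "ts": ts, "release": release})
```

Doing the increment inside one statement avoids a read-modify-write race when several consumers process the same fingerprint concurrently.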

Source Map Integration

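Minified JavaScript stack traces are unreadable. On each production deploy, upload source maps keyed by {release_version, chunk_filename}. When a minified stack trace arrives, the consumer applies the matching source map to translate each frame back to the original file, line, and column. Source maps are stored in S3 and fetched by the consumer at processing time. Because minified names and positions change from build to build, it is safest to run this step before fingerprinting so the same bug groups consistently across releases.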
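
A sketch of the per-frame lookup, under a significant simplifying assumption: the source map has already been decoded into a list of (generated_line, generated_column, source_file, source_line, source_column) tuples. Decoding the actual source map format is omitted. DecodedSourceMap, symbolicate, and the load_map callable (standing in for the fetch from S3) are hypothetical names.

```python
import bisect


class DecodedSourceMap:
    """Position lookup over mappings that were decoded elsewhere (assumed input)."""

    def __init__(self, mappings):
        self._mappings = sorted(mappings)
        self._keys = [(m[0], m[1]) for m in self._mappings]

    def lookup(self, gen_line, gen_col):
        """Return (file, line, column) of the closest mapping at or before the position."""
        i = bisect.bisect_right(self._keys, (gen_line, gen_col)) - 1
        if i < 0 or self._mappings[i][0] != gen_line:
            return None                      # nothing mapped on this generated line
        _, _, src_file, src_line, src_col = self._mappings[i]
        return src_file, src_line, src_col


def symbolicate(frame, release, load_map, cache):
    """Translate one minified frame, caching maps by (release, chunk filename)."""
    key = (release, frame["filename"])
    if key not in cache:
        cache[key] = load_map(*key)          # assumed to fetch and decode the map
    original = cache[key].lookup(frame["line"], frame["column"])
    if original is None:
        return frame                         # leave the frame as it arrived
    src_file, src_line, src_col = original
    return {"filename": src_file, "line": src_line, "column": src_col}
```

Caching by the same {release_version, chunk_filename} key used at upload time means a burst of errors from one release fetches each map only once.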

Issue Lifecycle

  • UNRESOLVED — active, alerts fire on creation
  • RESOLVED — manually resolved by engineer, or auto-resolved after 7 days with no occurrences
  • REGRESSED — a new occurrence arrived after RESOLVED; triggers re-alert
  • IGNORED — silenced permanently or until next release
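
The transitions above can be sketched as two small functions. The names are hypothetical, and one detail is an assumption not stated in the list: REGRESSED issues are treated like UNRESOLVED ones for the purpose of the 7-day auto-resolve.

```python
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    UNRESOLVED = "UNRESOLVED"
    RESOLVED = "RESOLVED"
    REGRESSED = "REGRESSED"
    IGNORED = "IGNORED"


AUTO_RESOLVE_AFTER = timedelta(days=7)


def on_occurrence(status):
    """Return (new_status, should_alert) when an occurrence hits an existing issue."""
    if status is Status.RESOLVED:
        return Status.REGRESSED, True    # regression: re-alert
    return status, False                 # UNRESOLVED, REGRESSED, IGNORED stay put


def auto_resolve(status, last_seen, now):
    """Resolve active issues that have been quiet for 7 days (assumed to include REGRESSED)."""
    active = status in (Status.UNRESOLVED, Status.REGRESSED)
    if active and now - last_seen >= AUTO_RESOLVE_AFTER:
        return Status.RESOLVED
    return status
```

A periodic job could call auto_resolve over active issues, while on_occurrence runs inline in the consumer.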

Alert Channels and Release Tracking

Deliver alerts to Slack, email, or PagerDuty based on configurable rules (e.g., ERROR rate > 100/min → PagerDuty). Each issue records which release first introduced it and, if resolved, which release fixed it, enabling regression detection across deploys and correlating error spikes with deployment events.
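
Configurable routing can be sketched as a list of rules evaluated against each alert. The Rule fields, the alert dict keys, and the choice to limit both example rules to production are assumptions; only the "rate above 100/min goes to PagerDuty" threshold comes from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rule:
    channel: str                                   # "slack", "email", "pagerduty"
    rate_above: Optional[float] = None             # None means no rate condition
    environments: Tuple[str, ...] = ("production",)


# Hypothetical configuration: page on high rates, post everything else to Slack.
RULES = [
    Rule("pagerduty", rate_above=100),
    Rule("slack"),
]


def route(alert, rules=RULES):
    """Return the channels, in rule order, whose conditions the alert satisfies."""
    channels = []
    for rule in rules:
        if alert["environment"] not in rule.environments:
            continue
        if rule.rate_above is not None and not alert["rate_per_min"] > rule.rate_above:
            continue
        channels.append(rule.channel)
    return channels
```

For example, a production alert at 150 occurrences per minute matches both rules, one at exactly 100 per minute reaches only Slack, and a staging alert matches neither.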
