Error Tracking Service Low-Level Design: Exception Grouping, Fingerprinting, and Alert Deduplication

An error tracking service answers two questions: “What broke?” and “Has it broken before?” It does this by capturing exceptions from running applications, grouping recurring errors into a single issue, counting occurrences, and alerting without overwhelming on-call with duplicate notifications.

Error Capture

An SDK integrated into the application wraps the global unhandled exception handler and exposes a manual capture API. On each exception it records:

  • Exception class, message, and stack trace
  • user_id, session_id — for impact analysis
  • environment — production, staging, dev
  • release version — links errors to deploys
  • request context — URL, HTTP method, headers (sanitized)
  • tags — arbitrary key/value pairs set by the application

Capture is asynchronous — the SDK enqueues the event to a background thread and returns immediately so the request path is not blocked.
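The non-blocking capture path can be sketched with a bounded queue drained by a background thread. This is a minimal sketch, not any particular SDK's API; the names `ErrorClient` and `transport` are illustrative.

```python
import queue
import threading

class ErrorClient:
    """Illustrative non-blocking capture client (not a real SDK)."""

    def __init__(self, transport, max_pending=1000):
        self._queue = queue.Queue(maxsize=max_pending)
        self._transport = transport  # callable that ships one event, e.g. HTTP POST
        threading.Thread(target=self._drain, daemon=True).start()

    def capture_exception(self, exc, **context):
        event = {"type": type(exc).__name__, "message": str(exc), **context}
        try:
            self._queue.put_nowait(event)  # never block the request path
        except queue.Full:
            pass  # shed load rather than slow the application down

    def _drain(self):
        while True:
            event = self._queue.get()
            self._transport(event)  # ship to the ingestion API
            self._queue.task_done()
```

Note the deliberate trade-off in `capture_exception`: when the queue is full, events are dropped rather than blocking the caller, which matches the "returns immediately" requirement above.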

Ingestion Pipeline

The SDK ships events to an ingestion API which writes them to Kafka. A consumer reads from Kafka and performs: fingerprint computation → group lookup → occurrence increment → storage. Processing is idempotent: if the same event is delivered twice (Kafka at-least-once), the duplicate increments the counter but does not create a new issue.

Stack Trace Fingerprinting

Fingerprinting is the core of grouping. The algorithm normalizes the stack trace to remove noise before hashing:

  • Strip line numbers and memory addresses from each frame
  • Extract {module, function} pairs from each frame
  • Apply a blocklist to remove framework internals from the top of the stack
  • SHA-256 hash the resulting normalized frame list

Two exceptions from the same code path produce the same fingerprint even if they occurred at different line numbers after a minor refactor. This is the desired behavior — it keeps related errors grouped together.
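The normalization steps above can be sketched as follows. The blocklist prefixes are illustrative; a real deployment would configure these per project.

```python
import hashlib

# Illustrative blocklist: frames from these module prefixes are framework
# noise and are dropped before hashing.
BLOCKLIST = ("django.", "flask.", "werkzeug.", "concurrent.futures")

def fingerprint(frames):
    """frames: list of (module, function, line_no) tuples.

    Line numbers are dropped so a minor refactor that shifts code by a few
    lines does not split an existing issue into a new one.
    """
    normalized = [
        f"{module}:{function}"
        for module, function, _line in frames
        if not module.startswith(BLOCKLIST)
    ]
    return hashlib.sha256("\n".join(normalized).encode()).hexdigest()
```

Two traces that differ only in line numbers, or only in blocklisted framework frames, hash to the same fingerprint.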

Issue Grouping

The consumer looks up the fingerprint in the issues table:

  • Existing issue: increment occurrence_count, update last_seen, store a sampled copy of the full occurrence
  • New fingerprint: create an issue with status UNRESOLVED, store the first full occurrence

The issue record holds a summary of the first occurrence plus aggregated stats. Individual occurrences are sampled — store 100% of occurrences for new issues, then tail-sample to 10% after the first 1000.
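The lookup-or-create path with sampling can be sketched against an in-memory store standing in for the issues table. The constants mirror the 100%-then-10%-after-1000 policy above; `IssueStore` is a stand-in, not a real schema.

```python
import random
import time

SAMPLE_AFTER = 1000  # store every occurrence up to this count...
SAMPLE_RATE = 0.10   # ...then tail-sample roughly 10%

class IssueStore:
    """In-memory stand-in for the issues table (illustrative)."""

    def __init__(self):
        self.issues = {}  # fingerprint -> issue record

    def record(self, fp, event):
        issue = self.issues.get(fp)
        if issue is None:
            # New fingerprint: create the issue, always keep the first occurrence.
            issue = {"status": "UNRESOLVED", "count": 0,
                     "first_seen": time.time(), "samples": []}
            self.issues[fp] = issue
        issue["count"] += 1
        issue["last_seen"] = time.time()
        if issue["count"] <= SAMPLE_AFTER or random.random() < SAMPLE_RATE:
            issue["samples"].append(event)
        return issue
```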

Alert Deduplication

Naive alerting fires on every occurrence, causing alert fatigue. Deduplication rules:

  • Alert on new issue creation (first ever occurrence of this fingerprint)
  • Alert on regression — issue was RESOLVED, then a new occurrence arrived
  • No alert for subsequent occurrences of a known UNRESOLVED issue
  • Optional: alert on spike — occurrence rate exceeds N/minute threshold
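The four rules above reduce to a small decision function. This is a sketch; the reason strings and parameter names are illustrative.

```python
def should_alert(prior_status, is_new_issue, rate_per_min=0, spike_threshold=None):
    """Apply the deduplication rules; returns (alert?, reason)."""
    if is_new_issue:
        return True, "new-issue"        # first ever occurrence of this fingerprint
    if prior_status == "RESOLVED":
        return True, "regression"       # issue came back after being resolved
    if spike_threshold is not None and rate_per_min > spike_threshold:
        return True, "spike"            # optional rate-based rule
    return False, "duplicate"           # known UNRESOLVED issue: stay quiet
```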

Occurrence Storage

Two storage layers serve different query patterns:

  • Issues table (PostgreSQL): issue_id, fingerprint, title, first_seen, last_seen, occurrence_count, status, release_introduced, release_fixed
  • Occurrences table / blob store: full event payload with stack trace, context, user info — stored in S3 or ClickHouse for high write volume

Source Map Integration

JavaScript minified stack traces are unreadable. On each production deploy, upload source maps keyed by {release_version, chunk_filename}. When a minified stack trace arrives, the consumer applies the source map to translate each frame back to the original file, line, and column. Source maps are stored in S3 and fetched by the consumer at processing time.
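The translation step can be sketched as a lookup keyed by {release_version, chunk_filename}. Real source maps are VLQ-encoded JSON fetched from S3 and decoded by a library; here a plain dict stands in for the decoded mapping, so everything below is illustrative.

```python
# Pre-decoded source maps, keyed by (release_version, chunk_filename).
SOURCE_MAPS = {
    ("v1.2.3", "app.min.js"): {
        # (minified line, minified column) -> (original file, line, column)
        (1, 8810): ("src/checkout.ts", 214, 5),
    },
}

def symbolicate(release, frame):
    """Translate one minified frame back to original source coordinates."""
    mapping = SOURCE_MAPS.get((release, frame["file"]))
    if mapping is None:
        return frame  # no map uploaded for this release: keep the minified frame
    original = mapping.get((frame["line"], frame["col"]))
    if original is None:
        return frame
    file, line, col = original
    return {"file": file, "line": line, "col": col}
```

Falling back to the minified frame when no map exists keeps the pipeline total: a missing upload degrades readability, not processing.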

Issue Lifecycle

  • UNRESOLVED — active, alerts fire on creation
  • RESOLVED — manually resolved by engineer, or auto-resolved after 7 days with no occurrences
  • REGRESSED — a new occurrence arrived after RESOLVED; triggers re-alert
  • IGNORED — silenced permanently or until next release
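The lifecycle reduces to a small transition table. The event names (`resolve`, `new_occurrence`, `new_release`) are illustrative, not a fixed API.

```python
# Allowed status transitions for the issue lifecycle above.
TRANSITIONS = {
    ("UNRESOLVED", "resolve"): "RESOLVED",
    ("UNRESOLVED", "ignore"): "IGNORED",
    ("RESOLVED", "new_occurrence"): "REGRESSED",  # triggers a re-alert
    ("REGRESSED", "resolve"): "RESOLVED",
    ("IGNORED", "new_release"): "UNRESOLVED",     # "ignored until next release"
}

def transition(status, event):
    # Unknown combinations keep the current status: e.g. a new occurrence
    # of an UNRESOLVED issue changes nothing (and, per the dedup rules,
    # fires no alert).
    return TRANSITIONS.get((status, event), status)
```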

Alert Channels and Release Tracking

Deliver alerts to Slack, email, or PagerDuty based on configurable rules (e.g., ERROR rate > 100/min → PagerDuty). Each issue records which release first introduced it and, if resolved, which release fixed it, enabling regression detection across deploys and correlating error spikes with deployment events.

Frequently Asked Questions

How do you fingerprint exceptions to group them into issues despite varying messages or stack frames?

Build a fingerprint from the normalized stack trace rather than the exception message, since messages often contain dynamic values (user IDs, request parameters). Normalization steps: (1) strip line numbers and memory addresses from each frame; (2) remove frames from third-party or standard-library packages that appear in virtually every trace (e.g., HTTP framework internals, thread pool boilerplate); (3) take the top N application-owned frames (typically 5–8). Hash the resulting frame sequence with SHA-256. For exceptions with no meaningful stack (e.g., timeouts), fall back to exception type plus the first sentence of the message with numeric tokens replaced by '?'. Store the fingerprint on each incoming event and use it as the grouping key. Expose a manual override so engineers can re-fingerprint or merge issues that the algorithm splits incorrectly.

Design the exception grouping write path for 100,000 events per second.

Compute the fingerprint in the ingest service (CPU-bound, embarrassingly parallel — scale horizontally). Maintain a Redis hash keyed by project_id:fingerprint that stores the issue_id, first-seen timestamp, and a count. On each event, HINCRBY the count atomically — this is O(1) and Redis handles 100k ops/sec on a single node. If the key does not exist, create a new issue row in Postgres and set the Redis key with a TTL. Periodically (every 60s) flush Redis counters to Postgres in batch to keep the durable store current without per-event DB writes. Route raw events to a Kafka topic partitioned by fingerprint so a downstream consumer can update per-issue aggregates (affected users, release versions) without cross-partition coordination.

How do you deduplicate alerts so on-call engineers aren't flooded when a single issue spikes?

Implement alert deduplication with a per-issue alert state machine: states are CLOSED, OPEN, and SILENCED. When an issue crosses an alert threshold (e.g., >10 events in 1 minute), fire once and transition to OPEN — record the alert timestamp. Subsequent threshold breaches while OPEN do not re-alert. Transition to CLOSED when the issue drops below the threshold for a cooldown period (e.g., 10 minutes). Re-alert on transition from CLOSED→OPEN only. For regression detection — a resolved issue reappearing in a new release — always alert regardless of current state. Store alert state in Redis with Lua scripts to make state transitions atomic. Aggregate related issues (same fingerprint, different projects) into a single grouped alert to reduce noise further.

How would you surface the 'most impactful' issues rather than the highest-volume ones?

Define an impact score that weights affected unique users more heavily than raw event count, since a single-user infinite retry loop should not outrank an issue touching 10,000 users. Score formula example: impact = log10(1 + affected_users) * severity_weight * recency_decay, where severity_weight is derived from exception type (OOM or data corruption = 3, unhandled 5xx = 2, handled warning = 1) and recency_decay is an exponential decay factor based on hours since last seen (so stale issues sink). Compute scores in a batch job every 5 minutes using pre-aggregated per-issue metrics from the Postgres store. Index scores in Elasticsearch for the issues list API so engineers can sort by impact with sub-100ms response time. Surface 'trending' issues separately using a derivative: issues whose score increased the fastest in the last hour.
