Question 1

How do you fingerprint exceptions to group them into issues despite varying messages or stack frames?

Accepted Answer

Build a fingerprint from the normalized stack trace rather than the exception message, since messages often contain dynamic values (user IDs, request parameters). Normalization steps: (1) strip line numbers and memory addresses from each frame; (2) remove frames from third-party or standard-library packages that appear in virtually every trace (e.g., HTTP framework internals, thread pool boilerplate); (3) take the top N application-owned frames (typically 5–8). Hash the resulting frame sequence with SHA-256. For exceptions with no meaningful stack (e.g., timeouts), fall back to exception type + the first sentence of the message with numeric tokens replaced by '?'. Store the fingerprint on each incoming event and use it as the grouping key. Expose a manual override so engineers can re-fingerprint or merge issues that the algorithm splits incorrectly.
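The normalization and fallback steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Frame` type, the `NOISE_PREFIXES` list, and the choice of `TOP_N = 6` are assumptions made for the example, and frames are assumed to be ordered with the innermost call first.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    module: str      # e.g. "app.billing"
    function: str    # e.g. "charge"
    lineno: int = 0  # deliberately ignored when fingerprinting


# Hypothetical prefixes for framework/stdlib frames that appear in most traces.
NOISE_PREFIXES = ("flask.", "django.", "concurrent.futures", "threading")
TOP_N = 6  # assumed value inside the 5-8 range mentioned above

_HEX_ADDR = re.compile(r"0x[0-9a-fA-F]+")
_NUMBER = re.compile(r"\d+(?:\.\d+)?")
_SENTENCE_END = re.compile(r"[.!?]\s+")


def fingerprint(exc_type: str, message: str, frames: list) -> str:
    """Return a SHA-256 hex digest used as the grouping key."""
    # Steps 2 and 3: drop noise frames, keep the top N application frames.
    app_frames = [f for f in frames if not f.module.startswith(NOISE_PREFIXES)]
    app_frames = app_frames[:TOP_N]
    if app_frames:
        # Step 1: line numbers are left out; memory addresses are masked.
        basis = "\n".join(
            f"{f.module}:{_HEX_ADDR.sub('?', f.function)}" for f in app_frames
        )
    else:
        # Fallback: exception type + first sentence with numbers replaced by '?'.
        first_sentence = _SENTENCE_END.split(message, maxsplit=1)[0]
        basis = f"{exc_type}|{_NUMBER.sub('?', first_sentence)}"
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()
```

Two traces that differ only in line numbers, framework frames, or message text hash to the same value, while a trace with no application frames falls back to the type-plus-message key.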

Question 2

Design the exception grouping write path for 100,000 events per second.

Accepted Answer

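Compute the fingerprint in the ingest service; the work is CPU-bound and embarrassingly parallel, so scale it horizontally. Maintain a Redis hash keyed by project_id:fingerprint that stores the issue_id, first-seen timestamp, and a count. On each event, HINCRBY the count atomically. This is O(1), and a single Redis node can typically sustain on the order of 100k simple ops/sec, though the real ceiling depends on hardware, pipelining, and payload size, so shard by key if one node is not enough. If the key does not exist, create a new issue row in Postgres and set the Redis key with a TTL. Check for the key before incrementing, because HINCRBY silently creates a missing hash, and make the insert idempotent (e.g., a unique constraint on project_id + fingerprint with INSERT ... ON CONFLICT DO NOTHING) so that two ingest nodes racing on the same new fingerprint converge on one issue. Choose a TTL well above the flush interval so unflushed counts are not lost to expiry. Periodically (every 60s) flush Redis counters to Postgres in batch to keep the durable store current without per-event DB writes. Route raw events to a Kafka topic partitioned by fingerprint so a downstream consumer can update per-issue aggregates (affected users, release versions) without cross-partition coordination.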
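The control flow of this write path can be sketched without any infrastructure. In the illustration below, plain dictionaries stand in for Redis and Postgres; the class name `GroupingWritePath` and its fields are hypothetical, and a real deployment would replace the dictionary operations with the Redis and SQL calls named in the comments.

```python
import itertools


class GroupingWritePath:
    """In-memory stand-in: `cache` plays the Redis hash, `issues` the Postgres table."""

    def __init__(self):
        self.cache = {}   # (project_id, fingerprint) -> issue_id, first_seen, pending
        self.issues = {}  # (project_id, fingerprint) -> issue_id, first_seen, count
        self._ids = itertools.count(1)

    def _get_or_create_issue(self, key, ts):
        # With Postgres this would be an idempotent insert guarded by a unique
        # constraint on (project_id, fingerprint), so concurrent creators converge.
        row = self.issues.get(key)
        if row is None:
            row = {"issue_id": next(self._ids), "first_seen": ts, "count": 0}
            self.issues[key] = row
        return row

    def record(self, project_id, fingerprint, ts):
        """Handle one event and return the issue_id it was grouped into."""
        key = (project_id, fingerprint)
        entry = self.cache.get(key)
        if entry is None:  # cache miss: brand-new issue, or the key expired
            row = self._get_or_create_issue(key, ts)
            entry = {"issue_id": row["issue_id"],
                     "first_seen": row["first_seen"],
                     "pending": 0}
            self.cache[key] = entry
        entry["pending"] += 1  # the HINCRBY step
        return entry["issue_id"]

    def flush(self):
        """Batch pending counts into the durable store; return rows updated."""
        updated = 0
        for key, entry in self.cache.items():
            if entry["pending"]:
                self.issues[key]["count"] += entry["pending"]
                entry["pending"] = 0
                updated += 1
        return updated
```

Calling `flush()` on a timer gives the periodic batch write. Because the durable row is the source of truth for `issue_id`, an expired cache entry is rebuilt on the next event without creating a duplicate issue.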

Question 3

How do you deduplicate alerts so on-call engineers aren't flooded when a single issue spikes?

Accepted Answer

Implement alert deduplication with a per-issue alert state machine whose states are CLOSED, OPEN, and SILENCED (typically a manual mute that suppresses threshold alerts). When an issue crosses an alert threshold (e.g., >10 events in 1 minute), fire once, transition to OPEN, and record the alert timestamp. Subsequent threshold breaches while OPEN do not re-alert. Transition to CLOSED when the issue stays below the threshold for a cooldown period (e.g., 10 minutes). Re-alert only on the CLOSED→OPEN transition. For regression detection (a resolved issue reappearing in a new release), always alert regardless of current state. Store alert state in Redis and use Lua scripts so that each read-check-write state transition is atomic. Aggregate related issues (same fingerprint, different projects) into a single grouped alert to reduce noise further.
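A small in-process sketch of the state machine may make the transitions easier to follow. The names `IssueAlert` and `step`, and the constants `THRESHOLD` and `COOLDOWN_S`, are assumptions for illustration; in production the same logic would run inside a Lua script against Redis so the transition stays atomic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AlertState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    SILENCED = "silenced"


@dataclass
class IssueAlert:
    state: AlertState = AlertState.CLOSED
    alerted_at: Optional[float] = None   # when the last notification fired
    below_since: Optional[float] = None  # start of the current quiet period


THRESHOLD = 10    # events per minute, from the example above
COOLDOWN_S = 600  # 10 minutes


def step(alert: IssueAlert, events_last_minute: int, now: float,
         regression: bool = False) -> bool:
    """Advance the state machine; return True when a notification should fire."""
    if regression:
        # Regressions alert regardless of the current state, SILENCED included.
        alert.state, alert.alerted_at, alert.below_since = AlertState.OPEN, now, None
        return True
    if alert.state is AlertState.SILENCED:
        return False
    breached = events_last_minute > THRESHOLD
    if alert.state is AlertState.CLOSED:
        if breached:
            alert.state, alert.alerted_at = AlertState.OPEN, now
            return True
        return False
    # State is OPEN: never re-alert, only track the cooldown.
    if breached:
        alert.below_since = None
    elif alert.below_since is None:
        alert.below_since = now
    elif now - alert.below_since >= COOLDOWN_S:
        alert.state, alert.below_since = AlertState.CLOSED, None
    return False
```

A spike that keeps breaching the threshold yields exactly one notification, and the next one can only fire after the issue has stayed quiet for the full cooldown.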

Question 4

How would you surface the 'most impactful' issues rather than the highest-volume ones?

Accepted Answer

Define an impact score that weights affected unique users more heavily than raw event count, since a single-user infinite retry loop should not outrank an issue touching 10,000 users. Score formula example: impact = log10(1 + affected_users) * severity_weight * recency_decay, where severity_weight is derived from exception type (OOM or data corruption = 3, unhandled 5xx = 2, handled warning = 1) and recency_decay is an exponential decay factor based on hours since last seen (so stale issues sink). Compute scores in a batch job every 5 minutes using pre-aggregated per-issue metrics from the Postgres store. Index scores in Elasticsearch for the issues list API so engineers can sort by impact with sub-100ms response time. Surface 'trending' issues separately using a derivative: issues whose score increased the fastest in the last hour.
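The example formula translates directly into code. In the sketch below the severity keys and the 24-hour half-life are assumptions chosen for illustration; the answer above only says the decay is exponential in hours since last seen.

```python
import math

# Weights from the example above; the dictionary keys are hypothetical labels.
SEVERITY_WEIGHT = {
    "oom": 3,
    "data_corruption": 3,
    "unhandled_5xx": 2,
    "handled_warning": 1,
}
HALF_LIFE_HOURS = 24.0  # assumed: the score halves for every day of silence


def impact(affected_users: int, severity: str, hours_since_last_seen: float) -> float:
    """impact = log10(1 + affected_users) * severity_weight * recency_decay"""
    recency_decay = 0.5 ** (hours_since_last_seen / HALF_LIFE_HOURS)
    return math.log10(1 + affected_users) * SEVERITY_WEIGHT[severity] * recency_decay
```

The logarithm keeps the score driven by user reach rather than raw volume: under these assumptions a handled warning touching 10,000 users scores about 4.0, while an OOM hitting a single user in a retry loop scores about 0.9.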

Error Tracking Service Low-Level Design: Exception Grouping, Fingerprinting, and Alert Deduplication

Error Capture

Ingestion Pipeline

Stack Trace Fingerprinting

Issue Grouping

Alert Deduplication

Occurrence Storage

Source Map Integration

Issue Lifecycle

Alert Channels and Release Tracking