Design Mobile Crash Reporting and Telemetry

Crash reporting (Crashlytics, Sentry, Bugsnag) is a system design topic that touches mobile SDK design, server-side processing of millions of events, symbolication of native code, and the dashboards engineers actually use to debug. The interview tests whether you understand the layers and the engineering tradeoffs.

Functional requirements

  • Capture crashes on the device
  • Capture non-fatal errors and exceptions
  • Capture custom telemetry (breadcrumbs, custom events)
  • Upload reports to the backend reliably
  • Dedupe similar crashes; group by signature
  • Symbolicate (turn raw stack frames into source-level frames)
  • Surface in a dashboard

SDK design

The mobile SDK installs handlers for the following (a minimal Android sketch follows the list):

  • Uncaught exceptions (Java/Kotlin runtime exceptions on Android, NSException on iOS)
  • Native crashes (signal handlers — SIGSEGV, SIGABRT, etc.)
  • ANRs (Application Not Responding) on Android
  • Watchdog terminations on iOS (more difficult to capture)
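
The first of these is the simplest to sketch. A minimal Android version is below, assuming a hypothetical CrashStore that persists reports to disk; real SDKs also install native signal handlers via the NDK and an ANR watcher, which are not shown.

```kotlin
import java.io.File

// Hypothetical on-disk store; a real SDK would use a compact, corruption-tolerant format.
class CrashStore(private val dir: File) {
    fun persist(report: String) {
        dir.mkdirs()
        File(dir, "crash-${System.currentTimeMillis()}.txt").writeText(report)
    }
}

object CrashReporter {
    fun install(store: CrashStore) {
        val previous = Thread.getDefaultUncaughtExceptionHandler()
        Thread.setDefaultUncaughtExceptionHandler { thread, throwable ->
            try {
                // Stay minimal and defensive: the crash reporter must not crash.
                store.persist("thread=${thread.name}\n" + throwable.stackTraceToString())
            } catch (_: Throwable) {
                // Swallow our own failures; never mask the original crash.
            } finally {
                // Chain to the previous handler so other SDKs and the OS default still run.
                previous?.uncaughtException(thread, throwable)
            }
        }
    }
}
```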

SDK constraints:

  • Tiny binary impact (target <500KB)
  • Zero startup time impact
  • Robust to its own crashes (the crash reporter must not crash)

Capture mechanics

When a crash occurs:

  1. SDK signal handler runs in a constrained context (no malloc, limited APIs)
  2. Captures crash metadata: stack frames, registers, threads, current breadcrumbs
  3. Writes the report to local disk (network I/O is not safe from a crash handler)
  4. Process terminates
  5. On next app launch, SDK detects pending crash report and uploads
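
Step 5 is ordinary code on the next launch: scan the pending-report directory, upload, and delete each report only after the server acknowledges it. A sketch, with a hypothetical upload function standing in for the real HTTP client:

```kotlin
import java.io.File

// Hypothetical uploader; a real SDK would batch, compress, and retry with backoff.
fun upload(report: File): Boolean {
    // POST the payload to the ingest endpoint; return true only on a 2xx response.
    return false // placeholder
}

fun flushPendingReports(pendingDir: File) {
    val reports = pendingDir.listFiles() ?: return
    for (report in reports) {
        // Delete only after a successful upload so reports survive flaky networks.
        if (upload(report)) report.delete()
    }
}
```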

Symbolication

Native crash stacks are addresses, not function names:

0x100012345
0x100023456

Symbolication maps addresses to source-level frames using debug symbols (dSYMs on iOS; native debug symbols and ProGuard/R8 mapping files on Android).

Symbolication happens server-side after upload. Symbol files are uploaded by the build pipeline, and the server matches them to incoming crashes (typically by build ID and app version).
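
Conceptually, symbolication is a lookup from an instruction address into a table derived from the uploaded symbol files. A simplified sketch with hypothetical types; real pipelines parse dSYM/DWARF or mapping files and also correct for the ASLR slide and inlined frames:

```kotlin
// One entry per function, derived from the uploaded symbol file for a specific build.
data class Symbol(val startAddress: Long, val name: String, val sourceFile: String)

// `symbols` must be sorted by startAddress; find the last symbol at or below the frame.
fun symbolicate(frameAddress: Long, symbols: List<Symbol>): String {
    var lo = 0
    var hi = symbols.size - 1
    var best: Symbol? = null
    while (lo <= hi) {
        val mid = (lo + hi) / 2
        if (symbols[mid].startAddress <= frameAddress) {
            best = symbols[mid]
            lo = mid + 1
        } else {
            hi = mid - 1
        }
    }
    return best?.let { "${it.name} (${it.sourceFile})" } ?: "0x%x".format(frameAddress)
}
```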

Dedup and grouping

Many crashes are the same bug from many users. Group by:

  • Top 3–5 stack frames (after stripping non-app frames)
  • Exception type
  • App version

Each group appears as one issue in the dashboard; the number of affected users is the primary impact metric.
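
A sketch of the grouping signature built from those keys; which frames count as app frames and how many to keep are exactly where real systems differ, so the package prefix and hash here are only illustrative:

```kotlin
import java.security.MessageDigest

data class Frame(val module: String, val function: String)

fun fingerprint(
    frames: List<Frame>,
    exceptionType: String,
    appVersion: String,
    appPackagePrefix: String = "com.example.app",  // hypothetical app package
    topFrames: Int = 5,
): String {
    val appFrames = frames
        .filter { it.module.startsWith(appPackagePrefix) }  // strip non-app frames
        .take(topFrames)
        .joinToString("|") { "${it.module}.${it.function}" }
    val key = "$exceptionType\n$appVersion\n$appFrames"
    val digest = MessageDigest.getInstance("SHA-256").digest(key.toByteArray())
    return digest.joinToString("") { "%02x".format(it) }
}
```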

Server architecture

Three pipelines:

  1. Ingest: high-volume HTTP endpoint, queue events
  2. Process: symbolicate, group, persist
  3. Serve: dashboard queries

Volume scales quickly: the larger crash reporters ingest on the order of a billion events per day. The architecture resembles the Sentry and Datadog ingest pipelines.
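
At that volume the ingest tier should do as little as possible: authenticate, validate, enqueue, return. A sketch of that stage, with a hypothetical EventQueue interface standing in for Kafka or a similar log:

```kotlin
// Hypothetical queue abstraction; in production this would be a Kafka producer or similar.
interface EventQueue {
    fun publish(topic: String, payload: ByteArray)
}

data class IngestResult(val status: Int, val body: String)

const val MAX_PAYLOAD_BYTES = 1 shl 20  // 1 MiB cap, an assumed limit

fun handleIngest(apiKey: String?, payload: ByteArray, queue: EventQueue): IngestResult {
    // Cheap checks only; symbolication and grouping happen asynchronously downstream.
    if (apiKey.isNullOrBlank()) return IngestResult(401, "missing api key")
    if (payload.isEmpty() || payload.size > MAX_PAYLOAD_BYTES) {
        return IngestResult(400, "invalid payload size")
    }
    queue.publish("crash-events", payload)
    return IngestResult(202, "accepted")
}
```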

Storage

  • Hot store: recent events, dashboard-queryable (ClickHouse, Druid)
  • Cold store: archived events for forensic analysis (S3 + Parquet)
  • Metadata DB: issue metadata, user-issue mapping (Postgres)
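
A rough sketch of what lands where, using illustrative record shapes (the field names are assumptions, not any vendor's schema):

```kotlin
import java.time.Instant

// Hot store (ClickHouse/Druid): one row per occurrence, queryable from the dashboard.
data class CrashEvent(
    val issueId: String,        // grouping fingerprint
    val receivedAt: Instant,
    val appVersion: String,
    val osVersion: String,
    val deviceModel: String,
    val userId: String?,        // pseudonymous, for "users affected" counts
)

// Metadata DB (Postgres): one row per issue, mutated by triage actions.
data class Issue(
    val issueId: String,
    val title: String,          // e.g. exception type plus top app frame
    val firstSeen: Instant,
    val lastSeen: Instant,
    val resolved: Boolean,
)

// Cold store (S3 + Parquet): full raw payloads, partitioned by date for forensic analysis.
```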

The dashboard

Engineers want:

  • Sort issues by impact (users affected, occurrences)
  • Filter by app version, OS, device
  • Drill into a single occurrence with breadcrumbs
  • Mark an issue as resolved; detect regressions when it reappears in a later version

Privacy

  • Strip PII from breadcrumbs and custom data automatically (see the sketch after this list)
  • Allow opt-out per user
  • Honor regional regulations (GDPR, CCPA)
  • Never log auth tokens, passwords, or sensitive customer data
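
A minimal sketch of the automatic scrubbing, applied in the SDK before anything touches disk; the regex and key list are illustrative, not exhaustive:

```kotlin
// Illustrative patterns only; real scrubbers also cover phone numbers, IPs, card numbers, etc.
private val EMAIL = Regex("""[\w.+-]+@[\w-]+\.[\w.]+""")
private val SENSITIVE_KEYS = setOf("password", "token", "auth", "secret", "cookie")

fun scrub(breadcrumbData: Map<String, String>): Map<String, String> =
    breadcrumbData.mapValues { (key, value) ->
        when {
            SENSITIVE_KEYS.any { key.contains(it, ignoreCase = true) } -> "[redacted]"
            EMAIL.containsMatchIn(value) -> EMAIL.replace(value, "[email]")
            else -> value
        }
    }
```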

Frequently Asked Questions

Why do some crashes never appear in the dashboard?

Crashes very early in startup, before the SDK has initialized, may never be captured or written to disk. Watchdog terminations on iOS (the OS killing an app whose main thread is blocked for too long) do not raise a catchable signal. Both leave gaps.

How long does symbolication take?

Typically sub-second to a few seconds per crash. If symbol files are very large, or were never uploaded, symbolication can take longer or fail, leaving frames unsymbolicated.

How does crash reporting differ from APM?

Crash reporting captures fatal and non-fatal errors. APM tools (Datadog, New Relic) capture performance metrics and traces. Increasingly the same tools cover both, but they began as separate disciplines.
