Continuous Profiler Low-Level Design: Sampling Profiler, Flame Graph Generation, and Production Safety

A continuous profiler measures where a production service spends CPU time without requiring code changes or a performance incident as a trigger. The goal is always-on, low-overhead visibility into CPU usage that can be queried per service, per host, and per release.

Sampling vs. Instrumentation

Instrumentation inserts timing calls around every function. It gives exact call counts, but it adds overhead proportional to the number of instrumented calls and requires modifying code. Sampling periodically interrupts the running process, captures the current call stack, and records which functions appear. Its overhead is set by the sample rate and the cost of each stack capture, and is independent of how often functions are called. A sampling profiler at 100 Hz typically adds roughly 1–5% CPU overhead, which is usually acceptable in production.
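
The scaling difference can be made concrete with a back-of-the-envelope model. This is a minimal sketch: the per-call and per-sample costs (100 ns and 100 µs) are illustrative assumptions, not measurements, and both function names are hypothetical.

```python
def instrumentation_overhead(calls_per_sec: float, cost_per_call_s: float) -> float:
    """Fraction of one CPU core spent inside timing hooks."""
    return calls_per_sec * cost_per_call_s


def sampling_overhead(sample_hz: float, cost_per_sample_s: float) -> float:
    """Fraction of one CPU core spent capturing stack samples."""
    return sample_hz * cost_per_sample_s


if __name__ == "__main__":
    # Assumed costs, for illustration only.
    hook_cost_s = 100e-9      # 100 ns per instrumented call
    capture_cost_s = 100e-6   # 100 us per stack capture
    for calls in (1e5, 1e6, 5e6):
        frac = instrumentation_overhead(calls, hook_cost_s)
        print(f"{calls:>9.0f} calls/s: instrumentation overhead {frac:.1%}")
    frac = sampling_overhead(100, capture_cost_s)
    print(f"100 Hz sampling: overhead {frac:.1%} at any call rate")
```

Under these assumptions the instrumentation cost grows with traffic while the sampling cost stays fixed, which is the property that makes sampling suitable for always-on use.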

Collection Mechanism

Two implementation approaches:

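  • SIGPROF-based: A CPU-time interval timer delivers SIGPROF to the process every 10 ms of CPU time consumed. The signal handler walks the call stack and records {stack_frames[], thread_id, timestamp}. This is the classic in-process approach, used by Go's runtime CPU profiler (the source of pprof profiles) and by gperftools. Out-of-process samplers such as py-spy (Python) work differently: they read the target's memory from a separate process instead of installing a signal handler.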
  • eBPF-based: A kernel-level program attached to perf events captures stacks across all processes without modifying them. Lower overhead than signal-based collection, no per-language SDK required, and can profile the kernel itself. Used by Pixie and Datadog's continuous profiler.
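
The SIGPROF approach can be sketched in a few lines of Python on a Unix system. This is an illustration under stated assumptions, not a production profiler: it samples only the main thread, the names start_sampling, stop_sampling, and busy_work are hypothetical, and CPython runs the handler between bytecodes rather than in true asynchronous signal context, so the signal-safety constraints of a native handler do not appear here.

```python
import collections
import signal
import time

# stack (root first) -> number of times it was observed
samples = collections.Counter()


def _on_sigprof(signum, frame):
    """Record the interrupted Python stack as a tuple of function names."""
    stack = []
    while frame is not None:
        stack.append(frame.f_code.co_name)
        frame = frame.f_back
    samples[tuple(reversed(stack))] += 1


def start_sampling(interval_s: float = 0.01) -> None:
    """Fire SIGPROF every interval_s seconds of CPU time (10 ms = 100 Hz)."""
    signal.signal(signal.SIGPROF, _on_sigprof)
    signal.setitimer(signal.ITIMER_PROF, interval_s, interval_s)


def stop_sampling() -> None:
    signal.setitimer(signal.ITIMER_PROF, 0, 0)
    # Ignore rather than restore the default action, which would terminate
    # the process if a late SIGPROF were still pending.
    signal.signal(signal.SIGPROF, signal.SIG_IGN)


def busy_work(cpu_seconds: float) -> int:
    """Burn CPU so the CPU-time timer has something to measure."""
    end = time.process_time() + cpu_seconds
    n = 0
    while time.process_time() < end:
        n += 1
    return n


if __name__ == "__main__":
    start_sampling()
    busy_work(0.5)
    stop_sampling()
    for stack, count in samples.most_common(3):
        print(count, ";".join(stack))
```

Because ITIMER_PROF counts CPU time rather than wall-clock time, an idle process produces no samples, which is the desired behavior for a CPU profiler.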

Stack Unwinding

Walking the call stack from within a signal handler requires unwinding stack frames:

  • Frame pointer unwinding: Fast. Follow the chain of saved frame pointers up the stack. Requires the binary to be compiled with -fno-omit-frame-pointer, because most compilers omit frame pointers by default in optimized builds.
  • DWARF unwinding: Reads DWARF debug info to reconstruct frames. Works without frame pointers but is 10–20x slower per unwind. Acceptable for low-frequency sampling.
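
Frame pointer unwinding is simple enough to simulate. The sketch below models stack memory as a dictionary and assumes the common x86-64 layout in which the word at the frame pointer holds the caller's saved frame pointer and the next word holds the return address. The function name and all addresses are hypothetical; a real unwinder reads process memory directly.

```python
from typing import Dict, List

WORD = 8  # bytes per stack slot on a 64-bit target


def unwind_frame_pointers(memory: Dict[int, int], pc: int, fp: int,
                          max_depth: int = 128) -> List[int]:
    """Return [leaf pc, return address, ...] by following saved frame pointers."""
    addresses = [pc]
    while fp != 0 and len(addresses) < max_depth:
        saved_fp = memory.get(fp)
        return_addr = memory.get(fp + WORD)
        if saved_fp is None or return_addr is None:
            break  # unreadable memory: stop instead of faulting
        addresses.append(return_addr)
        if saved_fp != 0 and saved_fp <= fp:
            break  # the stack grows down, so caller frames sit at higher addresses
        fp = saved_fp
    return addresses


if __name__ == "__main__":
    # Simulated stack for main -> handle -> parse (all addresses invented).
    memory = {
        0x7000: 0x7020, 0x7008: 0x401200,  # parse frame: returns into handle
        0x7020: 0x7040, 0x7028: 0x401100,  # handle frame: returns into main
        0x7040: 0x0,    0x7048: 0x401000,  # main frame: end of the chain
    }
    print([hex(a) for a in unwind_frame_pointers(memory, pc=0x401300, fp=0x7000)])
```

The depth limit and the monotonicity check guard against corrupted or cyclic chains, which matters when the unwinder runs inside a signal handler and cannot afford to fault or loop.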

Symbolization

Raw stack samples contain instruction addresses. Symbolization maps each address to a function name using the binary's symbol table or separate debug symbols. JIT runtimes require special handling:

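  • JVM: a /tmp/perf-{pid}.map file maps JIT-compiled addresses to method names. The JVM does not write this file by default; it is typically produced by an agent such as perf-map-agent.
  • V8 (Node.js): --perf-basic-prof makes V8 write the same /tmp/perf-{pid}.map format for JIT-compiled functions. The --prof flag is a separate mechanism: it runs V8's built-in sampling profiler and writes its own log file.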

Symbolization can happen on the host (lower network cost) or centrally (easier symbol management).
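
A perf map file is plain text with one "START SIZE name" entry per line, where START and SIZE are hexadecimal. Address lookup is then a binary search over sorted start addresses. The following is a minimal sketch; the PerfMapSymbolizer class and the sample method names are hypothetical.

```python
import bisect
from typing import List, Optional, Tuple


class PerfMapSymbolizer:
    """Resolve addresses against the contents of a /tmp/perf-{pid}.map file."""

    def __init__(self, text: str) -> None:
        entries: List[Tuple[int, int, str]] = []
        for line in text.splitlines():
            parts = line.split(None, 2)  # the symbol name may contain spaces
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            entries.append((int(parts[0], 16), int(parts[1], 16), parts[2].strip()))
        entries.sort()
        self._entries = entries
        self._starts = [start for start, _, _ in entries]

    def lookup(self, address: int) -> Optional[str]:
        """Return the symbol whose [start, start + size) range covers address."""
        i = bisect.bisect_right(self._starts, address) - 1
        if i < 0:
            return None
        start, size, name = self._entries[i]
        return name if address < start + size else None


if __name__ == "__main__":
    sample_map = (
        "7f3a10000000 40 Lcom/example/Foo;::bar\n"
        "7f3a10000040 80 Lcom/example/Foo;::baz\n"
    )
    sym = PerfMapSymbolizer(sample_map)
    for addr in (0x7F3A10000010, 0x7F3A10000050, 0x7F3A100000C0):
        print(hex(addr), sym.lookup(addr))
```

JIT code can be recompiled or moved while the process runs, so a real symbolizer would need to re-read the file periodically. This sketch treats it as a static snapshot.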

Sample Aggregation

Raw samples are aggregated over a 60-second window. For each unique stack trace observed in the window, count the number of occurrences. The result is a mapping of {stack_trace → sample_count}. This is the input to flame graph rendering. Raw samples are discarded after aggregation — storing them would require prohibitive space.
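
The aggregation step amounts to counting identical stacks. A minimal sketch, assuming each sample is a root-first sequence of function names; the output uses the semicolon-separated "folded" text layout commonly fed to flame graph tools, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence


def aggregate(samples: Iterable[Sequence[str]]) -> Counter:
    """Collapse raw samples from one window into {stack_trace: sample_count}."""
    return Counter(tuple(stack) for stack in samples)


def to_folded(counts: Counter) -> str:
    """Render counts as 'root;child;leaf N' lines, sorted for stable output."""
    return "\n".join(f"{';'.join(stack)} {n}" for stack, n in sorted(counts.items()))


if __name__ == "__main__":
    window = [
        ("main", "handle", "parse"),
        ("main", "handle", "parse"),
        ("main", "handle", "render"),
        ("main", "gc"),
    ]
    print(to_folded(aggregate(window)))
```

The aggregate grows with the number of distinct stacks rather than the number of samples, which is why it is so much cheaper to store than the raw stream.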

Flame Graph

A flame graph visualizes the aggregated profile:

  • X-axis: stack population, sorted alphabetically. It does not represent the passage of time.
  • Y-axis: call depth. The bottom is the entry point; the top is the leaf function.
  • Width: the fraction of samples in which that frame appears at that position in the stack, that is, the share of CPU time attributable to the function and its callees
  • Interactivity: SVG with click-to-zoom and hover tooltips showing exact sample counts and percentages
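
The layout described by these axes can be computed from the aggregated {stack_trace → sample_count} mapping by building a call tree and placing children left to right in alphabetical order. This sketch produces rectangle coordinates only and leaves SVG rendering out; Node, build_tree, and layout are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional, Tuple


@dataclass
class Node:
    name: str
    total: int = 0  # samples in which this frame is on the stack at this position
    children: Dict[str, "Node"] = field(default_factory=dict)


def build_tree(counts: Dict[Tuple[str, ...], int]) -> Node:
    """Merge root-first stacks into a tree with inclusive sample totals."""
    root = Node("all")
    for stack, n in counts.items():
        root.total += n
        node = root
        for frame in stack:
            node = node.children.setdefault(frame, Node(frame))
            node.total += n
    return root


def layout(node: Node, x: float = 0.0, depth: int = 0,
           scale: Optional[int] = None) -> Iterator[Tuple[str, int, float, float]]:
    """Yield (name, depth, x_start, width); x and width are fractions of all samples."""
    if scale is None:
        scale = node.total or 1
    yield node.name, depth, x, node.total / scale
    child_x = x
    for name in sorted(node.children):  # alphabetical order, not time order
        child = node.children[name]
        yield from layout(child, child_x, depth + 1, scale)
        child_x += child.total / scale


if __name__ == "__main__":
    counts = {
        ("main", "handle", "parse"): 2,
        ("main", "handle", "render"): 1,
        ("main", "gc"): 1,
    }
    for name, depth, x, width in layout(build_tree(counts)):
        print(f"{'  ' * depth}{name}: x={x:.2f} width={width:.2f}")
```

Any gap between a parent's width and the sum of its children's widths is time spent in the parent's own code, which is what makes wide plateaus at the top of the graph the first place to look.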

Storage

Aggregated profiles are stored in a columnar store (Parquet files or ClickHouse) keyed by {service, host, time_bucket}. A 60-second time bucket, matching the aggregation window, gives good resolution without excessive storage. Profiles older than 30 days are downsampled to 5-minute buckets for trend analysis.
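
Downsampling is a merge of per-stack counts across the fine-grained buckets that fall into one coarse bucket. A minimal in-memory sketch, assuming bucket start times are epoch seconds; the downsample function and the row layout are hypothetical, and a real system would run the equivalent aggregation inside the store.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str, int]  # (service, host, bucket_start_epoch_s)


def downsample(rows: Iterable[Tuple[Key, Counter]],
               bucket_s: int = 300) -> Dict[Key, Counter]:
    """Merge profiles into bucket_s-wide buckets, summing counts per stack."""
    merged: Dict[Key, Counter] = defaultdict(Counter)
    for (service, host, start), counts in rows:
        coarse_start = start - start % bucket_s
        merged[(service, host, coarse_start)].update(counts)
    return dict(merged)


if __name__ == "__main__":
    parse = ("main", "parse")
    rows = [
        (("api", "h1", 0), Counter({parse: 10})),
        (("api", "h1", 60), Counter({parse: 5})),
        (("api", "h1", 300), Counter({parse: 7})),
    ]
    for key, counts in sorted(downsample(rows).items()):
        print(key, dict(counts))
```

Summing counts keeps relative widths in a flame graph correct for the coarser bucket; only the ability to distinguish activity within the 5-minute span is lost.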

Differential Profiling

Compare profiles from two time ranges or two releases side by side. The differential flame graph highlights functions that consumed more CPU in the “after” profile in red and those that consumed less in blue. This is the primary tool for identifying performance regressions introduced in a deploy: diff the 1-hour window before the deployment against the 1-hour window after it.
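
One way to compute the per-function deltas behind such a view is to compare inclusive sample fractions. This sketch normalizes each profile by its own total so that windows with different sample counts remain comparable; that normalization is an assumption of this example, not something the design mandates, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

Profile = Dict[Tuple[str, ...], int]


def inclusive_fractions(profile: Profile) -> Dict[str, float]:
    """Fraction of samples in which each function appears anywhere on the stack."""
    total = sum(profile.values())
    if total == 0:
        return {}
    hits: Counter = Counter()
    for stack, n in profile.items():
        for fn in set(stack):  # count a recursive function once per sample
            hits[fn] += n
    return {fn: n / total for fn, n in hits.items()}


def diff_profiles(before: Profile, after: Profile) -> Dict[str, float]:
    """Positive delta: larger CPU share after (red). Negative: smaller (blue)."""
    b, a = inclusive_fractions(before), inclusive_fractions(after)
    return {fn: a.get(fn, 0.0) - b.get(fn, 0.0) for fn in set(a) | set(b)}


if __name__ == "__main__":
    before = {("main", "parse"): 50, ("main", "render"): 50}
    after = {("main", "parse"): 75, ("main", "render"): 25}
    for fn, delta in sorted(diff_profiles(before, after).items()):
        color = "red" if delta > 0 else "blue" if delta < 0 else "none"
        print(f"{fn}: {delta:+.2f} ({color})")
```

Comparing shares rather than raw counts means a function can turn blue simply because something else grew, so absolute CPU usage for the two windows should be checked alongside the diff.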

Production Safety

  • Overhead budget: measure the profiler's own CPU consumption; if it exceeds a configurable threshold (e.g., 5%), reduce sample rate automatically
  • Circuit breaker: disable profiling entirely during high-load events (e.g., when host CPU > 90%) to avoid amplifying the problem
  • Signal safety: the SIGPROF handler must only call async-signal-safe functions; no malloc, no locks
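
The first two safeguards can be combined into a small control loop that runs once per measurement interval. This is a sketch under assumptions: the Governor class, the halve-and-double policy, and the 80% reset threshold (added so the breaker does not flap around 90%) are all illustrative choices.

```python
from dataclasses import dataclass, field


@dataclass
class Governor:
    base_hz: int = 100
    min_hz: int = 10
    overhead_budget: float = 0.05  # profiler may use at most 5% CPU
    cpu_trip: float = 0.90         # disable profiling above this host CPU
    cpu_reset: float = 0.80        # assumed hysteresis: re-enable below this
    hz: int = field(init=False)
    tripped: bool = field(init=False, default=False)

    def __post_init__(self) -> None:
        self.hz = self.base_hz

    def update(self, profiler_overhead: float, host_cpu: float) -> int:
        """Return the sample rate for the next interval; 0 means disabled."""
        if host_cpu > self.cpu_trip:
            self.tripped = True
        elif host_cpu < self.cpu_reset:
            self.tripped = False
        if self.tripped:
            return 0
        if profiler_overhead > self.overhead_budget:
            self.hz = max(self.min_hz, self.hz // 2)      # back off quickly
        elif profiler_overhead < self.overhead_budget / 2:
            self.hz = min(self.base_hz, self.hz * 2)      # recover toward baseline
        return self.hz


if __name__ == "__main__":
    g = Governor()
    readings = [(0.02, 0.50), (0.08, 0.50), (0.03, 0.95), (0.03, 0.85), (0.01, 0.50)]
    for overhead, cpu in readings:
        print(f"overhead={overhead:.0%} cpu={cpu:.0%} -> {g.update(overhead, cpu)} Hz")
```

Keeping the breaker state separate from the adaptive rate means that a load spike pauses profiling without discarding what the controller has learned about a sustainable rate.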

Continuous Baseline and Burst Mode

Run at 100 Hz (10 ms interval) always-on for continuous baseline visibility. When an engineer is actively debugging a performance issue, burst mode increases to 1000 Hz for a short window (e.g., 60 seconds) to get higher-resolution data. Burst mode is rate-limited per service to prevent simultaneous high-overhead profiling across many hosts.
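
The per-service limit on burst mode can be modeled as a set of expiring leases. A minimal single-process sketch follows; BurstLimiter and its parameters are hypothetical, and a real deployment would keep this state in a shared coordinator so that the limit holds across hosts.

```python
import time
from typing import Callable, Dict


class BurstLimiter:
    """Allow at most max_hosts_per_service concurrent bursts per service."""

    def __init__(self, max_hosts_per_service: int = 2, burst_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_hosts_per_service
        self._burst_s = burst_s
        self._clock = clock
        self._active: Dict[str, Dict[str, float]] = {}  # service -> host -> expiry

    def try_start(self, service: str, host: str) -> bool:
        """Grant a burst lease if the service is under its limit."""
        now = self._clock()
        active = self._active.setdefault(service, {})
        for expired in [h for h, expiry in active.items() if expiry <= now]:
            del active[expired]
        if host in active:
            return True  # already bursting; do not extend the window
        if len(active) >= self._max:
            return False
        active[host] = now + self._burst_s
        return True

    def sample_hz(self, service: str, host: str,
                  baseline_hz: int = 100, burst_hz: int = 1000) -> int:
        """Rate the host should sample at right now."""
        expiry = self._active.get(service, {}).get(host)
        if expiry is not None and expiry > self._clock():
            return burst_hz
        return baseline_hz


if __name__ == "__main__":
    limiter = BurstLimiter(max_hosts_per_service=1)
    print(limiter.try_start("api", "h1"), limiter.sample_hz("api", "h1"))
    print(limiter.try_start("api", "h2"), limiter.sample_hz("api", "h2"))
```

Leases expire on their own, so a host that crashes mid-burst or an engineer who forgets to stop a session cannot hold the service's burst capacity indefinitely.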
