Continuous Profiler Low-Level Design
A continuous profiler measures where a production service spends CPU time without requiring code changes or a performance incident as a trigger. The goal is always-on, low-overhead visibility into CPU usage that can be queried per service, per host, and per release.
Sampling vs. Instrumentation
Instrumentation inserts timing calls around every function — it is accurate but adds overhead proportional to the number of instrumented calls and requires modifying code. Sampling periodically interrupts the running process, captures the current call stack, and records which functions appear. Overhead is controlled by the sample rate and is independent of call count. A sampling profiler at 100 Hz adds roughly 1–5% CPU overhead, which is acceptable in production.
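As a rough back-of-the-envelope check, sampling overhead can be estimated as the sample rate multiplied by the cost of taking one sample. The sketch below assumes a hypothetical per-sample cost of 100 microseconds (stack walk plus recording). That figure is illustrative, not measured, and the function name is invented for this example.

```python
def sampling_overhead_fraction(sample_hz: float, cost_per_sample_us: float) -> float:
    """Fraction of one CPU core spent taking samples."""
    return sample_hz * cost_per_sample_us / 1_000_000


# Assumed cost: 100 microseconds per sample (illustrative only).
print(f"{sampling_overhead_fraction(100, 100):.1%}")   # baseline at 100 Hz -> 1.0%
print(f"{sampling_overhead_fraction(1000, 100):.1%}")  # at 1000 Hz -> 10.0%
```

The estimate depends only on the sample rate and the per-sample cost, not on how many function calls the service makes, which is the property that distinguishes sampling from instrumentation.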
Collection Mechanism
Two implementation approaches:
- SIGPROF-based: An interval timer makes the OS deliver SIGPROF to the process every 10 ms of consumed CPU time (100 Hz). The signal handler walks the call stack and records {stack_frames[], thread_id, timestamp}. This is the classic in-process approach, used by Go's pprof CPU profiler. (py-spy, often mentioned alongside it, instead samples a Python process from the outside by reading its memory.)
- eBPF-based: A kernel-level program attached to perf events captures stacks across all processes without modifying them. Lower overhead than signal-based collection, no per-language SDK required, and can profile the kernel itself. Used by Pixie and Datadog's continuous profiler.
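The signal-based approach can be sketched in Python on a Unix system with the standard signal module. This is a minimal illustration, not a production profiler. Python runs signal handlers in the main thread between bytecodes, so it only approximates what a native handler does. The names start_sampler, stop_sampler, and busy_work are hypothetical.

```python
import signal
import threading
import time


def start_sampler(samples, interval_s=0.01):
    """Arm a CPU-time interval timer that raises SIGPROF every interval_s."""
    def on_sigprof(signum, frame):
        stack = []
        while frame is not None:          # walk from the leaf to the entry point
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        samples.append({
            "stack_frames": stack[::-1],  # entry point first, leaf last
            "thread_id": threading.get_ident(),
            "timestamp": time.time(),
        })

    signal.signal(signal.SIGPROF, on_sigprof)
    signal.setitimer(signal.ITIMER_PROF, interval_s, interval_s)


def stop_sampler():
    signal.setitimer(signal.ITIMER_PROF, 0, 0)     # disarm the timer
    signal.signal(signal.SIGPROF, signal.SIG_IGN)  # ignore any late signal


def busy_work(samples, want=3, timeout_s=5.0):
    """Hypothetical workload: burn CPU until a few samples have arrived."""
    deadline = time.monotonic() + timeout_s
    n = 0
    while len(samples) < want and time.monotonic() < deadline:
        n += 1
    return n


if __name__ == "__main__":
    collected = []
    start_sampler(collected)
    busy_work(collected)
    stop_sampler()
    print(len(collected), "samples; first:", collected[:1])
```

ITIMER_PROF counts CPU time consumed by the process, so an idle process produces no samples. That is the desired behavior for a CPU profiler.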
Stack Unwinding
Walking the call stack from within a signal handler requires unwinding stack frames:
- Frame pointer unwinding: Fast. Follow the frame pointer chain up the stack. Requires the binary to be compiled with -fno-omit-frame-pointer, because optimized builds often omit frame pointers by default.
- DWARF unwinding: Reads DWARF debug info to reconstruct frames. Works without frame pointers but is 10–20x slower per unwind. Acceptable for low-frequency sampling.
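Frame pointer unwinding can be illustrated with a small simulation. The sketch below models stack memory as a dictionary and assumes an x86-64 style layout, where the word at the frame pointer holds the caller's saved frame pointer and the next word holds the return address. All addresses are invented for the example. A real unwinder reads process memory and must validate every pointer.

```python
def unwind_frame_pointers(memory, frame_pointer, instruction_pointer, max_depth=128):
    """Return code addresses from leaf to entry point.

    memory maps an address to the 8-byte word stored there (simulated).
    Assumed layout: [fp] = caller's saved fp, [fp + 8] = return address.
    """
    addresses = [instruction_pointer]  # the leaf frame is the current instruction
    depth = 0
    while frame_pointer != 0 and depth < max_depth:
        saved_fp = memory.get(frame_pointer)
        return_address = memory.get(frame_pointer + 8)
        if saved_fp is None or return_address is None:
            break  # unreadable memory: stop rather than guess
        addresses.append(return_address)
        # The stack grows down, so caller frames sit at higher addresses.
        # A saved fp that does not increase indicates a corrupt chain.
        if saved_fp != 0 and saved_fp <= frame_pointer:
            break
        frame_pointer = saved_fp
        depth += 1
    return addresses


# Three simulated frames; a saved fp of 0 marks the outermost frame.
stack_memory = {
    0x7000: 0x7020, 0x7008: 0x401200,
    0x7020: 0x7040, 0x7028: 0x401100,
    0x7040: 0x0,    0x7048: 0x401000,
}
print([hex(a) for a in unwind_frame_pointers(stack_memory, 0x7000, 0x401300)])
```

Each step costs two memory reads, which is why this method is fast compared with interpreting DWARF unwind tables.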
Symbolization
Raw stack samples contain instruction addresses. Symbolization maps each address to a function name using the binary's symbol table or separate debug symbols. JIT runtimes require special handling:
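- JVM: the JVM, or an attached agent such as perf-map-agent, writes a /tmp/perf-{pid}.map file mapping JIT-compiled addresses to method names
- V8 (Node.js): the --perf-basic-prof flag writes a perf map file that perf-compatible tools can read; the --prof flag instead enables V8's built-in profiler log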
Symbolization can happen on the host (lower network cost) or centrally (easier symbol management).
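Address-to-name lookup is a range search over a sorted symbol table. The sketch below parses perf map text, where each line is a start address in hex, a size in hex, and a symbol name, then resolves addresses with binary search. The class name and the sample symbols are hypothetical.

```python
import bisect


class SymbolTable:
    """Sorted (start, end, name) ranges with binary-search lookup."""

    def __init__(self, perf_map_text):
        ranges = []
        for line in perf_map_text.splitlines():
            parts = line.split(None, 2)  # "<start-hex> <size-hex> <name>"
            if len(parts) != 3:
                continue                 # skip blank or malformed lines
            start, size = int(parts[0], 16), int(parts[1], 16)
            ranges.append((start, start + size, parts[2]))
        ranges.sort()
        self._ranges = ranges
        self._starts = [r[0] for r in ranges]

    def symbolize(self, address):
        # Find the last range whose start is <= address, then check its end.
        i = bisect.bisect_right(self._starts, address) - 1
        if i >= 0 and address < self._ranges[i][1]:
            return self._ranges[i][2]
        return f"[unknown 0x{address:x}]"


# Hypothetical perf map contents for a JIT-compiled process.
PERF_MAP = """\
7f3a10000000 40 OrderService::price
7f3a10000040 80 OrderService::applyDiscount
"""
table = SymbolTable(PERF_MAP)
print(table.symbolize(0x7F3A10000010))
print(table.symbolize(0x7F3A10000050))
print(table.symbolize(0x7F3A100000C0))
```

Unresolved addresses are kept as placeholders rather than dropped, so the sample counts still add up after symbolization.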
Sample Aggregation
Raw samples are aggregated over a 60-second window. For each unique stack trace observed in the window, count the number of occurrences. The result is a mapping of {stack_trace → sample_count}. This is the input to flame graph rendering. Raw samples are discarded after aggregation — storing them would require prohibitive space.
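The aggregation step amounts to counting identical stacks within a window. A minimal sketch follows, assuming samples shaped like the {stack_frames[], thread_id, timestamp} records described earlier. The function names are hypothetical, and the semicolon-joined "folded" text form is the one commonly fed to flame graph tools.

```python
from collections import Counter


def aggregate(samples, window_start, window_s=60):
    """Count occurrences of each unique stack within one time window."""
    counts = Counter()
    for sample in samples:
        if window_start <= sample["timestamp"] < window_start + window_s:
            counts[tuple(sample["stack_frames"])] += 1
    return counts


def to_folded(counts):
    """Render counts as 'frame;frame;frame count' lines, sorted by stack."""
    return [f"{';'.join(stack)} {n}" for stack, n in sorted(counts.items())]


raw = [
    {"stack_frames": ["main", "handle", "parse"], "thread_id": 1, "timestamp": 5.0},
    {"stack_frames": ["main", "handle", "parse"], "thread_id": 2, "timestamp": 12.5},
    {"stack_frames": ["main", "handle", "render"], "thread_id": 1, "timestamp": 30.0},
    {"stack_frames": ["main", "idle"], "thread_id": 1, "timestamp": 75.0},  # next window
]
for line in to_folded(aggregate(raw, window_start=0)):
    print(line)
```

The output size depends on the number of distinct stacks, not the number of samples, which is what makes the aggregated form cheap to store.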
Flame Graph
A flame graph visualizes the aggregated profile:
- X-axis: stack population, sorted alphabetically rather than in time order, so horizontal position carries no timing information
- Y-axis: call depth; bottom is the entry point, top is the leaf function
- Width: fraction of samples in which that function appeared on the stack, that is, the percentage of CPU time attributable to that function and its callees
- Interactivity: SVG with click-to-zoom and hover tooltips showing exact sample counts and percentages
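The layout behind a flame graph can be sketched as a prefix tree over the aggregated stacks, followed by a pass that assigns each node a horizontal offset and width. The code below is an illustrative sketch with hypothetical function names. It produces rectangles only and leaves SVG rendering out.

```python
def build_flame_tree(stack_counts):
    """Merge stacks into a prefix tree; each node's value is its inclusive count."""
    root = {"name": "root", "value": 0, "children": {}}
    for stack, count in stack_counts.items():
        root["value"] += count
        node = root
        for frame in stack:  # entry point first, leaf last
            child = node["children"].setdefault(
                frame, {"name": frame, "value": 0, "children": {}})
            child["value"] += count
            node = child
    return root


def layout(node, total, x=0.0, depth=0, out=None):
    """Return (name, depth, x_start, width); widths are fractions of all samples."""
    if out is None:
        out = []
    out.append((node["name"], depth, x, node["value"] / total))
    child_x = x
    for name in sorted(node["children"]):  # alphabetical, not time order
        child = node["children"][name]
        layout(child, total, child_x, depth + 1, out)
        child_x += child["value"] / total
    return out


counts = {("main", "parse"): 3, ("main", "render"): 1}
tree = build_flame_tree(counts)
for rect in layout(tree, tree["value"]):
    print(rect)
```

Any width a parent has beyond the sum of its children's widths represents samples where the parent itself was the leaf, that is, its self time.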
Storage
Aggregated profiles are stored in a columnar store (Parquet files or ClickHouse) keyed by {service, host, time_bucket}. A 60-second time bucket, matching the aggregation window, gives good resolution without excessive storage. Profiles older than 30 days are downsampled to 5-minute buckets for trend analysis.
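Downsampling old profiles is a matter of flooring each bucket's start time to the coarser granularity and summing the stack counts that land in the same bucket. The sketch below assumes profiles are held in memory as tuples of (service, host, bucket start in epoch seconds, stack counts). A real system would express the same operation as a query against the columnar store.

```python
from collections import Counter, defaultdict


def downsample(profiles, bucket_s=300):
    """Merge fine-grained profiles into bucket_s-wide buckets by summing counts."""
    merged = defaultdict(Counter)
    for service, host, bucket_start, counts in profiles:
        coarse_start = bucket_start - bucket_start % bucket_s  # floor to bucket
        merged[(service, host, coarse_start)].update(counts)   # adds counts
    return dict(merged)


fine = [
    ("api", "host-1", 0,   Counter({("main", "parse"): 4})),
    ("api", "host-1", 60,  Counter({("main", "parse"): 2, ("main", "render"): 1})),
    ("api", "host-1", 300, Counter({("main", "parse"): 5})),
    ("api", "host-2", 60,  Counter({("main", "parse"): 7})),
]
for key, counts in sorted(downsample(fine).items()):
    print(key, dict(counts))
```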
Differential Profiling
Compare profiles from two time ranges or two releases side by side. The differential flame graph highlights functions that consumed more CPU in the “after” profile (red) and functions that consumed less (blue). This is the primary tool for identifying performance regressions introduced in a deploy: run a diff between the 1-hour windows before and after the deployment.
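A differential profile can be computed by normalizing each profile to fractions of its own total, so that windows with different sample counts stay comparable, and then subtracting per function. The sketch below uses inclusive counts (a function is credited whenever it appears anywhere on the stack). The function names are hypothetical, and the color mapping follows the red and blue convention described above.

```python
from collections import Counter


def inclusive_fractions(stack_counts):
    """Fraction of samples in which each function appears on the stack."""
    total = sum(stack_counts.values())
    if total == 0:
        return {}
    per_function = Counter()
    for stack, n in stack_counts.items():
        for fn in set(stack):  # count a recursive function once per sample
            per_function[fn] += n
    return {fn: n / total for fn, n in per_function.items()}


def diff_profiles(before, after):
    """Map function -> (delta in fraction of samples, color)."""
    b, a = inclusive_fractions(before), inclusive_fractions(after)
    result = {}
    for fn in set(a) | set(b):
        delta = a.get(fn, 0.0) - b.get(fn, 0.0)
        color = "red" if delta > 0 else "blue" if delta < 0 else "neutral"
        result[fn] = (delta, color)
    return result


before = {("main", "parse"): 50, ("main", "render"): 50}
after = {("main", "parse"): 160, ("main", "render"): 40}  # more samples overall
for fn, (delta, color) in sorted(diff_profiles(before, after).items()):
    print(f"{fn}: {delta:+.0%} {color}")
```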
Production Safety
- Overhead budget: measure the profiler's own CPU consumption; if it exceeds a configurable threshold (e.g., 5%), reduce sample rate automatically
- Circuit breaker: disable profiling entirely during high-load events (e.g., when host CPU > 90%) to avoid amplifying the problem
- Signal safety: the SIGPROF handler must only call async-signal-safe functions; no malloc, no locks
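The first two safeguards can be combined into one control function that runs periodically and decides the next sample rate. The sketch below is one possible policy, not a prescribed one. The halving and doubling steps, the minimum rate, and the function name are assumptions made for illustration. The 5% and 90% thresholds come from the examples above.

```python
def next_sample_rate(current_hz, profiler_cpu_pct, host_cpu_pct,
                     overhead_budget_pct=5.0, host_cpu_limit_pct=90.0,
                     base_hz=100, min_hz=10):
    """Return the sample rate to use for the next control interval (0 = off)."""
    if host_cpu_pct > host_cpu_limit_pct:
        return 0                             # circuit breaker: stop profiling
    if current_hz == 0:
        return min_hz                        # resume cautiously after a trip
    if profiler_cpu_pct > overhead_budget_pct:
        return max(min_hz, current_hz // 2)  # over budget: back off (assumed halving)
    return min(base_hz, current_hz * 2)      # healthy: recover toward baseline


print(next_sample_rate(100, profiler_cpu_pct=1.0, host_cpu_pct=50.0))
print(next_sample_rate(100, profiler_cpu_pct=6.0, host_cpu_pct=50.0))
print(next_sample_rate(100, profiler_cpu_pct=1.0, host_cpu_pct=95.0))
```

Checking the circuit breaker first means a saturated host always wins over the overhead budget, which matches the goal of not amplifying an ongoing incident.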
Continuous Baseline and Burst Mode
Run at 100 Hz (10 ms interval) always-on for continuous baseline visibility. When an engineer is actively debugging a performance issue, burst mode increases to 1000 Hz for a short window (e.g., 60 seconds) to get higher-resolution data. Burst mode is rate-limited per service to prevent simultaneous high-overhead profiling across many hosts.
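Burst-mode rate limiting can be sketched as a small controller that tracks, per service, which hosts currently hold a burst lease. The class name, the limit of concurrent hosts per service, and the lease model are assumptions made for illustration. The 100 Hz, 1000 Hz, and 60-second values come from the text above.

```python
class BurstLimiter:
    """Grants time-limited burst leases, capped per service."""

    def __init__(self, max_hosts_per_service=2, burst_s=60,
                 base_hz=100, burst_hz=1000):
        self.max_hosts_per_service = max_hosts_per_service
        self.burst_s = burst_s
        self.base_hz = base_hz
        self.burst_hz = burst_hz
        self._leases = {}  # service -> {host: expiry time in seconds}

    def request_burst(self, service, host, now):
        """Return True if the host may profile at burst rate starting now."""
        leases = self._leases.setdefault(service, {})
        for expired in [h for h, expiry in leases.items() if expiry <= now]:
            del leases[expired]                       # drop finished bursts
        if host in leases:
            return True                               # already bursting
        if len(leases) >= self.max_hosts_per_service:
            return False                              # service is at its cap
        leases[host] = now + self.burst_s
        return True

    def sample_rate(self, service, host, now):
        expiry = self._leases.get(service, {}).get(host)
        bursting = expiry is not None and now < expiry
        return self.burst_hz if bursting else self.base_hz


limiter = BurstLimiter(max_hosts_per_service=1)
print(limiter.request_burst("api", "host-1", now=0))   # granted
print(limiter.request_burst("api", "host-2", now=10))  # denied: cap reached
print(limiter.sample_rate("api", "host-1", now=30))    # burst rate
print(limiter.sample_rate("api", "host-1", now=60))    # back to baseline
```

Because leases expire on their own, a forgotten debugging session falls back to the baseline rate without anyone having to switch burst mode off.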