Continuous Profiler Low-Level Design
A continuous profiler measures where a production service spends CPU time without requiring code changes or a performance incident as a trigger. The goal is always-on, low-overhead visibility into CPU usage that can be queried per service, per host, and per release.
Sampling vs. Instrumentation
Instrumentation inserts timing calls around every function — it is accurate but adds overhead proportional to the number of instrumented calls and requires modifying code. Sampling periodically interrupts the running process, captures the current call stack, and records which functions appear. Overhead is controlled by the sample rate and is independent of call count. A sampling profiler at 100 Hz adds roughly 1–5% CPU overhead, which is acceptable in production.
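The overhead/resolution trade-off can be sanity-checked with a back-of-envelope model: treat each sample as an independent Bernoulli trial that lands in a given function with probability equal to that function's CPU share. A sketch (the function name and its defaults are illustrative, not from the text above):

```python
import math

def sampling_error(rate_hz: float, duration_s: float, cpu_fraction: float):
    """Estimate how well a sampling profiler resolves a function that uses
    cpu_fraction of total CPU. Returns (expected hits, relative std error).
    Binomial model: stderr/mean = sqrt((1 - p) / (n * p))."""
    n = int(rate_hz * duration_s)            # total samples collected
    hits = int(n * cpu_fraction)             # expected samples in the function
    rel_err = math.sqrt((1 - cpu_fraction) / (n * cpu_fraction))
    return hits, rel_err

# A function using 1% of CPU, sampled at 100 Hz for 60 s:
# ~60 hits, resolved to within ~13% relative error.
hits, err = sampling_error(100, 60, 0.01)
```

This is why the always-on rate can stay low: even a 1%-of-CPU function becomes statistically visible within a one-minute window.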
Collection Mechanism
Two implementation approaches:
- SIGPROF-based: The OS delivers SIGPROF to the process every 10 ms. The signal handler walks the call stack and records {stack_frames[], thread_id, timestamp}. This is the classic in-process approach used by Go's pprof; py-spy achieves the same effect for Python from outside the process by reading its memory.
- eBPF-based: A kernel-level program attached to perf events captures stacks across all processes without modifying them. Lower overhead than signal-based collection, no per-language SDK required, and it can profile the kernel itself. Used by Pixie and Datadog's continuous profiler.
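The SIGPROF path can be sketched in-process with Python's signal module. This is illustrative only — a real handler must be async-signal-safe, which the counter update below is not; Python gets away with it because the interpreter defers signal handlers to bytecode boundaries. Assumes a Unix platform (SIGPROF and ITIMER_PROF do not exist on Windows):

```python
import signal
import time
import traceback
from collections import Counter

samples = Counter()  # {stack-frame tuple -> count}

def on_sigprof(signum, frame):
    # Record the interrupted stack, outermost frame first.
    stack = tuple(fs.name for fs in traceback.extract_stack(frame))
    samples[stack] += 1  # not async-signal-safe in C; OK in Python's deferred handlers

def start_sampling(rate_hz=100):
    signal.signal(signal.SIGPROF, on_sigprof)
    interval = 1.0 / rate_hz  # 10 ms at 100 Hz
    # ITIMER_PROF counts CPU time, so idle threads are not sampled.
    signal.setitimer(signal.ITIMER_PROF, interval, interval)

def stop_sampling():
    signal.setitimer(signal.ITIMER_PROF, 0, 0)

start_sampling(rate_hz=200)
t0 = time.process_time()
while time.process_time() - t0 < 0.3:  # burn CPU so the profiling timer fires
    pass
stop_sampling()
total = sum(samples.values())  # roughly rate_hz * cpu_seconds samples
```

Because the timer counts CPU time rather than wall time, a sleeping process generates no samples — the overhead scales with actual CPU use.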
Stack Unwinding
Walking the call stack from within a signal handler requires unwinding stack frames:
- Frame pointer unwinding: Fast — follow the frame pointer chain up the stack. Requires the binary to be compiled with -fno-omit-frame-pointer (optimized builds omit frame pointers by default).
- DWARF unwinding: Reads DWARF debug info to reconstruct frames. Works without frame pointers but is 10–20x slower per unwind. Acceptable for low-frequency sampling.
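Frame-pointer unwinding is just a linked-list walk: each frame stores a pointer to its caller's frame. Python's frame objects expose an analogous chain via f_back, which makes for a compact illustration of the walk (the function names are invented for the example):

```python
import sys

def unwind(frame):
    """Follow the frame chain from the given frame up to the entry point —
    the same walk a frame-pointer unwinder does over saved base pointers."""
    names = []
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back  # analogous to dereferencing the saved frame pointer
    return names

def leaf():
    return unwind(sys._getframe())

def caller():
    return leaf()

stack = caller()  # leaf-first: starts with ['leaf', 'caller', ...]
```

In native code the same loop reads the saved base-pointer chain, which is why omitting frame pointers breaks this fast path and forces DWARF unwinding.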
Symbolization
Raw stack samples contain instruction addresses. Symbolization maps each address to a function name using the binary's symbol table or separate debug symbols. JIT runtimes require special handling:
- JVM: with a helper such as perf-map-agent, the JVM emits a /tmp/perf-{pid}.map file mapping JIT-compiled code addresses to method names
- V8 (Node.js): the --perf-basic-prof flag writes the same perf map file so that perf and eBPF profilers can resolve JIT frames; V8's own --prof flag produces a separate tick log for its built-in profiler
Symbolization can happen on the host (lower network cost) or centrally (easier symbol management).
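A perf map file is just lines of START SIZE NAME with hex addresses, so symbolization reduces to a sorted-interval lookup. A minimal sketch with synthetic map entries (the addresses and method names are made up):

```python
import bisect

def load_perf_map(lines):
    """Parse perf map lines of the form 'START SIZE name' (hex addresses)
    into a list of (start, size, name) sorted by start address."""
    entries = []
    for line in lines:
        start, size, name = line.split(maxsplit=2)
        entries.append((int(start, 16), int(size, 16), name.strip()))
    entries.sort()
    return entries

def symbolize(entries, addr):
    """Map an instruction address to a symbol name via binary search."""
    starts = [s for s, _, _ in entries]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = entries[i]
        if start <= addr < start + size:
            return name
    return f"0x{addr:x}"  # unresolved: keep the raw address

m = load_perf_map([
    "7f0000001000 40 java.lang.String::hashCode",
    "7f0000002000 80 com.example.Handler::handle",
])
name = symbolize(m, 0x7f0000001010)  # -> "java.lang.String::hashCode"
```

A production symbolizer would precompute the starts array once per map rather than per lookup, and reload the map when the JIT recompiles methods.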
Sample Aggregation
Raw samples are aggregated over a 60-second window. For each unique stack trace observed in the window, count the number of occurrences. The result is a mapping of {stack_trace → sample_count}. This is the input to flame graph rendering. Raw samples are discarded after aggregation — storing them would require prohibitive space.
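The aggregation step is a straightforward count over the window; a sketch with invented stacks:

```python
from collections import Counter

def aggregate(samples):
    """Collapse raw per-sample stacks into {stack_trace -> sample_count}
    for one window; raw samples can be discarded afterwards."""
    window = Counter()
    for stack in samples:          # each stack is a tuple of frame names
        window[tuple(stack)] += 1
    return dict(window)

raw = [
    ("main", "handle", "parse"),
    ("main", "handle", "parse"),
    ("main", "handle", "encode"),
]
profile = aggregate(raw)
# {('main', 'handle', 'parse'): 2, ('main', 'handle', 'encode'): 1}
```

The space win is what makes continuous profiling affordable: 6,000 raw samples per minute at 100 Hz collapse into however many unique stacks the service actually exercises.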
Flame Graph
A flame graph visualizes the aggregated profile:
- X-axis: stack population (sorted alphabetically, not time order) — width of each bar represents the fraction of samples in which that function appeared
- Y-axis: call depth — bottom is the entry point, top is the leaf function
- Width: percentage of CPU time attributable to that function and its callees
- Interactivity: SVG with click-to-zoom and hover tooltips showing exact sample counts and percentages
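Flame graph renderers such as flamegraph.pl and speedscope commonly accept the folded stack format — one line per unique stack, frames joined by semicolons, followed by the sample count. Converting the aggregated mapping takes a few lines (input data invented for illustration):

```python
def to_folded(profile):
    """Serialize {stack_trace -> count} to the folded stack format:
    'root;child;leaf count' per line, root frame first."""
    lines = []
    for stack, count in sorted(profile.items()):
        lines.append(";".join(stack) + f" {count}")
    return "\n".join(lines)

folded = to_folded({
    ("main", "handle", "parse"): 2,
    ("main", "handle", "encode"): 1,
})
# main;handle;encode 1
# main;handle;parse 2
```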
Storage
Aggregated profiles are stored in a columnar format (Parquet or ClickHouse) keyed by {service, host, time_bucket}. A 30-second time bucket granularity gives good resolution without excessive storage. Profiles older than 30 days are downsampled to 5-minute buckets for trend analysis.
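The 30-day downsampling step amounts to re-keying buckets to coarser boundaries and summing per-stack counts; a sketch (timestamps invented):

```python
def downsample(buckets, new_s=300):
    """Merge fine-grained profile buckets into new_s-second buckets by
    aligning each timestamp down to the coarser boundary and summing
    per-stack sample counts."""
    merged = {}
    for ts, profile in buckets:          # ts = bucket start, epoch seconds
        new_ts = ts - (ts % new_s)       # align down to the 5-minute boundary
        dest = merged.setdefault(new_ts, {})
        for stack, count in profile.items():
            dest[stack] = dest.get(stack, 0) + count
    return merged

out = downsample([
    (300000, {("main", "f"): 4}),
    (300030, {("main", "f"): 6}),
])
# one 5-minute bucket at ts=300000 holding {("main", "f"): 10}
```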
Differential Profiling
Compare profiles from two time ranges or two releases side by side. The differential flame graph highlights functions that consumed more CPU in the “after” profile (red) and less (blue). This is the primary tool for identifying performance regressions introduced in a deploy — run a diff between the 1-hour window before and after the deployment.
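The core of the differential view is a per-stack delta of sample counts between the two windows (for windows of different lengths or sample rates, counts should first be normalized to fractions — omitted here). A sketch with invented profiles:

```python
def diff_profiles(before, after):
    """Per-stack delta of sample counts. Positive means more CPU in the
    'after' window (rendered red); negative means less (rendered blue)."""
    deltas = {}
    for stack in set(before) | set(after):
        d = after.get(stack, 0) - before.get(stack, 0)
        if d != 0:
            deltas[stack] = d
    return deltas

delta = diff_profiles(
    {("main", "parse"): 100, ("main", "encode"): 50},
    {("main", "parse"): 180, ("main", "encode"): 50, ("main", "retry"): 20},
)
# {('main', 'parse'): 80, ('main', 'retry'): 20} — 'encode' is unchanged
```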
Production Safety
- Overhead budget: measure the profiler's own CPU consumption; if it exceeds a configurable threshold (e.g., 5%), reduce sample rate automatically
- Circuit breaker: disable profiling entirely during high-load events (e.g., when host CPU > 90%) to avoid amplifying the problem
- Signal safety: the SIGPROF handler must only call async-signal-safe functions; no malloc, no locks
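The overhead budget and circuit breaker combine naturally into one periodic feedback step; a sketch where the thresholds match the bullets above and the ramp-up policy (doubling back toward baseline) is an assumption:

```python
def adjust_rate(rate_hz, profiler_cpu_frac, host_cpu_frac,
                budget=0.05, breaker=0.90, min_hz=10, base_hz=100):
    """One control-loop tick. Returns the new sample rate in Hz
    (0 means profiling is disabled until a later tick recovers)."""
    if host_cpu_frac > breaker:
        return 0                              # circuit breaker: stand down
    if profiler_cpu_frac > budget:
        return max(min_hz, rate_hz // 2)      # over budget: halve the rate
    # Healthy: ramp back toward the always-on baseline.
    return min(base_hz, rate_hz * 2) if rate_hz else min_hz
```

For example, 95% host CPU trips the breaker (rate 0) regardless of profiler overhead, while 8% profiler CPU at 100 Hz drops the rate to 50 Hz for the next interval.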
Continuous Baseline and Burst Mode
Run at 100 Hz (10 ms interval) always-on for continuous baseline visibility. When an engineer is actively debugging a performance issue, burst mode increases to 1000 Hz for a short window (e.g., 60 seconds) to get higher-resolution data. Burst mode is rate-limited per service to prevent simultaneous high-overhead profiling across many hosts.
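Per-service burst rate limiting can be as simple as capping concurrent burst sessions; the class below and its max_active knob are illustrative, not from the text:

```python
import time

class BurstLimiter:
    """Cap concurrent burst-mode sessions per service so one team cannot
    push many hosts to 1000 Hz at once. max_active is an assumed knob."""

    def __init__(self, max_active=5, burst_s=60):
        self.max_active = max_active
        self.burst_s = burst_s
        self.active = {}  # service -> list of session expiry timestamps

    def try_start(self, service, now=None):
        """Admit a burst session if the service is under its cap."""
        now = time.time() if now is None else now
        live = [t for t in self.active.get(service, []) if t > now]
        if len(live) >= self.max_active:
            self.active[service] = live
            return False                    # denied: too many concurrent bursts
        live.append(now + self.burst_s)     # session expires after burst_s
        self.active[service] = live
        return True

lim = BurstLimiter(max_active=2, burst_s=60)
```

Sessions expire on their own after the burst window, so a crashed debugging session cannot pin a host at the high rate.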