When a production system is slow or unresponsive, you need to quickly identify the bottleneck: is it CPU, memory, disk I/O, network, or the application itself? Linux provides powerful performance tools for each resource. Knowing these tools is essential for SRE interviews, on-call troubleshooting, and any backend engineering role. This guide covers the tools every engineer should know, organized by the resource they diagnose.
CPU Analysis
top/htop: the first tool to run. Shows real-time CPU usage per process. Key columns: %CPU (CPU utilization), %MEM (memory), and state (S=sleeping, R=running, D=uninterruptible sleep, usually I/O wait). If one process is at 100% CPU, it is the likely bottleneck. If many processes are at high CPU, the system is overloaded. htop adds color coding, a tree view (parent-child process relationships), and per-core CPU utilization.

Load average: shown by top and uptime as three numbers: the 1-minute, 5-minute, and 15-minute averages. Load is the number of processes in the running and uninterruptible-sleep states. On a 4-core system, a load of 4.0 means fully utilized; 8.0 means overloaded (processes are queuing).

perf: the most powerful CPU profiler on Linux. perf top gives real-time function-level CPU profiling (which functions consume the most CPU cycles). perf record plus perf report records a CPU profile over a period and analyzes it. perf stat reports high-level CPU statistics (instructions per cycle, cache misses, branch mispredictions). Use perf when top shows high CPU but you need to know which function or code path is responsible.
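The load-per-core rule above can be scripted as a quick sanity check. A minimal sketch, assuming a Linux /proc filesystem and GNU coreutils (nproc):

```shell
# Compare the 1-minute load average against the core count.
# /proc/loadavg's first field is the 1-minute load average.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
echo "cores=${cores} load1=${load1}"
# Per-core load > 1.0 means processes are queuing for CPU.
awk -v l="$load1" -v c="$cores" \
  'BEGIN { printf "load per core: %.2f (%s)\n", l/c, (l > c) ? "OVERLOADED" : "ok" }'
```

The same ratio is what monitoring systems typically alert on, since a raw load number is meaningless without the core count.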
Memory Analysis
free -h: shows total, used, free, buffer/cache, and available memory. The "available" column is what matters: it is the memory available for new allocations (free plus reclaimable cache). If available is near zero, the system is under memory pressure and may start swapping.

vmstat 1: reports memory, swap, I/O, and CPU statistics every second. Key columns: si/so (swap in/out; non-zero means swapping, which kills performance), free (free memory), and cache (filesystem cache). If si/so are consistently non-zero, the system needs more RAM or a memory leak is consuming it.

/proc/meminfo: detailed memory breakdown. Check MemAvailable (usable memory), Buffers (filesystem metadata cache), Cached (file content cache), and SwapFree.

OOM killer: when Linux runs out of memory, the OOM killer selects and kills a process. Check dmesg for "Out of memory: Kill process". The victim is chosen by oom_score, based primarily on memory usage. For critical processes, set oom_score_adj to -1000 to exempt them from killing (but fix the actual memory issue).

Memory leak detection: monitor the RSS (resident set size) of your process over time. If it grows monotonically, there is a leak. Use valgrind (C/C++), a heap dump via jmap (Java), or tracemalloc (Python) to identify the leaking allocation.
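The "available is what matters" rule can be checked directly from /proc/meminfo. A minimal sketch; the 10% threshold is an arbitrary choice for illustration, and the MemAvailable field requires kernel 3.14 or later:

```shell
# Flag memory pressure when MemAvailable drops below 10% of MemTotal.
# Both fields in /proc/meminfo are reported in kB.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
pct=$(( avail_kb * 100 / total_kb ))
echo "available: ${pct}% of RAM"
if [ "$pct" -lt 10 ]; then
  echo "WARNING: memory pressure, expect swapping or OOM kills"
fi
```

This is the same calculation "free" performs; reading /proc/meminfo directly is handy on minimal containers where free is not installed.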
Disk I/O Analysis
iostat -xz 1: the primary disk I/O tool. Key columns: r/s and w/s (reads/writes per second), r_await and w_await (average latency per read/write in milliseconds; expect under 1 ms for an SSD, 5-15 ms for an HDD), and %util (percentage of time the device is busy; 100% means saturated). If %util is near 100% and await is high, the disk is the bottleneck. Solutions: move to faster storage (NVMe SSD), reduce I/O (add caching, optimize queries), or distribute I/O (RAID, multiple disks).

iotop: like top, but for disk I/O. Shows which processes are reading and writing the most. Run iotop -oPa to see accumulated I/O per process.

lsof: lists open files. lsof -p PID shows all files a process has open. Useful for finding which files a process is reading or writing, detecting file descriptor leaks, and identifying lock files.

df -h: checks disk space. A full disk causes application failures (logs cannot be written, temp files cannot be created, databases crash). Monitor disk usage and alert at 80%.
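The 80% alerting rule for disk space can be sketched with GNU df's --output option; the threshold and message format here are illustrative:

```shell
# Alert on any mounted filesystem at or above 80% usage.
# --output=pcent,target prints just the use% and mount point columns.
df --output=pcent,target | tail -n +2 | while read -r pcent target; do
  usage=${pcent%\%}   # strip the trailing % sign
  if [ "$usage" -ge 80 ]; then
    echo "ALERT: ${target} at ${pcent}"
  fi
done
echo "disk check complete"
```

Dropped into cron, a script like this catches the slow-growing "logs filled the disk" failure before it takes the application down.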
Network Analysis
ss -tlnp: shows listening TCP sockets (which ports are open and which process owns them). It replaces the older netstat. ss -s shows connection statistics (total established, time-wait, and close-wait connections). High TIME_WAIT means many short-lived connections are being created and closed; consider connection pooling. High CLOSE_WAIT means the application is not closing connections properly (a socket leak).

tcpdump: captures and inspects network packets. tcpdump -i eth0 port 5432 captures PostgreSQL traffic; tcpdump -i any host 10.0.1.5 captures all traffic to or from a specific host. Use -w file.pcap to save a capture for analysis in Wireshark. Useful for debugging connection timeouts, verifying TLS handshakes, inspecting HTTP request/response content, and detecting network-level issues.

curl with timing: curl -o /dev/null -s -w "dns: %{time_namelookup}s, connect: %{time_connect}s, tls: %{time_appconnect}s, first_byte: %{time_starttransfer}s, total: %{time_total}s" https://api.example.com. This breaks latency down into DNS, TCP connect, TLS, server processing (first byte), and total, and identifies whether the latency is in the network or the server.
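The curl timing one-liner is easier to reuse as a --write-out template file. A sketch; the format variables are curl's documented --write-out fields, and the endpoint in the usage comment is a placeholder:

```shell
# Write a reusable timing template for curl's -w/--write-out flag.
# Quoting the heredoc delimiter stops the shell expanding %{...}.
cat > curl-format.txt <<'EOF'
dns:        %{time_namelookup}s
connect:    %{time_connect}s
tls:        %{time_appconnect}s
first_byte: %{time_starttransfer}s
total:      %{time_total}s
EOF
# Usage (placeholder endpoint):
#   curl -o /dev/null -s -w "@curl-format.txt" https://api.example.com
cat curl-format.txt
```

Reading the format from a file with @curl-format.txt keeps the one-liner short and gives every engineer on the team identical latency breakdowns.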
Flame Graphs
Flame graphs are the most effective visualization for understanding where CPU time is spent. Created by Brendan Gregg at Netflix. A flame graph shows the call stack: the x-axis is the sampled stack population (wider means more CPU time), the y-axis is stack depth (callers below, callees above). To read one, find the widest "plateau": a function that consumes significant CPU and does not delegate to children. That function is the bottleneck.

Generating a flame graph: (1) Record a CPU profile: perf record -F 99 -p PID -g -- sleep 30 (sample at 99 Hz for 30 seconds). (2) Generate the flame graph: perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg. (3) Open profile.svg in a browser. The SVG is interactive: click to zoom into a subtree.

Language-specific profilers: for Java, use async-profiler (attaches to a running JVM via jattach, no restart needed). For Go, use pprof (built in: go tool pprof http://localhost:6060/debug/pprof/profile). For Node.js, use the --prof flag or clinic.js. For Python, use py-spy (samples a running process without instrumentation).

Flame graphs are the fastest way to answer "why is this service slow?" and should be the first step in any CPU investigation after top identifies the hot process.
Production Debugging Playbook
When a service is degraded, follow this checklist:

(1) uptime: check the load average. If load >> CPU cores, the system is overloaded.
(2) dmesg | tail: check for kernel errors (OOM kills, disk errors, network issues).
(3) free -h: check available memory. Near zero means memory pressure.
(4) vmstat 1: check for swapping (si/so non-zero), CPU wait (wa > 5%), and the run queue (r >> cores).
(5) iostat -xz 1: check disk utilization and latency. %util near 100% means a disk bottleneck.
(6) ss -s: check for connection issues (high TIME_WAIT, CLOSE_WAIT).
(7) top/htop: identify the hot process.
(8) strace -p PID -c: trace system calls for the hot process. Shows what the process is actually doing (reading files, waiting for the network, sleeping). The -c flag summarizes by system call type.
(9) If CPU is the bottleneck: generate a flame graph to identify the hot code path.
(10) If I/O is the bottleneck: check iostat for which disk, iotop for which process, and lsof for which files.

This systematic approach isolates the bottleneck in under 5 minutes. Practice it on a staging system so it is automatic during a real incident.
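The first few read-only checks can be chained into a single triage script. A sketch that skips any tool that is not installed; the sample counts are kept short for illustration:

```shell
# Run the checklist's read-only commands back to back.
# dmesg is omitted because it usually needs root.
for cmd in "uptime" "free -h" "vmstat 1 3" "iostat -xz 1 3" "ss -s"; do
  tool=${cmd%% *}                       # first word is the binary name
  if command -v "$tool" >/dev/null 2>&1; then
    echo "=== $cmd ==="
    $cmd
  else
    echo "=== $tool not installed, skipping ==="
  fi
done
```

Keeping a script like this on every host means the first minute of an incident produces the same evidence regardless of who is on call.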