When a production system is slow or unresponsive, you need to quickly identify the bottleneck: is it CPU, memory, disk I/O, network, or the application itself? Linux provides powerful performance tools for each resource. Knowing these tools is essential for SRE interviews, on-call troubleshooting, and any backend engineering role. This guide covers the tools every engineer should know, organized by the resource they diagnose.
CPU Analysis
top/htop: the first tool to run. Shows real-time CPU usage per process. Key columns: %CPU (CPU utilization), %MEM (memory), and state (S=sleeping, R=running, D=uninterruptible sleep, usually I/O wait). If one process is at 100% CPU, it is the likely bottleneck. If many processes are at high CPU, the system is overloaded. htop adds color coding, a tree view (parent-child process relationships), and per-core CPU utilization.

Load average: shown by top and uptime as three numbers: the 1-minute, 5-minute, and 15-minute averages. Load is the number of processes in the running and uninterruptible-sleep states. On a 4-core system, a load of 4.0 means fully utilized; 8.0 means overloaded (processes are queuing).

perf: the most powerful CPU profiler on Linux. perf top gives real-time function-level CPU profiling (which functions consume the most CPU cycles). perf record plus perf report records a CPU profile over a period and analyzes it. perf stat reports high-level CPU statistics (instructions per cycle, cache misses, branch mispredictions). Use perf when top shows high CPU but you need to know which function or code path is responsible.
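The load-per-core rule above can be scripted as a quick sanity check. A minimal sketch, assuming a Linux /proc filesystem and GNU coreutils (nproc):

```shell
# Compare the 1-minute load average against the core count.
# /proc/loadavg's first field is the 1-minute load average.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
echo "cores=${cores} load1=${load1}"
# Per-core load > 1.0 means processes are queuing for CPU.
awk -v l="$load1" -v c="$cores" \
  'BEGIN { printf "load per core: %.2f (%s)\n", l/c, (l > c) ? "OVERLOADED" : "ok" }'
```

The same ratio is what monitoring systems typically alert on, since a raw load number is meaningless without the core count.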
Memory Analysis
free -h: shows total, used, free, buffer/cache, and available memory. The "available" column is what matters: it is the memory available for new allocations (free plus reclaimable cache). If available is near zero, the system is under memory pressure and may start swapping.

vmstat 1: reports memory, swap, I/O, and CPU statistics every second. Key columns: si/so (swap in/out; non-zero means swapping, which kills performance), free (free memory), and cache (filesystem cache). If si/so are consistently non-zero, the system needs more RAM or a memory leak is consuming it.

/proc/meminfo: detailed memory breakdown. Check MemAvailable (usable memory), Buffers (filesystem metadata cache), Cached (file content cache), and SwapFree.

OOM killer: when Linux runs out of memory, the OOM killer selects and kills a process. Check dmesg for "Out of memory: Kill process". The victim is chosen by oom_score, based primarily on memory usage. For critical processes, set oom_score_adj to -1000 to exempt them from killing (but fix the actual memory issue).

Memory leak detection: monitor the RSS (resident set size) of your process over time. If it grows monotonically, there is a leak. Use valgrind (C/C++), a heap dump via jmap (Java), or tracemalloc (Python) to identify the leaking allocation.
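The "available is what matters" rule can be checked directly from /proc/meminfo. A minimal sketch; the 10% threshold is an arbitrary choice for illustration, and the MemAvailable field requires kernel 3.14 or later:

```shell
# Flag memory pressure when MemAvailable drops below 10% of MemTotal.
# Both fields in /proc/meminfo are reported in kB.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
pct=$(( avail_kb * 100 / total_kb ))
echo "available: ${pct}% of RAM"
if [ "$pct" -lt 10 ]; then
  echo "WARNING: memory pressure, expect swapping or OOM kills"
fi
```

This is the same calculation "free" performs; reading /proc/meminfo directly is handy on minimal containers where free is not installed.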
Disk I/O Analysis
iostat -xz 1: the primary disk I/O tool. Key columns: r/s and w/s (reads/writes per second), r_await and w_await (average latency per read/write in milliseconds; expect under 1 ms for an SSD, 5-15 ms for an HDD), and %util (percentage of time the device is busy; 100% means saturated). If %util is near 100% and await is high, the disk is the bottleneck. Solutions: move to faster storage (NVMe SSD), reduce I/O (add caching, optimize queries), or distribute I/O (RAID, multiple disks).

iotop: like top, but for disk I/O. Shows which processes are reading and writing the most. Run iotop -oPa to see accumulated I/O per process.

lsof: lists open files. lsof -p PID shows all files a process has open. Useful for finding which files a process is reading or writing, detecting file descriptor leaks, and identifying lock files.

df -h: checks disk space. A full disk causes application failures (logs cannot be written, temp files cannot be created, databases crash). Monitor disk usage and alert at 80%.
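The 80% alerting rule for disk space can be sketched with GNU df's --output option; the threshold and message format here are illustrative:

```shell
# Alert on any mounted filesystem at or above 80% usage.
# --output=pcent,target prints just the use% and mount point columns.
df --output=pcent,target | tail -n +2 | while read -r pcent target; do
  usage=${pcent%\%}   # strip the trailing % sign
  if [ "$usage" -ge 80 ]; then
    echo "ALERT: ${target} at ${pcent}"
  fi
done
echo "disk check complete"
```

Dropped into cron, a script like this catches the slow-growing "logs filled the disk" failure before it takes the application down.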
Network Analysis
ss -tlnp: shows listening TCP sockets (which ports are open and which process owns them). It replaces the older netstat. ss -s shows connection statistics (total established, time-wait, and close-wait connections). High TIME_WAIT means many short-lived connections are being created and closed; consider connection pooling. High CLOSE_WAIT means the application is not closing connections properly (a socket leak).

tcpdump: captures and inspects network packets. tcpdump -i eth0 port 5432 captures PostgreSQL traffic; tcpdump -i any host 10.0.1.5 captures all traffic to or from a specific host. Use -w file.pcap to save a capture for analysis in Wireshark. Useful for debugging connection timeouts, verifying TLS handshakes, inspecting HTTP request/response content, and detecting network-level issues.

curl with timing: curl -o /dev/null -s -w "dns: %{time_namelookup}s, connect: %{time_connect}s, tls: %{time_appconnect}s, first_byte: %{time_starttransfer}s, total: %{time_total}s" https://api.example.com. This breaks latency down into DNS, TCP connect, TLS, server processing (first byte), and total, and identifies whether the latency is in the network or the server.
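The curl timing one-liner is easier to reuse as a --write-out template file. A sketch; the format variables are curl's documented --write-out fields, and the endpoint in the usage comment is a placeholder:

```shell
# Write a reusable timing template for curl's -w/--write-out flag.
# Quoting the heredoc delimiter stops the shell expanding %{...}.
cat > curl-format.txt <<'EOF'
dns:        %{time_namelookup}s
connect:    %{time_connect}s
tls:        %{time_appconnect}s
first_byte: %{time_starttransfer}s
total:      %{time_total}s
EOF
# Usage (placeholder endpoint):
#   curl -o /dev/null -s -w "@curl-format.txt" https://api.example.com
cat curl-format.txt
```

Reading the format from a file with @curl-format.txt keeps the one-liner short and gives every engineer on the team identical latency breakdowns.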
Flame Graphs
Flame graphs are the most effective visualization for understanding where CPU time is spent. Created by Brendan Gregg at Netflix. A flame graph shows the call stack: the x-axis is the sampled stack population (wider means more CPU time), the y-axis is stack depth (callers below, callees above). To read one, find the widest "plateau": a function that consumes significant CPU and does not delegate to children. That function is the bottleneck.

Generating a flame graph: (1) Record a CPU profile: perf record -F 99 -p PID -g -- sleep 30 (sample at 99 Hz for 30 seconds). (2) Generate the flame graph: perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg. (3) Open profile.svg in a browser. The SVG is interactive: click to zoom into a subtree.

Language-specific profilers: for Java, use async-profiler (attaches to a running JVM via jattach, no restart needed). For Go, use pprof (built in: go tool pprof http://localhost:6060/debug/pprof/profile). For Node.js, use the --prof flag or clinic.js. For Python, use py-spy (samples a running process without instrumentation).

Flame graphs are the fastest way to answer "why is this service slow?" and should be the first step in any CPU investigation after top identifies the hot process.
Production Debugging Playbook
When a service is degraded, follow this checklist:

(1) uptime: check the load average. If load >> CPU cores, the system is overloaded.
(2) dmesg | tail: check for kernel errors (OOM kills, disk errors, network issues).
(3) free -h: check available memory. Near zero means memory pressure.
(4) vmstat 1: check for swapping (si/so non-zero), CPU wait (wa > 5%), and the run queue (r >> cores).
(5) iostat -xz 1: check disk utilization and latency. %util near 100% means a disk bottleneck.
(6) ss -s: check for connection issues (high TIME_WAIT, CLOSE_WAIT).
(7) top/htop: identify the hot process.
(8) strace -p PID -c: trace system calls for the hot process. Shows what the process is actually doing (reading files, waiting for the network, sleeping). The -c flag summarizes by system call type.
(9) If CPU is the bottleneck: generate a flame graph to identify the hot code path.
(10) If I/O is the bottleneck: check iostat for which disk, iotop for which process, and lsof for which files.

This systematic approach isolates the bottleneck in under 5 minutes. Practice it on a staging system so it is automatic during a real incident.
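The first few read-only checks can be chained into a single triage script. A sketch that skips any tool that is not installed; the sample counts are kept short for illustration:

```shell
# Run the checklist's read-only commands back to back.
# dmesg is omitted because it usually needs root.
for cmd in "uptime" "free -h" "vmstat 1 3" "iostat -xz 1 3" "ss -s"; do
  tool=${cmd%% *}                       # first word is the binary name
  if command -v "$tool" >/dev/null 2>&1; then
    echo "=== $cmd ==="
    $cmd
  else
    echo "=== $tool not installed, skipping ==="
  fi
done
```

Keeping a script like this on every host means the first minute of an incident produces the same evidence regardless of who is on call.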