When a production system is slow or unresponsive, you need to quickly identify the bottleneck: is it CPU, memory, disk I/O, network, or the application itself? Linux provides powerful performance tools for each resource. Knowing these tools is essential for SRE interviews, on-call troubleshooting, and any backend engineering role. This guide covers the tools every engineer should know, organized by the resource they diagnose.
CPU Analysis
top/htop: the first tool to run. It shows real-time CPU usage per process. Key columns: %CPU (CPU utilization), %MEM (memory), and S (process state: S = sleeping, R = running, D = uninterruptible sleep, usually waiting on I/O). In top, 100% means one full core, so a multithreaded process can exceed 100%. If one process is pinned at or above 100% CPU, it is the likely bottleneck. If many processes show high CPU, the system is overloaded. htop adds color coding, a tree view (parent-child process relationships), and per-core CPU utilization.
Load average: shown in top and uptime as three numbers, the 1-minute, 5-minute, and 15-minute averages. On Linux, load counts processes that are running or runnable plus those in uninterruptible sleep, so high load can come from I/O waits as well as CPU demand. On a 4-core system, a load of 4.0 means roughly fully utilized and 8.0 means overloaded (processes are queuing).
perf: the most powerful CPU profiler on Linux. perf top gives real-time function-level CPU profiling (which functions consume the most CPU cycles). perf record followed by perf report records a CPU profile over a period and analyzes it. perf stat reports high-level CPU statistics (instructions per cycle, cache misses, branch mispredictions). Use perf when top shows high CPU but you need to know which function or code path is responsible.
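The load-average rule of thumb is easy to check with a one-line calculation. The sketch below is only an illustration: load_per_core is a hypothetical helper name, and the threshold you alert on is up to you.

```shell
# Hypothetical helper: divide a load average by the number of CPU cores.
# A result near 1.00 means roughly fully utilized; well above 1.00 means queuing.
load_per_core() {
  # $1 = load average, $2 = number of CPU cores
  awk -v avg="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", avg / cores }'
}

# The 4-core example from the text: a load of 8.0 is twice the core count.
load_per_core 8.0 4   # prints 2.00

# On a live Linux host the inputs could come from /proc/loadavg and nproc:
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

Because Linux load also counts processes in uninterruptible sleep, a high ratio is a prompt to look further, not proof that the CPU is the bottleneck.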
Memory Analysis
free -h: shows total, used, free, buff/cache, and available memory. The “available” column is what matters: it estimates the memory available for new allocations without swapping (free plus reclaimable cache). If “available” is near zero, the system is under memory pressure and may start swapping.
vmstat 1: reports memory, swap, I/O, and CPU statistics every second. Key columns: si/so (swap in/out; sustained non-zero values mean swapping, which kills performance), free (free memory), and cache (filesystem cache). If si/so are consistently non-zero, the system needs more RAM or a memory leak is consuming it.
/proc/meminfo: detailed memory breakdown. Check MemAvailable (usable memory), Buffers (block-device and filesystem metadata cache), Cached (file content cache), and SwapFree.
OOM killer: when Linux runs out of memory, the OOM killer selects and kills a process. Check dmesg for “Out of memory” (the line reads “Kill process” on older kernels and “Killed process” on newer ones). The victim is chosen by its oom_score, which is based mainly on memory usage. For critical processes, write -1000 to /proc/PID/oom_score_adj to exempt them from being killed (but fix the actual memory issue).
Memory leak detection: monitor the RSS (Resident Set Size) of your process over time. If it keeps growing under steady load and never levels off, a leak is likely. Use valgrind (C/C++), jmap heap dumps or histograms (Java), or tracemalloc (Python) to identify the leaking allocation.
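To turn the “available” figure into a single number you can watch or alert on, a small parser over /proc/meminfo is enough. This is a minimal sketch: mem_available_pct is a hypothetical helper name, and it assumes the standard MemTotal and MemAvailable lines are present.

```shell
# Hypothetical helper: MemAvailable as a percentage of MemTotal.
mem_available_pct() {
  # $1 = file in /proc/meminfo format (defaults to the live /proc/meminfo)
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%.1f\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Only read the live file where it exists (Linux).
if [ -r /proc/meminfo ]; then
  mem_available_pct
fi
```

Logging this value once a minute next to the RSS of the suspect process makes it easier to tell a steady leak from ordinary cache growth.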
Disk I/O Analysis
iostat -xz 1: the primary disk I/O tool. Key columns: r/s and w/s (reads/writes per second), r_await and w_await (average latency per read/write in milliseconds; for SSD, expect < 1 ms; for HDD, expect 5-15 ms), and %util (percentage of time the device was busy). If %util is near 100% and await is high, the disk is the bottleneck. On SSDs, NVMe drives, and RAID arrays, which serve requests in parallel, %util near 100% alone does not prove saturation, so confirm with await. Solutions: move to faster storage (NVMe SSD), reduce I/O (add caching, optimize queries), or distribute I/O (RAID, multiple disks).
iotop: like top but for disk I/O. Shows which processes are reading and writing the most. Run iotop -oPa to see accumulated I/O per process.
lsof: lists open files. lsof -p PID shows all files a process has open. Useful for finding which files a process is reading or writing, detecting file descriptor leaks, and identifying lock files.
df -h: checks disk space. A full disk causes application failures (cannot write logs, cannot create temp files, database crashes). Monitor disk usage and alert at 80%.
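The 80% disk-space alert can be sketched as a filter over df output. The helper name over_threshold is hypothetical, and the sketch assumes POSIX-format output (df -P) with no spaces in filesystem or mount-point names.

```shell
# Hypothetical helper: list mount points at or above a usage threshold.
over_threshold() {
  # $1 = usage threshold in percent; stdin = output of `df -P`
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                      # "85%" -> "85"
    if (use + 0 >= limit + 0) print $6, use "%"
  }'
}

# Print every filesystem that is at least 80% full (no output means all clear).
df -P | over_threshold 80
```

In a real monitoring setup the same check would usually live in the metrics system; the shell version is handy for a quick look during an incident.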
Network Analysis
ss -tlnp: shows listening TCP sockets (which ports are open and which process owns them). ss replaces the older netstat. ss -s shows a connection summary (totals such as established and timewait); to list sockets in one particular state, use a filter such as ss -tan state close-wait. High TIME_WAIT: many short-lived connections are being created and closed. Consider connection pooling. High CLOSE_WAIT: the application is not closing connections properly (socket leak).
tcpdump: captures and inspects network packets. tcpdump -i eth0 port 5432 captures PostgreSQL traffic. tcpdump -i any host 10.0.1.5 captures all traffic to or from a specific host. Add -w file.pcap to save the capture for analysis in Wireshark. Useful for debugging connection timeouts, verifying TLS handshakes, inspecting unencrypted HTTP request/response content, and detecting network-level issues.
curl with timing: curl -o /dev/null -s -w "dns: %{time_namelookup}s, connect: %{time_connect}s, tls: %{time_appconnect}s, first_byte: %{time_starttransfer}s, total: %{time_total}s" https://api.example.com. Each value is the time elapsed since the start of the request, so subtract adjacent values to get the duration of each phase: DNS, TCP connect, TLS, server processing (time to first byte), and total. This shows whether the latency comes from the network or the server.
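To see how many sockets sit in TIME_WAIT or CLOSE_WAIT, one option is to tally the state column of ss -tan. This is a sketch under the assumption that the state is the first column of that output; count_states is a hypothetical helper name.

```shell
# Hypothetical helper: count TCP sockets per state, most common first.
count_states() {
  # stdin = output of `ss -tan`; line 1 is the header, column 1 is the state
  awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}

# Run against the live socket table when ss is installed.
if command -v ss >/dev/null 2>&1; then
  ss -tan | count_states
fi
```

A CLOSE-WAIT count that keeps climbing between runs points at the application leaking sockets rather than at the network.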
Flame Graphs
Flame graphs are the most effective visualization for understanding where CPU time is spent. Created by Brendan Gregg (Netflix). A flame graph shows the call stack: the x-axis is the sampled stack population (wider = more CPU time; the left-to-right order is alphabetical, not chronological), and the y-axis is stack depth (callers below, callees above).
Reading a flame graph: find the widest “plateau” (a function that consumes significant CPU itself and does not delegate to children). That function is the likely bottleneck.
Generating a flame graph: (1) Record a CPU profile: perf record -F 99 -p PID -g -- sleep 30 (sample at 99 Hz for 30 seconds). (2) Generate the flame graph: perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg (both scripts come from Brendan Gregg's FlameGraph repository). (3) Open profile.svg in a browser. The SVG is interactive: click to zoom into a subtree.
For Java: use async-profiler (attaches to a running JVM, no restart needed). For Go: use pprof (built in once the service imports net/http/pprof: go tool pprof http://localhost:6060/debug/pprof/profile). For Node.js: use the --prof flag or clinic.js. For Python: use py-spy (samples a running Python process without instrumentation). Flame graphs are the fastest way to answer “why is this service slow?” and should be the first step in any CPU investigation after top identifies the hot process.
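The “widest plateau” can also be found in text form. stackcollapse-perf.pl emits folded stacks, one per line, as semicolon-separated frames followed by a sample count; summing the counts by the last frame gives the samples each function spent on-CPU itself. The sketch below assumes that folded format, and self_samples is a hypothetical helper name.

```shell
# Hypothetical helper: samples spent in each leaf function, largest first.
self_samples() {
  # stdin = folded stacks, one per line: "frame1;frame2;...;leaf COUNT"
  awk '{
    count = $NF
    stack = $0
    sub(/ [0-9]+$/, "", stack)        # drop the trailing sample count
    depth = split(stack, frames, ";")
    self[frames[depth]] += count      # the last frame was the one on-CPU
  }
  END { for (f in self) print self[f], f }' | sort -rn
}

# Typical use, assuming perf and the FlameGraph scripts are on PATH:
#   perf script | stackcollapse-perf.pl | self_samples | head
```

The function at the top of this list should match the widest plateau in the SVG, which makes it a quick cross-check when you cannot open a browser on the host.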
Production Debugging Playbook
When a service is degraded, follow this checklist:
(1) uptime: check load average. If load >> CPU cores, the system is overloaded.
(2) dmesg | tail: check for kernel errors (OOM kills, disk errors, network issues).
(3) free -h: check available memory. Near zero = memory pressure.
(4) vmstat 1: check for swapping (si/so non-zero), CPU wait (wa > 5%), and run queue (r >> cores).
(5) iostat -xz 1: check disk utilization and latency. %util near 100% with high await = disk bottleneck.
(6) ss -s: check for connection issues (high TIME_WAIT; use ss -tan state close-wait to list CLOSE_WAIT sockets).
(7) top/htop: identify the hot process.
(8) strace -p PID -c: trace system calls for the hot process. Shows what the process is actually doing (reading files, waiting for network, sleeping). The -c flag summarizes by system call type; press Ctrl-C to stop tracing and print the summary.
(9) If CPU is the bottleneck: generate a flame graph to identify the hot code path.
(10) If I/O is the bottleneck: check iostat for which disk, iotop for which process, and lsof for which files.
This systematic approach usually narrows down the bottleneck within a few minutes. Practice it on a staging system so it is automatic during a real incident.
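The first seven checks can be bundled into one function so that nothing is skipped under pressure. This is a rough sketch, not a hardened tool: have, section, and triage are hypothetical names, the sample counts (5 and 3) and line limits are arbitrary, and top -b assumes the procps version of top. Tools that are not installed are skipped.

```shell
# Hypothetical helpers for a first-pass triage run.
have() { command -v "$1" >/dev/null 2>&1; }      # is the tool installed?
section() { printf '\n== %s ==\n' "$1"; }        # print a section header

triage() {
  section "load average";   uptime
  section "kernel log";     dmesg 2>/dev/null | tail -n 20
  section "memory";         have free   && free -h
  section "vmstat";         have vmstat && vmstat 1 5
  section "disk I/O";       have iostat && iostat -xz 1 3
  section "sockets";        have ss     && ss -s
  section "top processes";  have top    && top -b -n 1 | head -n 15
  return 0
}

# Run it and keep a copy for the incident notes:
#   triage 2>&1 | tee triage.txt
```

Steps 8 to 10 are left out on purpose, because they need a PID or a device chosen from the output of the earlier steps.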