System Design: PostgreSQL Internals — MVCC, Vacuum, WAL, TOAST, Connection Handling, Performance Tuning

PostgreSQL is the most popular open-source relational database, powering applications from startups to enterprises. Understanding its internals — how MVCC provides transaction isolation, why VACUUM is necessary, how the WAL ensures durability, and how to tune performance — separates senior engineers from juniors. This guide covers the internals that matter for production operations and interviews.

MVCC: Multi-Version Concurrency Control

PostgreSQL uses MVCC to allow concurrent readers and writers without blocking each other. Instead of locking rows for reads, each transaction sees a snapshot of the database at the time it started. Writers create new row versions; readers see the version visible to their snapshot. Implementation: each row has hidden system columns: xmin (the transaction ID that created this version), xmax (the transaction ID that deleted/updated this version, 0 if still live), and ctid (physical location on disk). When a row is updated: PostgreSQL does NOT modify the existing row in place. It creates a new version of the row (with the updated values and a new xmin) and marks the old version with xmax = the updating transaction ID. Both versions exist on disk simultaneously. A reader with a snapshot older than the update sees the old version (xmax is invisible to it). A reader with a newer snapshot sees the new version. This means: readers never block writers, writers never block readers, and the only conflict is two writers updating the same row (one waits for the other to commit or abort). The cost: dead rows (old versions) accumulate on disk and must be cleaned up by VACUUM.

VACUUM: Dead Row Cleanup

VACUUM reclaims space occupied by dead rows (old MVCC versions that are no longer visible to any transaction). Without VACUUM, the database grows indefinitely as updates and deletes create dead rows that are never removed. This is called table bloat. Autovacuum: PostgreSQL runs VACUUM automatically based on configurable thresholds. The default: vacuum when dead rows exceed 20% of the table + 50 base threshold. For a 1-million-row table: vacuum after approximately 200,050 dead rows accumulate. Autovacuum runs in the background without blocking queries. VACUUM phases: (1) Scan the table for dead rows (rows with xmax from committed transactions that are no longer visible to any active transaction). (2) Mark dead rows as available for reuse (the space can be filled by future inserts). (3) Update the visibility map (tracks which pages contain only live rows — enables Index Only Scans). VACUUM FULL: rewrites the entire table to a new file, compacting it and reclaiming disk space to the OS. Requires an exclusive lock (blocks all reads and writes). Use only when table bloat is severe and a maintenance window is available. For routine maintenance, regular VACUUM (not FULL) is sufficient.

WAL: Write-Ahead Log

The Write-Ahead Log guarantees durability: every change is written to the WAL before modifying the actual data pages. If PostgreSQL crashes, it replays the WAL on startup to recover committed transactions. WAL flow: (1) A transaction modifies data in shared memory (buffer cache). (2) The change is recorded in the WAL buffer (in memory). (3) On COMMIT, the WAL buffer is flushed to disk (fsync). The commit returns to the client only after the WAL is safely on disk. (4) The dirty data pages in the buffer cache are written to disk later by the background writer or checkpointer (asynchronously). This design is fast because: WAL writes are sequential (append-only log — the fastest I/O pattern). Data page writes are deferred (the background writer batches them). Only the WAL must be fsynced at commit time (one small sequential write), not the entire modified data. WAL is also used for: streaming replication (replicas read the WAL stream and apply changes), point-in-time recovery (PITR — replay WAL from a base backup to any point in time), and logical replication (decode WAL entries into logical changes for CDC with Debezium). WAL sizing: wal_level (replica for physical replication, logical for CDC), max_wal_size (checkpoint trigger), and wal_compression (reduces WAL volume by 50%+).

TOAST: Large Value Storage

PostgreSQL pages are 8 KB. A row that exceeds approximately 2 KB triggers TOAST (The Oversized-Attribute Storage Technique). TOAST compresses and/or moves large values to a separate TOAST table. Strategies: (1) EXTENDED (default for text, jsonb) — first try compression. If still too large, move to the TOAST table. The main table stores a pointer. (2) EXTERNAL — move to TOAST table without compression. Use for pre-compressed data (images). (3) MAIN — try compression only; do not move to TOAST table if possible. (4) PLAIN — no compression, no TOAST. Only for fixed-size types that always fit in a page. Impact: TOAST is transparent to queries — PostgreSQL automatically decompresses and reassembles values. But TOAST adds overhead: reading a TOASTed column requires an additional I/O to the TOAST table. For tables with large jsonb columns, queries that SELECT * read the TOAST table for every row, even if the jsonb column is not used in the result. SELECT only the columns you need to avoid unnecessary TOAST reads.

Performance Tuning Essentials

Key configuration parameters: (1) shared_buffers — the main memory cache for data pages. Set to 25% of total RAM (e.g., 4 GB for a 16 GB server). Too small: excessive disk reads. Too large: reduces memory available for the OS page cache. (2) work_mem — memory per operation for sorts, hash joins, and aggregations. Default 4 MB is low. Set to 64-256 MB for analytical workloads (but multiply by max_connections to estimate total memory: 256 MB * 100 connections = 25 GB). (3) effective_cache_size — tells the query planner how much total cache is available (shared_buffers + OS page cache). Set to 75% of total RAM. This does not allocate memory; it helps the planner choose index scans over sequential scans when data is likely cached. (4) maintenance_work_mem — memory for VACUUM, CREATE INDEX, and other maintenance operations. Set to 1-2 GB. Higher values speed up these operations. (5) random_page_cost — the planner estimated cost of a random disk read relative to a sequential read. Default 4.0 (HDD). Set to 1.1-1.5 for SSD (random and sequential reads are similar speed on SSD). This makes the planner more willing to use index scans. (6) max_connections — default 100. For applications using connection pooling (PgBouncer), 100-200 is sufficient. Higher values waste memory (each connection uses 5-10 MB).

Scroll to Top