PostgreSQL is one of the most popular open-source relational databases, powering applications from startups to enterprises. Understanding its internals (how MVCC provides transaction isolation, why VACUUM is necessary, how the WAL ensures durability, and how to tune performance) separates senior engineers from juniors. This guide covers the internals that matter for production operations and interviews.
MVCC: Multi-Version Concurrency Control
PostgreSQL uses MVCC so that concurrent readers and writers do not block each other. Instead of locking rows for reads, each transaction sees a snapshot of the database taken when it started. Writers create new row versions; readers see whichever version is visible to their snapshot.

Implementation: every row carries hidden system columns. xmin is the ID of the transaction that created this version, xmax is the ID of the transaction that deleted or updated it (0 if the version is still live), and ctid is the version's physical location on disk.

When a row is updated, PostgreSQL does NOT modify the existing row in place. It writes a new version (with the updated values and a new xmin) and stamps the old version's xmax with the updating transaction's ID. Both versions exist on disk simultaneously. A reader whose snapshot predates the update still sees the old version (that xmax is not yet visible to it); a reader with a newer snapshot sees the new version.

The result: readers never block writers, writers never block readers, and the only conflict is two writers updating the same row (one waits for the other to commit or abort). The cost: dead rows (old versions) accumulate on disk and must be cleaned up by VACUUM.
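The hidden system columns are directly queryable, which makes the versioning visible. A minimal sketch (the `accounts` table and its columns are illustrative, not from the original text):

```sql
-- Illustrative table for inspecting MVCC system columns
CREATE TABLE accounts (id int PRIMARY KEY, balance int);
INSERT INTO accounts VALUES (1, 100);

-- xmin = creating transaction ID; xmax = 0 while no transaction
-- has deleted or updated this version; ctid = physical location
SELECT xmin, xmax, ctid, id, balance FROM accounts;

BEGIN;
UPDATE accounts SET balance = 90 WHERE id = 1;
-- This session now sees the NEW version (fresh xmin, xmax = 0).
-- A concurrent transaction whose snapshot predates the UPDATE still
-- sees the OLD version: its xmax (our transaction ID) is uncommitted,
-- so the deletion is not yet visible to that snapshot.
COMMIT;
```

After the commit, the old version remains on disk as a dead row until VACUUM reclaims it.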
VACUUM: Dead Row Cleanup
VACUUM reclaims space occupied by dead rows: old MVCC versions that are no longer visible to any transaction. Without VACUUM, the database grows indefinitely, because updates and deletes create dead rows that are never removed. This is called table bloat.

Autovacuum: PostgreSQL runs VACUUM automatically based on configurable thresholds. The default trigger is dead rows exceeding 20% of the table (autovacuum_vacuum_scale_factor = 0.2) plus a base threshold of 50 rows (autovacuum_vacuum_threshold). For a 1-million-row table, that means vacuuming after approximately 200,050 dead rows accumulate. Autovacuum runs in the background without blocking queries.

VACUUM proceeds in phases: (1) scan the table for dead rows (rows whose xmax belongs to a committed transaction and is no longer visible to any active transaction); (2) mark dead rows as available for reuse (future inserts can fill the space); (3) update the visibility map, which tracks pages containing only live rows and enables Index Only Scans.

VACUUM FULL rewrites the entire table to a new file, compacting it and returning disk space to the OS. It requires an exclusive lock (blocking all reads and writes), so use it only when bloat is severe and a maintenance window is available. For routine maintenance, regular VACUUM (not FULL) is sufficient.
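Dead-row accumulation and autovacuum activity can be monitored from the statistics views, and the trigger thresholds can be tuned per table. A sketch (the `accounts` table name is illustrative):

```sql
-- Dead-row counts and last (auto)vacuum times per table
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Defaults behind the "20% + 50 rows" trigger
SHOW autovacuum_vacuum_threshold;     -- 50
SHOW autovacuum_vacuum_scale_factor;  -- 0.2

-- Vacuum a hot table more aggressively via a per-table storage parameter
ALTER TABLE accounts SET (autovacuum_vacuum_scale_factor = 0.05);
```

Lowering the scale factor per table is the usual fix for large, update-heavy tables where the 20% default lets millions of dead rows pile up between runs.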
WAL: Write-Ahead Log
The Write-Ahead Log guarantees durability: every change is written to the WAL before the corresponding data pages are modified on disk. If PostgreSQL crashes, it replays the WAL on startup to recover committed transactions.

WAL flow: (1) a transaction modifies data in shared memory (the buffer cache); (2) the change is recorded in the WAL buffer (also in memory); (3) on COMMIT, the WAL buffer is flushed to disk with fsync, and the commit returns to the client only after the WAL record is safely on disk; (4) the dirty data pages in the buffer cache are written out later, asynchronously, by the background writer or the checkpointer.

This design is fast because WAL writes are sequential (an append-only log is the fastest I/O pattern), data page writes are deferred (the background writer batches them), and only the WAL must be fsynced at commit time: one small sequential write, not every modified data page.

The WAL also powers streaming replication (replicas read the WAL stream and apply changes), point-in-time recovery (PITR: replay WAL from a base backup up to any point in time), and logical replication (WAL entries are decoded into logical changes for CDC tools such as Debezium).

Key WAL settings: wal_level (replica for physical replication, logical for CDC), max_wal_size (a soft limit that triggers a checkpoint), and wal_compression (compresses full-page writes, often cutting WAL volume by half or more).
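The WAL position and the settings above can be inspected from any session; a quick sketch:

```sql
-- Current WAL write position (log sequence number)
SELECT pg_current_wal_lsn();

-- Settings discussed above
SHOW wal_level;        -- 'replica' for physical replication, 'logical' for CDC
SHOW max_wal_size;     -- soft limit that triggers a checkpoint
SHOW wal_compression;  -- compresses full-page images in the WAL

-- How far WAL has advanced between two points measures write volume
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') AS total_wal_bytes;
```

Comparing `pg_current_wal_lsn()` before and after a workload (with `pg_wal_lsn_diff`) is a simple way to measure how much WAL that workload generates.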
TOAST: Large Value Storage
PostgreSQL pages are 8 KB. A row that exceeds approximately 2 KB triggers TOAST (The Oversized-Attribute Storage Technique), which compresses large values and/or moves them out of line into a separate TOAST table.

Storage strategies per column: (1) EXTENDED (the default for text and jsonb): try compression first; if the value is still too large, move it to the TOAST table and store a pointer in the main table. (2) EXTERNAL: move to the TOAST table without compression; useful for already-compressed data such as images. (3) MAIN: prefer compression and keep the value in the main table, moving it out of line only as a last resort. (4) PLAIN: no compression, no out-of-line storage; only for fixed-size types that always fit in a page.

Impact: TOAST is transparent to queries; PostgreSQL automatically decompresses and reassembles values. But it adds overhead: reading a TOASTed column requires additional I/O against the TOAST table. For tables with large jsonb columns, queries that SELECT * read the TOAST table for every row, even if the jsonb column is not used in the result. SELECT only the columns you need to avoid unnecessary TOAST reads.
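TOAST behavior can be observed and adjusted per column. A sketch, assuming an illustrative `docs` table with a jsonb column `payload`:

```sql
-- Illustrative table with a TOAST-able column
CREATE TABLE docs (id int PRIMARY KEY, payload jsonb);

-- pg_column_size reports the stored (possibly compressed) size;
-- octet_length of the text form approximates the logical size
SELECT id,
       pg_column_size(payload)     AS stored_bytes,
       octet_length(payload::text) AS logical_bytes
FROM docs;

-- Switch strategy: skip compression for data that is already compressed
ALTER TABLE docs ALTER COLUMN payload SET STORAGE EXTERNAL;
```

A large gap between `stored_bytes` and `logical_bytes` indicates the value was compressed; `SET STORAGE EXTERNAL` trades disk space for CPU by skipping that compression step.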
Performance Tuning Essentials
Key configuration parameters:

(1) shared_buffers: the main memory cache for data pages. A common starting point is 25% of total RAM (e.g., 4 GB on a 16 GB server). Too small causes excessive disk reads; too large leaves less memory for the OS page cache.

(2) work_mem: memory per operation for sorts, hash joins, and aggregations. The default 4 MB is low; 64-256 MB suits analytical workloads. But budget carefully: a single query can run several such operations at once, so as a rough worst-case estimate multiply by the number of concurrent connections (256 MB * 100 connections is already about 25 GB).

(3) effective_cache_size: tells the query planner how much cache is available in total (shared_buffers plus the OS page cache). Set it to about 75% of total RAM. It does not allocate memory; it helps the planner choose index scans over sequential scans when data is likely cached.

(4) maintenance_work_mem: memory for VACUUM, CREATE INDEX, and other maintenance operations. Setting it to 1-2 GB speeds these up considerably.

(5) random_page_cost: the planner's estimated cost of a random disk read relative to a sequential one. The default 4.0 assumes spinning disks. Set 1.1-1.5 for SSDs, where random and sequential reads are similar in speed; this makes the planner more willing to use index scans.

(6) max_connections: default 100. For applications using connection pooling (PgBouncer), 100-200 is sufficient. Higher values waste memory, since each connection uses roughly 5-10 MB.
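These parameters can be set without hand-editing postgresql.conf via ALTER SYSTEM (which writes postgresql.auto.conf). A sketch for the 16 GB server example above; the specific values are the starting points suggested in the text, not universal recommendations:

```sql
-- 16 GB server, SSD storage; shared_buffers needs a restart,
-- the rest take effect on a config reload
ALTER SYSTEM SET shared_buffers = '4GB';          -- ~25% of RAM (restart)
ALTER SYSTEM SET work_mem = '64MB';               -- per sort/hash operation
ALTER SYSTEM SET effective_cache_size = '12GB';   -- ~75% of RAM (planner hint)
ALTER SYSTEM SET maintenance_work_mem = '1GB';    -- VACUUM, CREATE INDEX
ALTER SYSTEM SET random_page_cost = 1.1;          -- SSD, not the HDD default 4.0

-- Apply reloadable settings now
SELECT pg_reload_conf();
```

Settings like shared_buffers that require a restart will show the new value in pg_settings with pending_restart = true until the server is restarted.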