Low Level Design: PostgreSQL Internals

Process Architecture

PostgreSQL uses a multi-process architecture rather than threads. The postmaster is the root process: it listens for incoming connections and forks a new backend process for each client connection. Each backend process handles exactly one client session for its entire lifetime. This isolation means a crashing backend cannot corrupt the state of other backends — a deliberate design choice for stability.

All backend processes communicate through shared memory, which contains the buffer pool (page cache), WAL buffers, lock tables, and other shared state. Background processes run independently: autovacuum workers reclaim dead tuples, the WAL writer flushes WAL buffers to disk, the bgwriter writes dirty pages from the buffer pool, the checkpointer periodically forces all dirty pages to disk, and the stats collector gathers query and table statistics (in PostgreSQL 15 and later, the stats collector process was replaced by a shared-memory statistics subsystem).

Heap File Storage

Each table is stored as a heap file — a collection of fixed-size 8KB pages (blocks). The page layout is:

  • Page header: LSN of the latest WAL record touching this page, free space pointers, flags.
  • Item array: an array of (offset, length) pairs immediately after the page header, each pointing to a tuple within the page.
  • Free space: unused space between the item array (growing down) and tuples (growing up).
  • Tuples: stored from the end of the page upward, each with a header followed by attribute data.
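
This layout can be sketched as a toy model (Python, illustrative only — the constants and field names are simplified, not the actual C structs):

```python
from dataclasses import dataclass, field

PAGE_SIZE = 8192
HEADER_SIZE = 24          # LSN, flags, free-space pointers (simplified)
ITEM_ID_SIZE = 4          # each line pointer is a 4-byte (offset, length) pair

@dataclass
class HeapPage:
    """Illustrative model: item pointers grow down from the header,
    tuple data grows up from the end of the page."""
    items: list = field(default_factory=list)   # (offset, length) line pointers
    tuple_bytes: int = 0                        # total bytes used by tuple data

    def free_space(self) -> int:
        lower = HEADER_SIZE + ITEM_ID_SIZE * len(self.items)
        upper = PAGE_SIZE - self.tuple_bytes
        return upper - lower

    def add_tuple(self, length: int) -> bool:
        if self.free_space() < length + ITEM_ID_SIZE:
            return False                        # page full; insert goes elsewhere
        offset = PAGE_SIZE - self.tuple_bytes - length
        self.tuple_bytes += length
        self.items.append((offset, length))
        return True

page = HeapPage()
page.add_tuple(100)
print(page.free_space())   # 8192 - 24 - 4 - 100 = 8064
```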

The tuple header contains: xmin (transaction ID that inserted this tuple), xmax (transaction ID that deleted or updated this tuple, zero if live), t_ctid (physical location of the newest version of this row — block number + item offset; it points at the tuple itself if no newer version exists, which is how update chains and HOT updates are followed), a null bitmap, and infomask flags. Attribute data follows the header, with alignment padding inserted as each attribute's type requires.

MVCC: Multi-Version Concurrency Control

PostgreSQL implements MVCC by keeping multiple versions of each row. When a row is updated, the old tuple is not overwritten — instead, the old tuple’s xmax is set to the updating transaction’s ID, and a new tuple version is inserted with xmin set to the same transaction ID. Readers see the version appropriate to their snapshot; writers create new versions. No read locks are needed — readers never block writers and writers never block readers.
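
The update path above can be sketched as a toy model (Tuple and mvcc_update are illustrative names, not PostgreSQL internals):

```python
from dataclasses import dataclass

@dataclass
class Tuple:
    xmin: int          # XID of the transaction that inserted this version
    xmax: int          # XID of the transaction that deleted/updated it; 0 if live
    data: str

def mvcc_update(versions, updating_xid, new_data):
    """UPDATE never overwrites in place: stamp xmax on the old version and
    append a new version whose xmin is the updating transaction's XID."""
    old = versions[-1]
    old.xmax = updating_xid
    versions.append(Tuple(xmin=updating_xid, xmax=0, data=new_data))

row = [Tuple(xmin=100, xmax=0, data="v1")]
mvcc_update(row, updating_xid=200, new_data="v2")
print(row[0].xmax, row[1].xmin)  # 200 200 — both versions remain on disk
```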

A snapshot is captured at the start of each transaction (or each statement in READ COMMITTED). It records: the next transaction ID to be assigned (the xmax boundary — that XID and anything later is invisible), the list of in-progress transaction IDs, and the oldest still-active transaction ID (the xmin boundary). A tuple is visible to a snapshot if:

  • xmin is committed and was committed before the snapshot was taken, AND
  • xmax is zero, belongs to a transaction that aborted or is still in progress, or was committed after the snapshot was taken.
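
These rules can be condensed into a simplified visibility check (illustrative only — the real check also handles hint bits, subtransactions, and other cases):

```python
def visible(tup_xmin, tup_xmax, snap_xmax, in_progress, committed):
    """Simplified MVCC snapshot visibility. `committed` is the set of
    committed XIDs; 'committed before the snapshot' means the XID is
    committed, below the snapshot's xmax boundary, and was not listed
    as in-progress when the snapshot was taken."""
    def committed_before_snapshot(xid):
        return xid in committed and xid < snap_xmax and xid not in in_progress
    if not committed_before_snapshot(tup_xmin):
        return False                      # inserting transaction not visible yet
    if tup_xmax == 0:
        return True                       # never deleted or updated
    return not committed_before_snapshot(tup_xmax)  # deleter invisible => row visible

# Snapshot taken while XID 150 was still running (it committed later):
committed = {100, 150}
print(visible(100, 0, snap_xmax=151, in_progress={150}, committed=committed))  # True
print(visible(150, 0, snap_xmax=151, in_progress={150}, committed=committed))  # False
```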

This means old tuple versions accumulate on disk until VACUUM removes them. Transaction IDs are 32-bit integers; to keep the ID space from being exhausted, VACUUM periodically freezes sufficiently old tuples (VACUUM FREEZE and anti-wraparound autovacuum), marking them as visible to all future transactions.

Write-Ahead Log (WAL)

WAL is the backbone of PostgreSQL’s durability and replication. Before any change is applied to a data page, a WAL record describing that change is written to the WAL buffer and flushed to disk. On commit, the WAL record for the commit is fsynced — after that, the transaction is durable even if the server crashes before dirty data pages are written.

Each WAL record is addressed by a Log Sequence Number (LSN — its byte offset in the WAL stream) and contains a resource manager identifier (heap, index, sequence, etc.) and the change data (a full page image on the first write after a checkpoint, or a delta record thereafter). WAL is written as a sequence of 16MB segment files in pg_wal/ (the segment size is configurable at initdb time). WAL is used for: crash recovery (replay WAL from the last checkpoint), streaming replication (WAL records streamed to standby servers in real time), and point-in-time recovery (PITR).
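
The WAL-before-data protocol can be illustrated with a toy log (all names here are hypothetical; real WAL records are binary and page-oriented):

```python
class MiniWAL:
    """Toy write-ahead log: every change is logged before it touches
    'data'; recovery replays the log to rebuild committed state."""
    def __init__(self):
        self.log = []          # durable (fsynced) records, in order
        self.data = {}         # data pages — allowed to lag behind the log

    def write(self, xid, key, value):
        self.log.append(("SET", xid, key, value))   # WAL record first
        # the data-page update may happen much later (or be lost in a crash)

    def commit(self, xid):
        self.log.append(("COMMIT", xid))            # fsync here => durable

    def recover(self):
        """Crash recovery: replay the log, applying only committed xids."""
        committed = {r[1] for r in self.log if r[0] == "COMMIT"}
        self.data = {}
        for rec in self.log:
            if rec[0] == "SET" and rec[1] in committed:
                self.data[rec[2]] = rec[3]
        return self.data

wal = MiniWAL()
wal.write(1, "a", 10); wal.commit(1)
wal.write(2, "b", 20)            # crash before xid 2 commits
print(wal.recover())             # {'a': 10} — xid 2's change is discarded
```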

VACUUM and Dead Tuple Reclamation

MVCC creates dead tuple versions — rows whose xmax is committed and which are no longer visible to any active snapshot. VACUUM scans the heap, identifies dead tuples, marks their space as reusable, and updates the Free Space Map (FSM) and Visibility Map (VM). The VM tracks which pages contain only tuples visible to all transactions — index-only scans can skip heap fetches for VM-clean pages.

Autovacuum runs in the background and triggers VACUUM on a table when dead tuple count exceeds a threshold (autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples). Regular VACUUM reclaims space for reuse within the same file; it does not return space to the OS. VACUUM FULL rewrites the entire table to a new file, compacting it — but it holds an exclusive lock for the duration and is disruptive to production workloads. pg_repack is a popular extension that does online table compaction without an exclusive lock.
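
The trigger condition above, with the default settings (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2), works out as:

```python
def autovacuum_triggers(dead_tuples, reltuples,
                        threshold=50, scale_factor=0.2):
    """Defaults mirror autovacuum_vacuum_threshold = 50 and
    autovacuum_vacuum_scale_factor = 0.2."""
    return dead_tuples > threshold + scale_factor * reltuples

# For a 10,000-row table, vacuum fires once dead tuples exceed 50 + 2000:
print(autovacuum_triggers(dead_tuples=1000, reltuples=10_000))   # False
print(autovacuum_triggers(dead_tuples=3000, reltuples=10_000))   # True
```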

Buffer Pool

The buffer pool (shared_buffers) is PostgreSQL’s page cache in shared memory — a fixed array of 8KB page slots. When a backend needs a page, it first checks the buffer pool. On a miss, it loads the page from disk into a free slot. The eviction algorithm is a variant of clock sweep: a usage count is incremented on each access and decremented on each sweep pass; pages with zero usage count are evicted. This approximates LRU with low overhead.
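
A minimal sketch of the clock-sweep policy (illustrative; the real buffer descriptors live in shared memory and use atomic operations):

```python
class ClockSweep:
    """Clock-sweep eviction sketch: usage counts rise on access (capped),
    fall as the sweep hand passes; a zero-count slot is the victim."""
    MAX_USAGE = 5                       # PostgreSQL caps usage_count at 5

    def __init__(self, nslots):
        self.usage = [0] * nslots
        self.hand = 0

    def access(self, slot):
        self.usage[slot] = min(self.usage[slot] + 1, self.MAX_USAGE)

    def evict(self):
        while True:
            if self.usage[self.hand] == 0:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.usage)
                return victim
            self.usage[self.hand] -= 1  # second chance: decrement and move on
            self.hand = (self.hand + 1) % len(self.usage)

pool = ClockSweep(3)
pool.access(0); pool.access(0); pool.access(1)
print(pool.evict())   # 2 — the only slot never accessed
```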

Dirty pages (modified but not yet written to disk) are written by the bgwriter process continuously in the background, and by the checkpointer during checkpoints. A checkpoint guarantees that all dirty pages at that moment have been flushed — after a checkpoint, crash recovery only needs to replay WAL from the checkpoint LSN onward. Checkpoints are triggered by time (checkpoint_timeout) or WAL volume (max_wal_size).

Query Planner and Optimizer

PostgreSQL uses a cost-based optimizer. The planner reads statistics from pg_statistic (collected by ANALYZE): row counts, distinct value counts, most common values, histogram bounds, and correlation. From these it estimates the number of rows each plan node will process (cardinality estimation) and assigns a cost in arbitrary units (proportional to I/O and CPU time).
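
As a simplified example of the cost model, a sequential scan costs roughly "pages read plus per-tuple CPU work", using the default GUCs seq_page_cost = 1.0 and cpu_tuple_cost = 0.01 (the real costing also charges operator CPU costs, accounts for parallel workers, and more):

```python
def seq_scan_cost(pages, rows,
                  seq_page_cost=1.0, cpu_tuple_cost=0.01):
    """Simplified sequential-scan cost in the planner's arbitrary units:
    one unit per page read plus a small CPU charge per tuple processed."""
    return pages * seq_page_cost + rows * cpu_tuple_cost

# A 1000-page table holding 100,000 rows:
print(seq_scan_cost(pages=1000, rows=100_000))   # 2000.0
```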

Key plan node types:

  • Sequential Scan: reads all pages of the table. Best when the query touches a large fraction of rows.
  • Index Scan: uses an index to find matching TIDs, then fetches heap pages. Best for selective predicates.
  • Bitmap Heap Scan: builds a bitmap of matching pages from one or more indexes, then fetches pages in physical order. Good for moderate selectivity or OR conditions.
  • Hash Join: build hash table from smaller relation, probe with larger. Good for large equi-joins.
  • Merge Join: requires both inputs sorted on join key. Best when inputs are already sorted.
  • Nested Loop: for each row in outer, scan inner. Best for small outer sets with indexed inner.

EXPLAIN ANALYZE executes the query and shows actual row counts and timing alongside estimates — the primary tool for diagnosing plan quality and identifying cardinality estimation errors.

Index Types

PostgreSQL supports multiple index types for different query patterns:

  • B-tree (default): balanced tree supporting equality, range, IS NULL, LIKE 'prefix%'. Suitable for most use cases.
  • Hash: O(1) equality lookup. No range support. Rarely worth choosing over B-tree except for equality-only workloads on large keys.
  • GiST (Generalized Search Tree): extensible framework supporting geometric types, full-text search (tsvector), and custom operators. Used by PostGIS for spatial indexes.
  • GIN (Generalized Inverted Index): designed for composite values — JSONB keys, array elements, full-text search tokens. Stores a posting list per key, enabling fast containment queries (@>, @@).
  • BRIN (Block Range Index): extremely small index storing min/max values per range of heap blocks. Effective for very large, naturally ordered tables (e.g., time-series data with sequential inserts) where approximate range filtering eliminates most blocks.
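
The BRIN idea — store only min/max per block range and skip ranges that cannot match — can be sketched as a toy (illustrative, not the real summarization code):

```python
def build_brin(values, pages_per_range=4):
    """BRIN sketch: summarize each range of consecutive 'pages' (here,
    fixed-size chunks of the value list) as a (min, max) pair."""
    ranges = []
    for i in range(0, len(values), pages_per_range):
        chunk = values[i:i + pages_per_range]
        ranges.append((min(chunk), max(chunk)))
    return ranges

def candidate_ranges(brin, lo, hi):
    """Return the indexes of ranges that might contain values in [lo, hi];
    all other ranges are skipped without touching the heap."""
    return [i for i, (mn, mx) in enumerate(brin) if mx >= lo and mn <= hi]

# Naturally ordered data (e.g., timestamps from sequential inserts):
data = list(range(100))
brin = build_brin(data, pages_per_range=10)
print(candidate_ranges(brin, 42, 45))   # [4] — 9 of 10 ranges skipped
```

Note how the index stays tiny: one (min, max) pair per range of blocks, regardless of how many rows those blocks hold — which is exactly why BRIN suits very large, naturally ordered tables.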

Extensions

PostgreSQL’s extension system allows adding data types, index methods, functions, and operators without modifying the core. Key extensions used in production:

  • PostGIS: adds geometry/geography types, spatial indexes (GiST), and hundreds of spatial functions. The standard for geospatial data in PostgreSQL.
  • pg_trgm: trigram-based similarity for fuzzy text search and LIKE/ILIKE index acceleration via GiST or GIN indexes.
  • TimescaleDB: hypertable abstraction that automatically partitions time-series data into chunks, with query optimizations (chunk pruning, parallel chunk scans) and continuous aggregates. Runs entirely as a PostgreSQL extension — no separate process.
  • pgvector: vector similarity search for ML embeddings, supporting L2, cosine, and inner product distances with IVFFlat and HNSW indexes.