Low Level Design: Write-Ahead Log (WAL)

The Write-Ahead Log (WAL) is the foundation of database durability and crash recovery. Every change to database state is first written to the WAL (a sequential append-only log on disk) before being applied to the actual data structures. If the database crashes mid-operation, the WAL is replayed on restart to bring the database to a consistent state. WAL enables ACID durability without requiring every write to flush all affected data structures to disk.

WAL Write Path

When a transaction commits: (1) Write all changes to the WAL as log records (BEGIN, individual data changes, COMMIT). (2) fsync the WAL to disk (flush OS write buffers — ensures the log survives a crash). (3) Return success to the client. (4) Apply the changes to the in-memory buffer pool (and eventually to the data files). The key insight: the WAL is sequential (append-only) while data file writes are random — sequential I/O is 10-100x faster than random I/O on spinning disks, and much faster even on SSDs. WAL turns all writes into sequential operations, dramatically increasing write throughput.
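The four-step write path above can be sketched in a few lines. This is an illustrative minimal sketch, not a real engine: the `SimpleWAL` class, the length-prefixed record format, and the dict standing in for the buffer pool are all assumptions for the example.

```python
import os

class SimpleWAL:
    """Minimal sketch of the WAL write path (illustrative only)."""

    def __init__(self, path):
        # Append mode: every write goes to the end of the sequential log.
        self.f = open(path, "ab")

    def log(self, record: bytes):
        # Step 1: append the record (length-prefixed, an assumed format).
        self.f.write(len(record).to_bytes(4, "big") + record)

    def commit(self):
        # Step 2: flush application/OS buffers and fsync, so the log
        # records survive a crash before the commit is acknowledged.
        self.f.flush()
        os.fsync(self.f.fileno())

data = {}                      # stand-in for the in-memory buffer pool
wal = SimpleWAL("wal.log")
wal.log(b"BEGIN txn1")
wal.log(b"SET k=v")
wal.log(b"COMMIT txn1")
wal.commit()                   # step 3: success is returned only after this
data["k"] = "v"                # step 4: apply the change in memory
```

Note that the data structure update happens strictly after the fsync returns: if the process dies between `commit()` and the in-memory update, replaying `wal.log` reconstructs the change.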

Log Sequence Numbers (LSN)

Every WAL record is assigned a Log Sequence Number (LSN) — a monotonically increasing offset into the WAL file. The LSN identifies the exact position of a record in the WAL. Key uses: “the WAL is flushed up to LSN X” means all records up to X are durable; each buffer pool page header stores the page LSN (the LSN of the last WAL record that modified the page — during redo, a record is re-applied only if its LSN is greater than the page LSN, which tells recovery exactly which records the on-disk page is missing); replication uses LSNs to identify which WAL records a replica has received. The LSN is the universal “position in time” reference in PostgreSQL and similar systems.
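A toy sketch of LSN bookkeeping, assuming (as the text describes) that the LSN is simply the record's byte offset in the log; the class and field names are made up for illustration.

```python
class WAL:
    """Sketch: LSN = byte offset of the record in the log (assumed layout)."""

    def __init__(self):
        self.records = []      # (lsn, payload) pairs, in append order
        self.next_lsn = 0      # offset where the next record will start
        self.flushed_lsn = 0   # all records below this offset are durable

    def append(self, payload: bytes) -> int:
        lsn = self.next_lsn
        self.records.append((lsn, payload))
        self.next_lsn += len(payload)   # monotonically increasing
        return lsn

    def flush(self):
        # Models an fsync: everything appended so far becomes durable.
        self.flushed_lsn = self.next_lsn

wal = WAL()
lsn = wal.append(b"update page 7")
wal.flush()
# A transaction whose COMMIT record sits below flushed_lsn is durable;
# a page whose page LSN is below flushed_lsn is fully covered by the log.
```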

Crash Recovery with ARIES

ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) is the standard WAL-based recovery algorithm (used in PostgreSQL, SQL Server, IBM DB2). Recovery phases: Analysis — scan the WAL from the last checkpoint to the end; identify which transactions were in progress and which pages were dirty at the crash. Redo — replay all WAL records from the redo point; re-apply all committed and in-progress changes (ensures all committed changes are applied). Undo — reverse the changes from all in-progress transactions (those that had not committed at crash time). After undo, the database is in a consistent state with all committed transactions applied and all uncommitted transactions rolled back.
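The three phases can be demonstrated on a drastically simplified log. This is a sketch of the ARIES phase structure only — the tuple record format, the dict "database", and the in-memory undo list are assumptions; real ARIES uses page LSNs, a dirty page table, and compensation log records.

```python
# Simplified log: t1 commits, t2 is still in progress when the crash hits.
log = [
    ("BEGIN",  "t1", None, None),
    ("UPDATE", "t1", "x", 1),       # (type, txn, key, new_value)
    ("COMMIT", "t1", None, None),
    ("BEGIN",  "t2", None, None),
    ("UPDATE", "t2", "y", 2),       # t2 never commits: crash happens here
]

def recover(log):
    # Analysis: find transactions still in progress at the crash.
    active = set()
    for rec in log:
        kind, txn = rec[0], rec[1]
        if kind == "BEGIN":
            active.add(txn)
        elif kind == "COMMIT":
            active.discard(txn)

    # Redo: re-apply every change, committed or not ("repeating history").
    db, undo_log = {}, []
    for kind, txn, key, value in log:
        if kind == "UPDATE":
            undo_log.append((txn, key, db.get(key)))  # remember old value
            db[key] = value

    # Undo: roll back changes of transactions that never committed.
    for txn, key, old in reversed(undo_log):
        if txn in active:
            if old is None:
                db.pop(key, None)
            else:
                db[key] = old
    return db

print(recover(log))   # prints {'x': 1}
```

The result matches the invariant stated above: t1's committed change to `x` survives, while t2's uncommitted change to `y` is rolled back.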

Checkpointing

Without checkpoints, crash recovery requires replaying the entire WAL from the beginning — potentially gigabytes or terabytes of log. A checkpoint periodically: writes all dirty buffer pool pages to disk, records the current WAL position (checkpoint LSN) in a special checkpoint record, and flushes the checkpoint record to the WAL. After the checkpoint, WAL records before the checkpoint LSN are no longer needed for recovery (all their changes are in the data files). Checkpoints bound recovery time: recovery only needs to replay WAL from the last checkpoint. PostgreSQL checkpoints run every checkpoint_timeout seconds (default 5 minutes) or once max_wal_size of WAL has been generated; checkpoint_completion_target spreads the checkpoint's writes across the interval to smooth the I/O burst. Frequent checkpoints reduce recovery time but increase write amplification.
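A sketch of how a checkpoint bounds recovery work, using a list index as the LSN. The `Engine` class and its fields are hypothetical names for illustration; the point is that after `checkpoint()`, only records past `checkpoint_lsn` need replaying.

```python
class Engine:
    """Sketch: checkpointing bounds the WAL replay window (illustrative)."""

    def __init__(self):
        self.wal = []             # list of records; index serves as the LSN
        self.dirty_pages = {}     # page_id -> contents not yet on disk
        self.disk_pages = {}      # the "data files"
        self.checkpoint_lsn = 0   # recovery replays from here

    def write(self, page_id, contents):
        self.wal.append((page_id, contents))   # log first (WAL rule)
        self.dirty_pages[page_id] = contents   # then dirty the page

    def checkpoint(self):
        # 1. Write all dirty buffer pool pages to the data files.
        self.disk_pages.update(self.dirty_pages)
        self.dirty_pages.clear()
        # 2. Record the current WAL position; records before it are no
        #    longer needed for recovery and could be recycled.
        self.checkpoint_lsn = len(self.wal)

    def records_to_replay(self):
        return self.wal[self.checkpoint_lsn:]

e = Engine()
e.write(1, "a"); e.write(2, "b")
e.checkpoint()
e.write(1, "c")
# Recovery now replays one record instead of three.
```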

WAL for Replication

WAL-based replication streams the WAL from primary to replicas. PostgreSQL streaming replication: the replica connects to the primary as a WAL receiver; the WAL sender on the primary streams WAL records in real-time. The replica applies WAL records to its own buffer pool, staying in sync with the primary. Logical replication decodes WAL records into row-level changes (INSERT/UPDATE/DELETE) and streams them — allows replicating to different schemas or even different database systems (via logical replication slots). WAL retention: the primary retains WAL records until all replicas have confirmed receipt. wal_keep_size (PostgreSQL) controls minimum WAL retention. Replication slots prevent WAL deletion for slow replicas.
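The LSN-tracking half of WAL shipping can be sketched as below. This is a simplification with invented names (`Primary`, `stream_to`, `min_retained_lsn`) — real streaming replication goes over the wire via the walsender/walreceiver protocol — but it shows why the primary must retain WAL up to the slowest replica's confirmed position.

```python
class Primary:
    """Sketch of LSN-based WAL shipping and retention (illustrative)."""

    def __init__(self):
        self.wal = []                 # list index serves as the LSN
        self.replica_confirmed = {}   # replica_id -> last confirmed LSN

    def append(self, record):
        self.wal.append(record)

    def stream_to(self, replica_id):
        # Send every record the replica has not yet confirmed.
        start = self.replica_confirmed.get(replica_id, 0)
        batch = self.wal[start:]
        self.replica_confirmed[replica_id] = len(self.wal)
        return batch

    def min_retained_lsn(self):
        # Everything at or after the slowest replica's position must be
        # kept — the same role a replication slot plays in PostgreSQL.
        return min(self.replica_confirmed.values(), default=0)

p = Primary()
p.append("rec1"); p.append("rec2")
first = p.stream_to("replica_a")      # replica_a catches up to LSN 2
p.append("rec3")
second = p.stream_to("replica_a")     # only the new record is sent
```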

Group Commit

fsync is expensive: flushing the WAL to disk takes 1-10ms on SSDs. If each transaction fsyncs independently, maximum write throughput is 100-1000 transactions per second (1 / fsync_latency). Group commit batches multiple transactions into a single fsync: when a transaction is ready to commit, it joins the commit queue; a leader grabs the current batch, writes all their WAL records, performs one fsync, and notifies all transactions in the batch that they are committed. A single fsync amortized across 100 transactions reduces the per-transaction cost by 100x. PostgreSQL implements group commit transparently; Kafka uses a similar flush mechanism for producer acknowledgments.
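The leader/batch mechanism above can be shown with a deterministic, single-threaded sketch. Real group commit coordinates concurrent committers with a lock and wakeups; the `GroupCommitLog` class and its single `flush_batch` step are assumptions made to keep the amortization visible.

```python
class GroupCommitLog:
    """Sketch: one fsync amortized across a batch of commits (illustrative)."""

    def __init__(self):
        self.pending = []      # transactions waiting for durability
        self.fsync_count = 0   # how many (expensive) fsyncs we issued

    def commit(self, txn):
        # A committing transaction joins the queue instead of fsyncing itself.
        self.pending.append(txn)

    def flush_batch(self):
        # The "leader" takes the whole batch and performs a single fsync.
        batch, self.pending = self.pending, []
        self.fsync_count += 1
        return batch           # every transaction here can now acknowledge

wal = GroupCommitLog()
for i in range(100):
    wal.commit(f"txn{i}")
done = wal.flush_batch()
# 100 transactions made durable by a single fsync: per-transaction
# fsync cost drops by 100x, exactly the amortization described above.
```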

WAL in Non-Database Systems

WAL is not limited to databases. Kafka: each partition is an append-only log (essentially a WAL). Producers append records; consumers read from any offset. The log is the data structure, not a recovery mechanism. etcd: all writes go to the Raft log (which is a WAL) before being applied to the state machine — provides exactly-once semantics for distributed operations. RocksDB: the MemTable (in-memory write buffer) is backed by a WAL; on crash, the WAL is replayed to rebuild the MemTable. Redis AOF (Append-Only File): every write command is appended to the AOF — on restart, Redis replays the AOF to rebuild in-memory state. The pattern is universal: append to a durable sequential log before modifying in-memory state, enabling crash recovery.
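The universal pattern — append to the log first, update memory second, replay on restart — fits in a few lines. A minimal sketch in the style of a RocksDB MemTable rebuild; the `put`/`apply` helpers and the plain-list "log" are assumptions for illustration (durability via fsync is elided).

```python
def apply(state, record):
    # Apply one logged change to in-memory state.
    key, value = record
    state[key] = value

log, state = [], {}   # the durable log and the in-memory structure

def put(key, value):
    log.append((key, value))    # 1. durable append first (fsync elided)
    apply(state, (key, value))  # 2. then modify in-memory state

put("a", 1); put("b", 2); put("a", 3)

# Simulated crash: in-memory state is lost, the log survives.
recovered = {}
for record in log:
    apply(recovered, record)    # replay rebuilds the exact same state
assert recovered == state
```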
