Low Level Design: Write-Ahead Log Internals

⏱ 3 min read

A Write-Ahead Log (WAL) is the durability mechanism at the heart of most databases and storage systems. Before any data modification is applied to the actual data pages, the change is first recorded in the WAL. On crash recovery, the database replays the WAL to reconstruct the committed state. Understanding WAL internals explains how databases achieve ACID durability and support features like replication and point-in-time recovery.

WAL Write Sequence

A transaction write follows: (1) generate WAL record (operation type, table, row before-image, row after-image, transaction ID, LSN); (2) append WAL record to WAL buffer (in memory); (3) flush WAL buffer to WAL file on disk (fsync); (4) apply change to data buffer pool (in memory); (5) respond to client with commit confirmation. The critical step is step 3 before step 5: durability requires the WAL is on disk before confirming commit. The data page may still be in the buffer pool at commit time; it is written later during checkpointing.

Log Sequence Number (LSN)

Every WAL record has a monotonically increasing Log Sequence Number (LSN). The LSN identifies the position in the WAL stream. Data pages record the LSN of the last WAL record that modified them. During recovery, pages with an LSN less than the redo point are up-to-date; pages with a lower LSN need WAL replay. In PostgreSQL, the LSN is a 64-bit offset into the WAL stream. Replication uses LSNs to track replication lag: the primary's current LSN minus the replica's confirmed LSN gives the replication lag in bytes.

Checkpointing

Checkpointing writes all dirty data pages from the buffer pool to disk and records a checkpoint LSN in the WAL. During crash recovery, replay starts from the most recent checkpoint, not from the beginning of the WAL — this bounds recovery time. Without checkpointing, the WAL would grow unbounded and crash recovery would replay the entire history. In PostgreSQL, checkpoint_completion_target and checkpoint_timeout control checkpoint frequency. More frequent checkpoints reduce recovery time but increase I/O during normal operation.

Crash Recovery (ARIES)

The ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) recovery algorithm has three phases: Analysis (scan the WAL from the last checkpoint to determine which transactions were active at crash time), Redo (replay all operations from the checkpoint forward, including uncommitted transactions — restores the database to the exact state at crash), Undo (roll back all uncommitted transactions by applying compensation log records — CLRs). ARIES is the foundation of recovery in PostgreSQL, MySQL InnoDB, and SQL Server.

WAL-Based Replication

Physical (streaming) replication ships WAL records from primary to replica. The replica applies WAL records to its own data files, maintaining an identical copy of the primary. In PostgreSQL, the Walsender process on the primary streams WAL to the Walreceiver on the replica. Synchronous replication waits for the replica to acknowledge WAL receipt before confirming commit to the client — guaranteeing zero data loss failover. Asynchronous replication confirms commit immediately; replica may lag by seconds on failover.

Point-in-Time Recovery (PITR)

PITR allows restoring the database to any point in time by replaying WAL on top of a base backup. Workflow: take a base backup (full copy of the data directory), then continuously archive WAL segments to object storage (S3). To restore to time T: restore the base backup, then replay archived WAL segments up to T. This enables recovery from accidental data deletion: restore to 5 minutes before the deletion query ran. WAL archiving is the continuous component; the base backup frequency determines how far back the archive window extends.

WAL in Log-Structured Systems

LSM-based databases (RocksDB, Cassandra, LevelDB) use a WAL differently. Writes go to the WAL and an in-memory MemTable. When the MemTable is full, it is flushed to an immutable SSTable on disk. The WAL ensures durability for unflushed MemTable data. After a MemTable flush, the corresponding WAL segments can be deleted. Recovery: replay WAL to rebuild the in-memory MemTable state that was not yet flushed to SSTables at the time of crash. The WAL in LSM systems is typically simpler than in page-based databases because it only covers the MemTable, not full page-level undo.