Low Level Design: Write-Ahead Log (WAL)

The Write-Ahead Log (WAL) is the foundation of database durability and crash recovery. Every change to database state is first written to the WAL (a sequential append-only log on disk) before being applied to the actual data structures. If the database crashes mid-operation, the WAL is replayed on restart to bring the database to a consistent state. WAL enables ACID durability without requiring every write to flush all affected data structures to disk.

WAL Write Path

When a transaction commits: (1) Write all changes to the WAL as log records (BEGIN, individual data changes, COMMIT). (2) fsync the WAL to disk (flush OS write buffers — ensures the log survives a crash). (3) Return success to the client. (4) Apply the changes to the in-memory buffer pool (and eventually to the data files). The key insight: the WAL is sequential (append-only) while data file writes are random — sequential I/O is 10-100x faster than random I/O on spinning disks, and much faster even on SSDs. WAL turns all writes into sequential operations, dramatically increasing write throughput.
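The four-step write path above can be sketched in a few lines. This is an illustrative minimal sketch, not a real engine: the `SimpleWAL` class, the length-prefixed record format, and the dict standing in for the buffer pool are all assumptions for the example.

```python
import os

class SimpleWAL:
    """Minimal sketch of the WAL write path (illustrative only)."""

    def __init__(self, path):
        # Append mode: every write goes to the end of the sequential log.
        self.f = open(path, "ab")

    def log(self, record: bytes):
        # Step 1: append the record (length-prefixed, an assumed format).
        self.f.write(len(record).to_bytes(4, "big") + record)

    def commit(self):
        # Step 2: flush application/OS buffers and fsync, so the log
        # records survive a crash before the commit is acknowledged.
        self.f.flush()
        os.fsync(self.f.fileno())

data = {}                      # stand-in for the in-memory buffer pool
wal = SimpleWAL("wal.log")
wal.log(b"BEGIN txn1")
wal.log(b"SET k=v")
wal.log(b"COMMIT txn1")
wal.commit()                   # step 3: success is returned only after this
data["k"] = "v"                # step 4: apply the change in memory
```

Note that the data structure update happens strictly after the fsync returns: if the process dies between `commit()` and the in-memory update, replaying `wal.log` reconstructs the change.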

Log Sequence Numbers (LSN)

Every WAL record is assigned a Log Sequence Number (LSN) — a monotonically increasing offset into the WAL file. The LSN identifies the exact position of a record in the WAL. Key uses: “the WAL is flushed up to LSN X” means all records up to X are durable; each buffer pool page header stores the page LSN (the LSN of the last WAL record that modified the page — during redo, a record is re-applied only if its LSN is greater than the page LSN, which tells recovery exactly which records the on-disk page is missing); replication uses LSNs to identify which WAL records a replica has received. The LSN is the universal “position in time” reference in PostgreSQL and similar systems.
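A toy sketch of LSN bookkeeping, assuming (as the text describes) that the LSN is simply the record's byte offset in the log; the class and field names are made up for illustration.

```python
class WAL:
    """Sketch: LSN = byte offset of the record in the log (assumed layout)."""

    def __init__(self):
        self.records = []      # (lsn, payload) pairs, in append order
        self.next_lsn = 0      # offset where the next record will start
        self.flushed_lsn = 0   # all records below this offset are durable

    def append(self, payload: bytes) -> int:
        lsn = self.next_lsn
        self.records.append((lsn, payload))
        self.next_lsn += len(payload)   # monotonically increasing
        return lsn

    def flush(self):
        # Models an fsync: everything appended so far becomes durable.
        self.flushed_lsn = self.next_lsn

wal = WAL()
lsn = wal.append(b"update page 7")
wal.flush()
# A transaction whose COMMIT record sits below flushed_lsn is durable;
# a page whose page LSN is below flushed_lsn is fully covered by the log.
```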

Crash Recovery with ARIES

ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) is the standard WAL-based recovery algorithm (used in PostgreSQL, SQL Server, IBM DB2). Recovery phases: Analysis — scan the WAL from the last checkpoint to the end; identify which transactions were in progress and which pages were dirty at the crash. Redo — replay all WAL records from the redo point; re-apply all committed and in-progress changes (ensures all committed changes are applied). Undo — reverse the changes from all in-progress transactions (those that had not committed at crash time). After undo, the database is in a consistent state with all committed transactions applied and all uncommitted transactions rolled back.
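The three phases can be demonstrated on a drastically simplified log. This is a sketch of the ARIES phase structure only — the tuple record format, the dict "database", and the in-memory undo list are assumptions; real ARIES uses page LSNs, a dirty page table, and compensation log records.

```python
# Simplified log: t1 commits, t2 is still in progress when the crash hits.
log = [
    ("BEGIN",  "t1", None, None),
    ("UPDATE", "t1", "x", 1),       # (type, txn, key, new_value)
    ("COMMIT", "t1", None, None),
    ("BEGIN",  "t2", None, None),
    ("UPDATE", "t2", "y", 2),       # t2 never commits: crash happens here
]

def recover(log):
    # Analysis: find transactions still in progress at the crash.
    active = set()
    for rec in log:
        kind, txn = rec[0], rec[1]
        if kind == "BEGIN":
            active.add(txn)
        elif kind == "COMMIT":
            active.discard(txn)

    # Redo: re-apply every change, committed or not ("repeating history").
    db, undo_log = {}, []
    for kind, txn, key, value in log:
        if kind == "UPDATE":
            undo_log.append((txn, key, db.get(key)))  # remember old value
            db[key] = value

    # Undo: roll back changes of transactions that never committed.
    for txn, key, old in reversed(undo_log):
        if txn in active:
            if old is None:
                db.pop(key, None)
            else:
                db[key] = old
    return db

print(recover(log))   # prints {'x': 1}
```

The result matches the invariant stated above: t1's committed change to `x` survives, while t2's uncommitted change to `y` is rolled back.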

Checkpointing

Without checkpoints, crash recovery requires replaying the entire WAL from the beginning — potentially gigabytes or terabytes of log. A checkpoint periodically: writes all dirty buffer pool pages to disk, records the current WAL position (checkpoint LSN) in a special checkpoint record, and flushes the checkpoint record to the WAL. After the checkpoint, WAL records before the checkpoint LSN are no longer needed for recovery (all their changes are in the data files). Checkpoints bound recovery time: recovery only needs to replay WAL from the last checkpoint. PostgreSQL checkpoints run every checkpoint_timeout seconds (default 5 minutes) or once max_wal_size of WAL has been generated; checkpoint_completion_target spreads the checkpoint's writes across the interval to smooth the I/O burst. Frequent checkpoints reduce recovery time but increase write amplification.
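A sketch of how a checkpoint bounds recovery work, using a list index as the LSN. The `Engine` class and its fields are hypothetical names for illustration; the point is that after `checkpoint()`, only records past `checkpoint_lsn` need replaying.

```python
class Engine:
    """Sketch: checkpointing bounds the WAL replay window (illustrative)."""

    def __init__(self):
        self.wal = []             # list of records; index serves as the LSN
        self.dirty_pages = {}     # page_id -> contents not yet on disk
        self.disk_pages = {}      # the "data files"
        self.checkpoint_lsn = 0   # recovery replays from here

    def write(self, page_id, contents):
        self.wal.append((page_id, contents))   # log first (WAL rule)
        self.dirty_pages[page_id] = contents   # then dirty the page

    def checkpoint(self):
        # 1. Write all dirty buffer pool pages to the data files.
        self.disk_pages.update(self.dirty_pages)
        self.dirty_pages.clear()
        # 2. Record the current WAL position; records before it are no
        #    longer needed for recovery and could be recycled.
        self.checkpoint_lsn = len(self.wal)

    def records_to_replay(self):
        return self.wal[self.checkpoint_lsn:]

e = Engine()
e.write(1, "a"); e.write(2, "b")
e.checkpoint()
e.write(1, "c")
# Recovery now replays one record instead of three.
```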

WAL for Replication

WAL-based replication streams the WAL from primary to replicas. PostgreSQL streaming replication: the replica connects to the primary as a WAL receiver; the WAL sender on the primary streams WAL records in real-time. The replica applies WAL records to its own buffer pool, staying in sync with the primary. Logical replication decodes WAL records into row-level changes (INSERT/UPDATE/DELETE) and streams them — allows replicating to different schemas or even different database systems (via logical replication slots). WAL retention: the primary retains WAL records until all replicas have confirmed receipt. wal_keep_size (PostgreSQL) controls minimum WAL retention. Replication slots prevent WAL deletion for slow replicas.
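The LSN-tracking half of WAL shipping can be sketched as below. This is a simplification with invented names (`Primary`, `stream_to`, `min_retained_lsn`) — real streaming replication goes over the wire via the walsender/walreceiver protocol — but it shows why the primary must retain WAL up to the slowest replica's confirmed position.

```python
class Primary:
    """Sketch of LSN-based WAL shipping and retention (illustrative)."""

    def __init__(self):
        self.wal = []                 # list index serves as the LSN
        self.replica_confirmed = {}   # replica_id -> last confirmed LSN

    def append(self, record):
        self.wal.append(record)

    def stream_to(self, replica_id):
        # Send every record the replica has not yet confirmed.
        start = self.replica_confirmed.get(replica_id, 0)
        batch = self.wal[start:]
        self.replica_confirmed[replica_id] = len(self.wal)
        return batch

    def min_retained_lsn(self):
        # Everything at or after the slowest replica's position must be
        # kept — the same role a replication slot plays in PostgreSQL.
        return min(self.replica_confirmed.values(), default=0)

p = Primary()
p.append("rec1"); p.append("rec2")
first = p.stream_to("replica_a")      # replica_a catches up to LSN 2
p.append("rec3")
second = p.stream_to("replica_a")     # only the new record is sent
```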

Group Commit

fsync is expensive: flushing the WAL to disk takes 1-10ms on SSDs. If each transaction fsyncs independently, maximum write throughput is 100-1000 transactions per second (1 / fsync_latency). Group commit batches multiple transactions into a single fsync: when a transaction is ready to commit, it joins the commit queue; a leader grabs the current batch, writes all their WAL records, performs one fsync, and notifies all transactions in the batch that they are committed. A single fsync amortized across 100 transactions reduces the per-transaction cost by 100x. PostgreSQL implements group commit transparently; Kafka uses a similar flush mechanism for producer acknowledgments.
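The leader/batch mechanism above can be shown with a deterministic, single-threaded sketch. Real group commit coordinates concurrent committers with a lock and wakeups; the `GroupCommitLog` class and its single `flush_batch` step are assumptions made to keep the amortization visible.

```python
class GroupCommitLog:
    """Sketch: one fsync amortized across a batch of commits (illustrative)."""

    def __init__(self):
        self.pending = []      # transactions waiting for durability
        self.fsync_count = 0   # how many (expensive) fsyncs we issued

    def commit(self, txn):
        # A committing transaction joins the queue instead of fsyncing itself.
        self.pending.append(txn)

    def flush_batch(self):
        # The "leader" takes the whole batch and performs a single fsync.
        batch, self.pending = self.pending, []
        self.fsync_count += 1
        return batch           # every transaction here can now acknowledge

wal = GroupCommitLog()
for i in range(100):
    wal.commit(f"txn{i}")
done = wal.flush_batch()
# 100 transactions made durable by a single fsync: per-transaction
# fsync cost drops by 100x, exactly the amortization described above.
```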

WAL in Non-Database Systems

WAL is not limited to databases. Kafka: each partition is an append-only log (essentially a WAL). Producers append records; consumers read from any offset. The log is the data structure, not a recovery mechanism. etcd: all writes go to the Raft log (which is a WAL) before being applied to the state machine — provides exactly-once semantics for distributed operations. RocksDB: the MemTable (in-memory write buffer) is backed by a WAL; on crash, the WAL is replayed to rebuild the MemTable. Redis AOF (Append-Only File): every write command is appended to the AOF — on restart, Redis replays the AOF to rebuild in-memory state. The pattern is universal: append to a durable sequential log before modifying in-memory state, enabling crash recovery.
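The universal pattern — append to the log first, update memory second, replay on restart — fits in a few lines. A minimal sketch in the style of a RocksDB MemTable rebuild; the `put`/`apply` helpers and the plain-list "log" are assumptions for illustration (durability via fsync is elided).

```python
def apply(state, record):
    # Apply one logged change to in-memory state.
    key, value = record
    state[key] = value

log, state = [], {}   # the durable log and the in-memory structure

def put(key, value):
    log.append((key, value))    # 1. durable append first (fsync elided)
    apply(state, (key, value))  # 2. then modify in-memory state

put("a", 1); put("b", 2); put("a", 3)

# Simulated crash: in-memory state is lost, the log survives.
recovered = {}
for record in log:
    apply(recovered, record)    # replay rebuilds the exact same state
assert recovered == state
```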
