What Is a Write-Ahead Log and Why Does It Exist?
A write-ahead log (WAL) is the foundation of durability in every serious database engine. Before any change is applied to a data page, a record describing that change is appended to the log and flushed to durable storage. Only after the log record is safely on disk does the engine acknowledge the commit to the client. This single rule — log before apply — is what lets a database survive crashes without losing committed data.
WAL serves three distinct purposes: durability (committed transactions survive crashes), recoverability (the engine can reconstruct any state by replaying log records), and replication (standbys can apply the same log stream the primary generates). Understanding the internals of WAL design reveals why these guarantees are achievable at high throughput.
Log Record Format and LSN
Every WAL record carries a Log Sequence Number (LSN) — a monotonically increasing 64-bit integer that identifies the record's position in the log stream. LSNs are used everywhere: to identify checkpoint positions, to track how far a standby has applied, and to determine which records need replay on recovery.
A minimal WAL record contains:
- lsn — position in the log byte stream (BIGINT)
- transaction_id — which transaction produced this change
- record_type — INSERT, UPDATE, DELETE, COMMIT, ABORT, CHECKPOINT
- relation_oid — which table or index was modified
- block_number — which page within the relation
- before_image — page content before the change (for undo)
- after_image — page content after the change (for redo)
- checksum — CRC32 or similar to detect torn writes
Physical WAL records a full page image on the first modification after a checkpoint (full-page writes) to protect against torn page writes when a system crashes mid-write on a 4KB OS page that does not align with an 8KB database page.
Sequential Write Performance
WAL is written sequentially — every new record is appended to the end of the current log segment. This matters enormously for throughput. Sequential disk writes saturate bandwidth; random writes are limited by seek latency on spinning disks and write amplification on SSDs. Because data page modifications are random (a write might touch page 1, page 50000, page 3 in sequence), the WAL absorbs the write traffic as sequential I/O and defers the random data-page writes to checkpoint time.
Log segments are typically fixed-size files (16 MB in PostgreSQL, 64 MB in MySQL InnoDB). When a segment fills, the engine opens the next one. Older segments are either archived or recycled.
fsync and Group Commit
Durability requires that WAL bytes reach durable storage — not just the OS page cache — before acknowledging a commit. This is achieved with fsync() or fdatasync(). The cost of one fsync per transaction limits throughput on high-latency storage.
Group commit batches multiple concurrent transactions into a single fsync call. Transactions that arrive while an fsync is in flight are queued; when the fsync completes, all queued transactions are acknowledged simultaneously. This trades individual commit latency for throughput — a single 5ms fsync can acknowledge hundreds of commits that stacked up while waiting. Group commit is the primary mechanism databases use to sustain thousands of write TPS on spinning disks.
Checkpoint Protocol
Without checkpoints, recovery would require replaying the entire log from the beginning of time. A checkpoint periodically flushes all dirty data pages from the buffer pool to disk and writes a checkpoint record to the WAL. Recovery only needs to replay log records from the most recent checkpoint LSN forward.
The checkpoint record stores the redo LSN — the earliest LSN from which replay must begin, which is the oldest LSN of any dirty page that was in the buffer pool when the checkpoint started.
Two checkpoint types appear in real systems:
- Full checkpoint — flush all dirty pages. Safe and clean, but causes a large I/O spike.
- Incremental/fast checkpoint — flush dirty pages gradually over a time window, spreading the I/O load. PostgreSQL's
checkpoint_completion_target(default 0.9) spreads the flush over 90% of the checkpoint interval.
Checkpoint frequency is a tradeoff: more frequent checkpoints reduce recovery time but increase background I/O. Less frequent checkpoints allow longer recovery windows after a crash.
Crash Recovery via Log Replay
On restart after a crash, the engine performs ARIES-style recovery in three phases:
- Analysis — scan the log from the last checkpoint to reconstruct which transactions were active at crash time and which pages were dirty.
- Redo — replay all log records from the redo LSN forward, applying both committed and in-progress transaction changes to bring pages to their state at the crash point.
- Undo — roll back all transactions that were active at crash time (no COMMIT record found), using before-images to reverse their changes.
The redo phase is idempotent — applying a WAL record to a page that already reflects that change (because the data page was flushed before the crash) produces no incorrect result, because each record carries the LSN of the page it modified and the engine skips records whose LSN is already reflected in the page header.
Log Shipping and Physical Replication
Log shipping is the simplest replication model: completed WAL segment files are copied from the primary to the standby (via rsync, S3, or a custom archiver). The standby applies segments in order. The replication lag equals at least one full segment (16 MB default), which can represent seconds of data.
Streaming replication eliminates segment-boundary lag by streaming WAL records over a TCP connection as they are generated. The standby's receiver process writes records to a local WAL buffer and signals the startup process to apply them. The primary tracks each standby's flush_lsn and apply_lsn for lag monitoring.
WAL archiving copies completed segments to S3 or NFS for point-in-time recovery (PITR). Combined with a base backup, a full archive allows restoring the database to any LSN in the archive window.
SQL Data Model
-- WAL record storage (typically in the WAL files themselves; this is a logical representation)
CREATE TABLE WALRecord (
lsn BIGINT PRIMARY KEY,
transaction_id BIGINT NOT NULL,
record_type VARCHAR(32) NOT NULL, -- INSERT, UPDATE, DELETE, COMMIT, ABORT, CHECKPOINT
relation_oid OID,
block_number BIGINT,
before_image BYTEA,
after_image BYTEA,
checksum INT NOT NULL,
recorded_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Checkpoint tracking
CREATE TABLE CheckpointRecord (
lsn BIGINT PRIMARY KEY,
redo_lsn BIGINT NOT NULL, -- earliest LSN needed for recovery
dirty_page_count INT NOT NULL,
checkpoint_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Standby replication state
CREATE TABLE WALShipState (
standby_id UUID PRIMARY KEY,
apply_lsn BIGINT NOT NULL, -- last LSN applied
flush_lsn BIGINT NOT NULL, -- last LSN flushed to standby disk
last_seen_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_walrecord_txn ON WALRecord(transaction_id);
CREATE INDEX idx_walshipstate_apply ON WALShipState(apply_lsn);
Python Implementation Sketch
import struct, hashlib, time, io
from dataclasses import dataclass, field
from typing import Optional
WAL_SEGMENT_SIZE = 16 * 1024 * 1024 # 16 MB
@dataclass
class WALRecord:
lsn: int
transaction_id: int
record_type: str
relation_oid: Optional[int]
block_number: Optional[int]
before_image: bytes
after_image: bytes
def serialize(self) -> bytes:
data = f"{self.lsn}|{self.transaction_id}|{self.record_type}".encode()
checksum = struct.pack(">I", int(hashlib.crc32(data) & 0xFFFFFFFF))
return data + checksum
class WALWriter:
def __init__(self, path: str):
self.path = path
self.current_lsn = 0
self.buffer: list[WALRecord] = []
self.fh = open(path, "ab")
def append_wal_record(self, txn_id: int, record_type: str,
relation: Optional[int], before: bytes, after: bytes) -> int:
self.current_lsn += 1
rec = WALRecord(self.current_lsn, txn_id, record_type,
relation, None, before, after)
self.buffer.append(rec)
return self.current_lsn
def flush_wal(self, group_commit: bool = True) -> int:
"""Write buffer to disk and fsync. Returns highest flushed LSN."""
if not self.buffer:
return self.current_lsn
for rec in self.buffer:
self.fh.write(rec.serialize())
self.fh.flush()
if not group_commit:
import os; os.fsync(self.fh.fileno())
flushed_lsn = self.buffer[-1].lsn
self.buffer.clear()
return flushed_lsn
def perform_checkpoint(self) -> int:
"""Flush WAL, write checkpoint record, return checkpoint LSN."""
redo_lsn = self.flush_wal()
self.current_lsn += 1
ckpt_lsn = self.current_lsn
ckpt_data = f"CHECKPOINT|{ckpt_lsn}|redo={redo_lsn}".encode()
self.fh.write(ckpt_data)
self.fh.flush()
import os; os.fsync(self.fh.fileno())
return ckpt_lsn
def recover_from_checkpoint(self, checkpoint_lsn: int) -> list[WALRecord]:
"""Read WAL from checkpoint_lsn forward and return records to replay."""
replayed: list[WALRecord] = []
# In practice: open WAL segments, seek to checkpoint_lsn, read forward
print(f"Replaying WAL from LSN {checkpoint_lsn} ...")
return replayed
Frequently Asked Questions
WAL vs Filesystem Journaling
Filesystem journaling protects filesystem metadata against corruption. WAL protects database logical data against loss of committed transactions. Both operate simultaneously on a running database server — the journal keeps the filesystem consistent; WAL keeps the database transactionally consistent.
How Group Commit Works
Group commit batches fsync calls from multiple concurrent transactions into a single disk flush. On a disk with 5ms fsync latency, individual commits cap at 200 TPS. With group commit batching 500 transactions per fsync, the same disk can sustain far higher commit rates at the cost of up to one fsync-latency of additional commit delay.
Checkpoint Frequency Tradeoff
More frequent checkpoints reduce crash recovery time but increase background I/O. Less frequent checkpoints reduce background write pressure but extend recovery time after a crash. The right interval depends on write workload intensity, available I/O bandwidth, and the acceptable recovery time objective (RTO).
Log Shipping vs Streaming Replication
Log shipping copies completed WAL segment files (16 MB default), creating replication lag of potentially seconds to minutes. Streaming replication sends WAL records over TCP as they are generated, reducing lag to milliseconds. Streaming requires a persistent connection; log shipping works asynchronously and suits geographically distant standbys.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Atlassian Interview Guide