Write-Ahead Log Storage Low-Level Design: Log Format, Checkpoint, Recovery, and Log Shipping

⏱ 10 min read

What Is a Write-Ahead Log and Why Does It Exist?

A write-ahead log (WAL) is the foundation of durability in every serious database engine. Before any change is applied to a data page, a record describing that change is appended to the log and flushed to durable storage. Only after the log record is safely on disk does the engine acknowledge the commit to the client. This single rule — log before apply — is what lets a database survive crashes without losing committed data.

WAL serves three distinct purposes: durability (committed transactions survive crashes), recoverability (the engine can reconstruct any state by replaying log records), and replication (standbys can apply the same log stream the primary generates). Understanding the internals of WAL design reveals why these guarantees are achievable at high throughput.

Log Record Format and LSN

Every WAL record carries a Log Sequence Number (LSN) — a monotonically increasing 64-bit integer that identifies the record's position in the log stream. LSNs are used everywhere: to identify checkpoint positions, to track how far a standby has applied, and to determine which records need replay on recovery.

A minimal WAL record contains:

lsn — position in the log byte stream (BIGINT)
transaction_id — which transaction produced this change
record_type — INSERT, UPDATE, DELETE, COMMIT, ABORT, CHECKPOINT
relation_oid — which table or index was modified
block_number — which page within the relation
before_image — page content before the change (for undo)
after_image — page content after the change (for redo)
checksum — CRC32 or similar to detect torn writes

Physical WAL records a full page image on the first modification after a checkpoint (full-page writes) to protect against torn page writes when a system crashes mid-write on a 4KB OS page that does not align with an 8KB database page.

Sequential Write Performance

WAL is written sequentially — every new record is appended to the end of the current log segment. This matters enormously for throughput. Sequential disk writes saturate bandwidth; random writes are limited by seek latency on spinning disks and write amplification on SSDs. Because data page modifications are random (a write might touch page 1, page 50000, page 3 in sequence), the WAL absorbs the write traffic as sequential I/O and defers the random data-page writes to checkpoint time.

Log segments are typically fixed-size files (16 MB in PostgreSQL, 64 MB in MySQL InnoDB). When a segment fills, the engine opens the next one. Older segments are either archived or recycled.

fsync and Group Commit

Durability requires that WAL bytes reach durable storage — not just the OS page cache — before acknowledging a commit. This is achieved with fsync() or fdatasync(). The cost of one fsync per transaction limits throughput on high-latency storage.

Group commit batches multiple concurrent transactions into a single fsync call. Transactions that arrive while an fsync is in flight are queued; when the fsync completes, all queued transactions are acknowledged simultaneously. This trades individual commit latency for throughput — a single 5ms fsync can acknowledge hundreds of commits that stacked up while waiting. Group commit is the primary mechanism databases use to sustain thousands of write TPS on spinning disks.

Checkpoint Protocol

Without checkpoints, recovery would require replaying the entire log from the beginning of time. A checkpoint periodically flushes all dirty data pages from the buffer pool to disk and writes a checkpoint record to the WAL. Recovery only needs to replay log records from the most recent checkpoint LSN forward.

The checkpoint record stores the redo LSN — the earliest LSN from which replay must begin, which is the oldest LSN of any dirty page that was in the buffer pool when the checkpoint started.

Two checkpoint types appear in real systems:

Full checkpoint — flush all dirty pages. Safe and clean, but causes a large I/O spike.
Incremental/fast checkpoint — flush dirty pages gradually over a time window, spreading the I/O load. PostgreSQL's checkpoint_completion_target (default 0.9) spreads the flush over 90% of the checkpoint interval.

Checkpoint frequency is a tradeoff: more frequent checkpoints reduce recovery time but increase background I/O. Less frequent checkpoints allow longer recovery windows after a crash.

Crash Recovery via Log Replay

On restart after a crash, the engine performs ARIES-style recovery in three phases:

Analysis — scan the log from the last checkpoint to reconstruct which transactions were active at crash time and which pages were dirty.
Redo — replay all log records from the redo LSN forward, applying both committed and in-progress transaction changes to bring pages to their state at the crash point.
Undo — roll back all transactions that were active at crash time (no COMMIT record found), using before-images to reverse their changes.

The redo phase is idempotent — applying a WAL record to a page that already reflects that change (because the data page was flushed before the crash) produces no incorrect result, because each record carries the LSN of the page it modified and the engine skips records whose LSN is already reflected in the page header.

Log Shipping and Physical Replication

Log shipping is the simplest replication model: completed WAL segment files are copied from the primary to the standby (via rsync, S3, or a custom archiver). The standby applies segments in order. The replication lag equals at least one full segment (16 MB default), which can represent seconds of data.

Streaming replication eliminates segment-boundary lag by streaming WAL records over a TCP connection as they are generated. The standby's receiver process writes records to a local WAL buffer and signals the startup process to apply them. The primary tracks each standby's flush_lsn and apply_lsn for lag monitoring.

WAL archiving copies completed segments to S3 or NFS for point-in-time recovery (PITR). Combined with a base backup, a full archive allows restoring the database to any LSN in the archive window.

SQL Data Model

-- WAL record storage (typically in the WAL files themselves; this is a logical representation)
CREATE TABLE WALRecord (
    lsn             BIGINT PRIMARY KEY,
    transaction_id  BIGINT NOT NULL,
    record_type     VARCHAR(32) NOT NULL,  -- INSERT, UPDATE, DELETE, COMMIT, ABORT, CHECKPOINT
    relation_oid    OID,
    block_number    BIGINT,
    before_image    BYTEA,
    after_image     BYTEA,
    checksum        INT NOT NULL,
    recorded_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Checkpoint tracking
CREATE TABLE CheckpointRecord (
    lsn             BIGINT PRIMARY KEY,
    redo_lsn        BIGINT NOT NULL,       -- earliest LSN needed for recovery
    dirty_page_count INT NOT NULL,
    checkpoint_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Standby replication state
CREATE TABLE WALShipState (
    standby_id      UUID PRIMARY KEY,
    apply_lsn       BIGINT NOT NULL,       -- last LSN applied
    flush_lsn       BIGINT NOT NULL,       -- last LSN flushed to standby disk
    last_seen_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_walrecord_txn ON WALRecord(transaction_id);
CREATE INDEX idx_walshipstate_apply ON WALShipState(apply_lsn);

Python Implementation Sketch

import struct, hashlib, time, io
from dataclasses import dataclass, field
from typing import Optional

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # 16 MB

@dataclass
class WALRecord:
    lsn: int
    transaction_id: int
    record_type: str
    relation_oid: Optional[int]
    block_number: Optional[int]
    before_image: bytes
    after_image: bytes

    def serialize(self) -> bytes:
        data = f"{self.lsn}|{self.transaction_id}|{self.record_type}".encode()
        checksum = struct.pack(">I", int(hashlib.crc32(data) & 0xFFFFFFFF))
        return data + checksum

class WALWriter:
    def __init__(self, path: str):
        self.path = path
        self.current_lsn = 0
        self.buffer: list[WALRecord] = []
        self.fh = open(path, "ab")

    def append_wal_record(self, txn_id: int, record_type: str,
                          relation: Optional[int], before: bytes, after: bytes) -> int:
        self.current_lsn += 1
        rec = WALRecord(self.current_lsn, txn_id, record_type,
                        relation, None, before, after)
        self.buffer.append(rec)
        return self.current_lsn

    def flush_wal(self, group_commit: bool = True) -> int:
        """Write buffer to disk and fsync. Returns highest flushed LSN."""
        if not self.buffer:
            return self.current_lsn
        for rec in self.buffer:
            self.fh.write(rec.serialize())
        self.fh.flush()
        if not group_commit:
            import os; os.fsync(self.fh.fileno())
        flushed_lsn = self.buffer[-1].lsn
        self.buffer.clear()
        return flushed_lsn

    def perform_checkpoint(self) -> int:
        """Flush WAL, write checkpoint record, return checkpoint LSN."""
        redo_lsn = self.flush_wal()
        self.current_lsn += 1
        ckpt_lsn = self.current_lsn
        ckpt_data = f"CHECKPOINT|{ckpt_lsn}|redo={redo_lsn}".encode()
        self.fh.write(ckpt_data)
        self.fh.flush()
        import os; os.fsync(self.fh.fileno())
        return ckpt_lsn

    def recover_from_checkpoint(self, checkpoint_lsn: int) -> list[WALRecord]:
        """Read WAL from checkpoint_lsn forward and return records to replay."""
        replayed: list[WALRecord] = []
        # In practice: open WAL segments, seek to checkpoint_lsn, read forward
        print(f"Replaying WAL from LSN {checkpoint_lsn} ...")
        return replayed

Frequently Asked Questions

WAL vs Filesystem Journaling

Filesystem journaling protects filesystem metadata against corruption. WAL protects database logical data against loss of committed transactions. Both operate simultaneously on a running database server — the journal keeps the filesystem consistent; WAL keeps the database transactionally consistent.

How Group Commit Works

Group commit batches fsync calls from multiple concurrent transactions into a single disk flush. On a disk with 5ms fsync latency, individual commits cap at 200 TPS. With group commit batching 500 transactions per fsync, the same disk can sustain far higher commit rates at the cost of up to one fsync-latency of additional commit delay.

Checkpoint Frequency Tradeoff

More frequent checkpoints reduce crash recovery time but increase background I/O. Less frequent checkpoints reduce background write pressure but extend recovery time after a crash. The right interval depends on write workload intensity, available I/O bandwidth, and the acceptable recovery time objective (RTO).

Log Shipping vs Streaming Replication

Log shipping copies completed WAL segment files (16 MB default), creating replication lag of potentially seconds to minutes. Streaming replication sends WAL records over TCP as they are generated, reducing lag to milliseconds. Streaming requires a persistent connection; log shipping works asynchronously and suits geographically distant standbys.