What Is a Write-Ahead Log and Why Does It Exist?
A write-ahead log (WAL) is the foundation of durability in every serious database engine. Before any change is applied to a data page, a record describing that change is appended to the log and flushed to durable storage. Only after the log record is safely on disk does the engine acknowledge the commit to the client. This single rule — log before apply — is what lets a database survive crashes without losing committed data.
WAL serves three distinct purposes: durability (committed transactions survive crashes), recoverability (the engine can reconstruct any state by replaying log records), and replication (standbys can apply the same log stream the primary generates). Understanding the internals of WAL design reveals why these guarantees are achievable at high throughput.
Log Record Format and LSN
Every WAL record carries a Log Sequence Number (LSN) — a monotonically increasing 64-bit integer that identifies the record's position in the log stream. LSNs are used everywhere: to identify checkpoint positions, to track how far a standby has applied, and to determine which records need replay on recovery.
A minimal WAL record contains:
- lsn — position in the log byte stream (BIGINT)
- transaction_id — which transaction produced this change
- record_type — INSERT, UPDATE, DELETE, COMMIT, ABORT, CHECKPOINT
- relation_oid — which table or index was modified
- block_number — which page within the relation
- before_image — page content before the change (for undo)
- after_image — page content after the change (for redo)
- checksum — CRC32 or similar to detect torn writes
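These fields can be packed into a self-describing binary record. A minimal sketch with an illustrative layout — the field widths, type codes, and CRC placement here are assumptions for demonstration, not any engine's actual on-disk format:

```python
import struct
import zlib

# Illustrative fixed header: lsn (8 bytes), txn_id (8), type code (1),
# relation_oid (4), block_number (8), before/after image lengths (4 each).
HEADER = struct.Struct(">QQBIQII")

RECORD_TYPES = {"INSERT": 1, "UPDATE": 2, "DELETE": 3,
                "COMMIT": 4, "ABORT": 5, "CHECKPOINT": 6}

def pack_record(lsn, txn_id, rtype, rel, block, before, after):
    header = HEADER.pack(lsn, txn_id, RECORD_TYPES[rtype],
                         rel, block, len(before), len(after))
    body = header + before + after
    # CRC32 trailer lets recovery detect a torn or partial write.
    return body + struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF)

def verify_record(raw: bytes) -> bool:
    body, crc = raw[:-4], struct.unpack(">I", raw[-4:])[0]
    return zlib.crc32(body) & 0xFFFFFFFF == crc
```

Recovery reads records forward and stops at the first checksum failure, which marks the end of the valid log.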
Physical WAL includes a full page image on the first modification of a page after each checkpoint (full-page writes). This protects against torn pages: a crash mid-write can leave a page half-updated, because the OS typically writes in 4 KB units that do not align with an 8 KB database page.
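The decision of when to emit a full page image can be sketched as a per-checkpoint-cycle set of already-logged pages. This is a simplification — real engines derive the decision from the page LSN and the checkpoint's redo LSN rather than an explicit set, and the class name here is illustrative:

```python
class FPWTracker:
    """Decide when a record needs a full page image (sketch)."""
    def __init__(self):
        self.checkpoint_lsn = 0
        self.logged_since_ckpt: set[tuple[int, int]] = set()  # (relation, block)

    def on_checkpoint(self, lsn: int) -> None:
        self.checkpoint_lsn = lsn
        # After a checkpoint, the next touch of any page needs a full image.
        self.logged_since_ckpt.clear()

    def needs_full_page_image(self, relation: int, block: int) -> bool:
        key = (relation, block)
        first_touch = key not in self.logged_since_ckpt
        self.logged_since_ckpt.add(key)
        return first_touch
```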
Sequential Write Performance
WAL is written sequentially — every new record is appended to the end of the current log segment. This matters enormously for throughput. Sequential disk writes saturate bandwidth; random writes are limited by seek latency on spinning disks and write amplification on SSDs. Because data page modifications are random (a write might touch page 1, page 50000, page 3 in sequence), the WAL absorbs the write traffic as sequential I/O and defers the random data-page writes to checkpoint time.
Log segments are typically fixed-size files (16 MB by default in PostgreSQL; InnoDB instead writes a fixed set of redo log files circularly). When a segment fills, the engine opens the next one. Older segments are either archived or recycled.
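Mapping an LSN to its segment file follows directly from the fixed segment size. A sketch, using a zero-padded hex naming scheme that is illustrative (not PostgreSQL's actual three-part segment names):

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # 16 MB, PostgreSQL's default

def locate(lsn: int) -> tuple[str, int]:
    """Map a byte-offset LSN to (segment file name, offset within segment)."""
    segment_no, offset = divmod(lsn, SEGMENT_SIZE)
    return f"{segment_no:016X}", offset
```

Because LSNs are byte offsets in a single logical stream, this mapping also tells a recycler which files precede a given redo LSN.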
fsync and Group Commit
Durability requires that WAL bytes reach durable storage — not just the OS page cache — before acknowledging a commit. This is achieved with fsync() or fdatasync(). The cost of one fsync per transaction limits throughput on high-latency storage.
Group commit batches multiple concurrent transactions into a single fsync call. Transactions that arrive while an fsync is in flight are queued; when the fsync completes, all queued transactions are acknowledged simultaneously. This trades individual commit latency for throughput — a single 5ms fsync can acknowledge hundreds of commits that stacked up while waiting. Group commit is the primary mechanism databases use to sustain thousands of write TPS on spinning disks.
Checkpoint Protocol
Without checkpoints, recovery would require replaying the entire log from the beginning of time. A checkpoint periodically flushes all dirty data pages from the buffer pool to disk and writes a checkpoint record to the WAL. Recovery only needs to replay log records from the most recent checkpoint LSN forward.
The checkpoint record stores the redo LSN — the earliest LSN from which replay must begin, which is the oldest LSN of any dirty page that was in the buffer pool when the checkpoint started.
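A sketch of that computation, assuming an in-memory dirty page table that maps (relation, block) to the rec LSN of the record that first dirtied the page (the function and parameter names are illustrative):

```python
def compute_redo_lsn(checkpoint_start_lsn: int,
                     dirty_pages: dict[tuple[int, int], int]) -> int:
    """Redo LSN = the oldest rec LSN among dirty pages, or the checkpoint's
    own position if the buffer pool is clean. Replay from any earlier point
    would be wasted work; from any later point, unsafe."""
    if not dirty_pages:
        return checkpoint_start_lsn
    return min(min(dirty_pages.values()), checkpoint_start_lsn)
```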
Two checkpoint types appear in real systems:
- Full checkpoint — flush all dirty pages. Safe and clean, but causes a large I/O spike.
- Incremental/fast checkpoint — flush dirty pages gradually over a time window, spreading the I/O load. PostgreSQL's checkpoint_completion_target (default 0.9) spreads the flush over 90% of the checkpoint interval.
Checkpoint frequency is a tradeoff: more frequent checkpoints reduce recovery time but increase background I/O, while less frequent checkpoints reduce background write pressure but extend the recovery window after a crash.
Crash Recovery via Log Replay
On restart after a crash, the engine performs ARIES-style recovery in three phases:
- Analysis — scan the log from the last checkpoint to reconstruct which transactions were active at crash time and which pages were dirty.
- Redo — replay all log records from the redo LSN forward, applying both committed and in-progress transaction changes to bring pages to their state at the crash point.
- Undo — roll back all transactions that were active at crash time (no COMMIT record found), using before-images to reverse their changes.
The redo phase is idempotent: applying a WAL record to a page that already reflects that change (because the data page was flushed before the crash) causes no harm. Each page header stores the LSN of the last record applied to it, and redo skips any record whose LSN is at or below the page's LSN.
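The skip rule can be sketched in a few lines. This is a simplification — real pages carry the LSN in an on-disk page header, and the comparison happens before applying each redo record:

```python
from dataclasses import dataclass

@dataclass
class Page:
    lsn: int     # LSN of the last record applied to this page
    data: bytes

def redo(page: Page, record_lsn: int, after_image: bytes) -> Page:
    """Apply a redo record only if the page does not already reflect it.
    This comparison is what makes replay idempotent."""
    if record_lsn <= page.lsn:
        return page  # already applied before the crash; skip
    return Page(lsn=record_lsn, data=after_image)
```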
Log Shipping and Physical Replication
Log shipping is the simplest replication model: completed WAL segment files are copied from the primary to the standby (via rsync, S3, or a custom archiver). The standby applies segments in order. The replication lag equals at least one full segment (16 MB default), which can represent seconds of data.
Streaming replication eliminates segment-boundary lag by streaming WAL records over a TCP connection as they are generated. The standby's receiver process writes records to a local WAL buffer and signals the startup process to apply them. The primary tracks each standby's flush_lsn and apply_lsn for lag monitoring.
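Given the flush_lsn and apply_lsn reported by each standby, lag monitoring is simple arithmetic. A sketch treating LSNs as byte offsets in the WAL stream (the function and key names are illustrative):

```python
def replication_lag(primary_lsn: int, standby_flush_lsn: int,
                    standby_apply_lsn: int) -> dict[str, int]:
    """Byte lag between a primary and one standby:
    - flush lag: WAL the standby has not yet made durable locally
    - apply lag: WAL the standby has not yet replayed into its pages"""
    return {
        "flush_lag_bytes": primary_lsn - standby_flush_lsn,
        "apply_lag_bytes": primary_lsn - standby_apply_lsn,
    }
```

Apply lag is always at least the flush lag, since a standby can only replay records it has already received and flushed.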
WAL archiving copies completed segments to S3 or NFS for point-in-time recovery (PITR). Combined with a base backup, a full archive allows restoring the database to any LSN in the archive window.
SQL Data Model
-- WAL record storage (typically in the WAL files themselves; this is a logical representation)
CREATE TABLE WALRecord (
lsn BIGINT PRIMARY KEY,
transaction_id BIGINT NOT NULL,
record_type VARCHAR(32) NOT NULL, -- INSERT, UPDATE, DELETE, COMMIT, ABORT, CHECKPOINT
relation_oid OID,
block_number BIGINT,
before_image BYTEA,
after_image BYTEA,
checksum BIGINT NOT NULL, -- CRC32 value; BIGINT avoids signed 32-bit overflow
recorded_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Checkpoint tracking
CREATE TABLE CheckpointRecord (
lsn BIGINT PRIMARY KEY,
redo_lsn BIGINT NOT NULL, -- earliest LSN needed for recovery
dirty_page_count INT NOT NULL,
checkpoint_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Standby replication state
CREATE TABLE WALShipState (
standby_id UUID PRIMARY KEY,
apply_lsn BIGINT NOT NULL, -- last LSN applied
flush_lsn BIGINT NOT NULL, -- last LSN flushed to standby disk
last_seen_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_walrecord_txn ON WALRecord(transaction_id);
CREATE INDEX idx_walshipstate_apply ON WALShipState(apply_lsn);
Python Implementation Sketch
import os
import struct
import zlib
from dataclasses import dataclass
from typing import Optional

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # 16 MB

@dataclass
class WALRecord:
    lsn: int
    transaction_id: int
    record_type: str
    relation_oid: Optional[int]
    block_number: Optional[int]
    before_image: bytes
    after_image: bytes

    def serialize(self) -> bytes:
        # Simplified wire format: header fields, then both images, then a
        # CRC32 trailer so recovery can detect torn or partial writes.
        header = f"{self.lsn}|{self.transaction_id}|{self.record_type}|".encode()
        body = header + self.before_image + self.after_image
        return body + struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF)

class WALWriter:
    def __init__(self, path: str):
        self.path = path
        self.current_lsn = 0
        self.buffer: list[WALRecord] = []
        self.fh = open(path, "ab")

    def append_wal_record(self, txn_id: int, record_type: str,
                          relation: Optional[int], before: bytes,
                          after: bytes) -> int:
        self.current_lsn += 1
        rec = WALRecord(self.current_lsn, txn_id, record_type,
                        relation, None, before, after)
        self.buffer.append(rec)
        return self.current_lsn

    def flush_wal(self) -> int:
        """Write the buffer to disk and fsync once. Returns the highest
        flushed LSN. Flushing a multi-record buffer with a single fsync is
        group commit: one flush makes every buffered transaction durable."""
        if not self.buffer:
            return self.current_lsn
        for rec in self.buffer:
            self.fh.write(rec.serialize())
        self.fh.flush()
        os.fsync(self.fh.fileno())  # durability point: bytes are on disk
        flushed_lsn = self.buffer[-1].lsn
        self.buffer.clear()
        return flushed_lsn

    def perform_checkpoint(self) -> int:
        """Flush WAL, write a checkpoint record, return its LSN. Simplified:
        a real engine would also flush dirty data pages and record the oldest
        dirty page's rec LSN as the redo LSN, not the flush position."""
        redo_lsn = self.flush_wal()
        self.current_lsn += 1
        ckpt_lsn = self.current_lsn
        self.fh.write(f"CHECKPOINT|{ckpt_lsn}|redo={redo_lsn}".encode())
        self.fh.flush()
        os.fsync(self.fh.fileno())
        return ckpt_lsn

    def recover_from_checkpoint(self, checkpoint_lsn: int) -> list[WALRecord]:
        """Read WAL from checkpoint_lsn forward and return records to replay.
        In practice: open the WAL segments, seek to checkpoint_lsn, and read
        records forward, validating each checksum."""
        replayed: list[WALRecord] = []
        print(f"Replaying WAL from LSN {checkpoint_lsn} ...")
        return replayed
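The recover_from_checkpoint stub only returns records to replay; the analysis phase that decides what to undo can be sketched separately. This assumes records are reduced to (lsn, txn_id, record_type) tuples, a simplification of the full record format:

```python
def analyze(records: list[tuple[int, int, str]]) -> set[int]:
    """ARIES analysis sketch: given records scanned forward from the last
    checkpoint, return the loser transactions — those with changes but no
    COMMIT/ABORT record — which the undo phase must roll back."""
    active: set[int] = set()
    for _lsn, txn, rtype in records:
        if rtype in ("COMMIT", "ABORT"):
            active.discard(txn)
        elif rtype != "CHECKPOINT":
            active.add(txn)
    return active
```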
Frequently Asked Questions
WAL vs Filesystem Journaling
Filesystem journaling protects filesystem metadata against corruption. WAL protects database logical data against loss of committed transactions. Both operate simultaneously on a running database server — the journal keeps the filesystem consistent; WAL keeps the database transactionally consistent.
How Group Commit Works
Group commit batches fsync calls from multiple concurrent transactions into a single disk flush. On a disk with 5ms fsync latency, individual commits cap at 200 TPS. With group commit batching 500 transactions per fsync, the same disk can sustain far higher commit rates at the cost of up to one fsync-latency of additional commit delay.
Checkpoint Frequency Tradeoff
More frequent checkpoints reduce crash recovery time but increase background I/O. Less frequent checkpoints reduce background write pressure but extend recovery time after a crash. The right interval depends on write workload intensity, available I/O bandwidth, and the acceptable recovery time objective (RTO).
Log Shipping vs Streaming Replication
Log shipping copies completed WAL segment files (16 MB default), creating replication lag of potentially seconds to minutes. Streaming replication sends WAL records over TCP as they are generated, reducing lag to milliseconds. Streaming requires a persistent connection; log shipping works asynchronously and suits geographically distant standbys.
When Is WAL Truncated?
Segments whose records all precede the latest checkpoint's redo LSN are no longer needed for crash recovery. Once the checkpoint completes, the engine can remove or recycle them, provided WAL archiving has already copied them to the archive store if point-in-time recovery is required.