Core Concept: How MVCC Works
Multi-Version Concurrency Control (MVCC) solves a fundamental database problem: how to allow readers and writers to operate concurrently without blocking each other. The key insight is to maintain multiple versions of each row simultaneously rather than having a single mutable copy protected by locks.
In MVCC, readers never block writers and writers never block readers. A reader sees a consistent snapshot of the database as of a specific point in time — it never sees a partially-written transaction or a row being modified by another transaction. This eliminates read-write lock contention entirely, which is the dominant source of concurrency bottlenecks in traditional locking systems.
Each transaction operates as if it has its own private copy of the database. When a row is updated, the old version is preserved alongside the new version. Readers that started before the update see the old version; readers that started after see the new version. This snapshot semantics is what makes MVCC so powerful for workloads with many concurrent readers.
The tradeoff is storage and garbage collection overhead: old versions accumulate and must eventually be reclaimed. But for most workloads, the concurrency benefits far outweigh the cleanup cost.
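The core idea — keep every version, answer reads from a snapshot — can be sketched in a few lines. This is a deliberately simplified toy model (a logical commit timestamp standing in for transaction IDs, versions in a plain list rather than heap pages or undo logs):

```python
# Toy MVCC store: each key maps to a list of (commit_ts, value) versions.
class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> [(commit_ts, value), ...], oldest first
        self.clock = 0       # logical commit timestamp

    def write(self, key, value):
        """Commit a new version; the old version is preserved, not overwritten."""
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot_read(self, key, snapshot_ts):
        """Return the newest version committed at or before snapshot_ts."""
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None  # key did not exist at snapshot time

store = MVCCStore()
store.write("balance", 100)          # commit_ts = 1
ts = store.clock                     # a reader takes its snapshot here
store.write("balance", 250)          # commit_ts = 2: writer does not block reader
print(store.snapshot_read("balance", ts))           # old snapshot still sees 100
print(store.snapshot_read("balance", store.clock))  # fresh snapshot sees 250
```

The second write never disturbs the first reader — exactly the readers-don't-block-writers property described above.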
Version Chain Design
Each row in an MVCC database has a version chain: a linked list of all versions of that row, from the current version back to progressively older versions. The structure varies by database system but the logical model is consistent.
Each version contains three critical pieces of information:
- xmin: the transaction ID that created this version (by INSERT or UPDATE)
- xmax: the transaction ID that deleted or superseded this version (by DELETE or UPDATE); null if this is the current version
- data: the actual column values for this version
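The three fields and the chain they form can be sketched directly (a toy model; field names follow the xmin/xmax convention above, and `prev` stands in for the physical version link):

```python
# Sketch of a row version record with the three fields above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RowVersion:
    xmin: int                            # txid that created this version
    xmax: Optional[int]                  # txid that deleted/superseded it; None if current
    data: dict                           # column values
    prev: Optional["RowVersion"] = None  # link to the older version

def update(current: RowVersion, new_data: dict, txid: int) -> RowVersion:
    """An UPDATE writes a new version and stamps the old one's xmax."""
    current.xmax = txid
    return RowVersion(xmin=txid, xmax=None, data=new_data, prev=current)

v1 = RowVersion(xmin=100, xmax=None, data={"qty": 5})
v2 = update(v1, {"qty": 4}, txid=105)
assert v1.xmax == 105 and v2.prev is v1   # chain: v2 (current) -> v1 (old)
```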
In PostgreSQL, old and new versions are both stored as tuples in the same heap (table file). An UPDATE physically inserts a new tuple and sets the old tuple’s xmax. The heap can contain many dead tuples that are no longer visible to any transaction — these must be reclaimed by VACUUM. Versions are linked via t_ctid pointers: each superseded tuple’s t_ctid points forward to the tuple that replaced it.
In MySQL InnoDB, the architecture is inverted: the heap always contains the latest version of each row. Older versions are stored in a separate undo log tablespace. Each row has a rollback pointer (DB_ROLL_PTR) that points into the undo log chain. To reconstruct an old version, InnoDB follows the rollback pointer and applies undo records in reverse.
InnoDB also stores DB_TRX_ID (the transaction ID that last modified the row) directly in each row. This allows a quick check: if the row’s transaction ID is older than the reader’s snapshot, the row is immediately visible without consulting the undo log.
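The reconstruction walk can be sketched as follows. This is a toy model, not InnoDB's actual record format: a dict stands in for the heap row, a logical timestamp for the snapshot, and `old_values` for the undo record's before-image:

```python
# Toy model of InnoDB-style version reconstruction: the "heap" holds only
# the latest row; older states are rebuilt by walking an undo chain.
def reconstruct(row, undo_log, snapshot_ts):
    """Follow rollback pointers, applying undo records until the row's
    trx_id is visible to the snapshot (here simplified to <= snapshot_ts)."""
    row = dict(row)  # never mutate the in-place heap row
    while row["trx_id"] > snapshot_ts:
        undo = undo_log[row["roll_ptr"]]    # follow the rollback pointer
        row.update(undo["old_values"])      # re-apply the prior column values
        row["trx_id"] = undo["old_trx_id"]
        row["roll_ptr"] = undo["old_roll_ptr"]
    return row

undo_log = {
    7: {"old_values": {"qty": 5}, "old_trx_id": 100, "old_roll_ptr": None},
}
latest = {"qty": 4, "trx_id": 105, "roll_ptr": 7}   # the heap always holds this
print(reconstruct(latest, undo_log, snapshot_ts=102)["qty"])  # rebuilds qty = 5
```

A snapshot taken at timestamp 102 cannot see trx 105's write, so one undo record is applied; a snapshot at 105 or later sees the heap row as-is, which is the quick-check shortcut described above.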
Snapshot Isolation: Visibility Rules
When a transaction begins under snapshot isolation, the database records a snapshot: the set of transaction IDs that are currently in progress (active) at that moment. This snapshot is the foundation of all visibility decisions.
A row version is visible to a transaction if and only if both conditions hold:
- xmin committed before the snapshot: the transaction that created this version had already committed when the snapshot was taken
- xmax is null OR xmax not committed before the snapshot: the transaction that deleted this version either hasn’t happened yet, is still in progress, or aborted
Formally: version V is visible if xmin committed AND xmin < snapshot_xid AND xmin NOT IN active_set AND (xmax IS NULL OR xmax >= snapshot_xid OR xmax IN active_set OR xmax aborted).
This means a transaction can safely ignore any row that was created after it started (xmin too large) or deleted before it started (xmax committed and old enough). The snapshot is immutable for the transaction’s lifetime — it sees a frozen view of the world.
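The visibility rule translates almost mechanically into code. A minimal sketch, assuming the snapshot is represented as a high-water-mark xid plus the active and aborted sets:

```python
def is_visible(xmin, xmax, snapshot_xid, active_set, aborted_set):
    """Snapshot-isolation visibility check, following the rule above.
    snapshot_xid: high-water-mark xid when the snapshot was taken;
    active_set: txids in progress at snapshot time; aborted_set: known aborts."""
    # The creator must have committed before the snapshot was taken.
    if xmin >= snapshot_xid or xmin in active_set or xmin in aborted_set:
        return False
    # The deleter must be absent, in the future, still in progress, or aborted.
    if xmax is None:
        return True
    return xmax >= snapshot_xid or xmax in active_set or xmax in aborted_set

# Version created by committed txid 90, deleted by txid 120 which is still
# running at snapshot time: the deletion is invisible, so the row is seen.
print(is_visible(90, 120, snapshot_xid=130, active_set={120}, aborted_set=set()))  # True
```

If txid 120 had instead committed before the snapshot, the same call with an empty active set returns False: the reader sees the row as deleted.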
One important detail: the snapshot must include the list of in-progress transactions, not just a single high-water-mark transaction ID. Transaction 900 might still be running when transaction 1000 takes its snapshot; with only the high-water mark, the rule xmin < 1000 would incorrectly treat 900’s uncommitted changes as visible. The active set is what lets the reader reject them.
Read Committed vs Repeatable Read
MVCC naturally supports multiple isolation levels by varying when the snapshot is taken:
Read Committed takes a new snapshot at the start of each SQL statement. This means a transaction can see different data in different statements — if another transaction commits between two SELECT statements, the second SELECT will see the new data. This is the default isolation level in PostgreSQL (InnoDB, by contrast, defaults to repeatable read).
Read Committed allows non-repeatable reads: reading the same row twice in one transaction can yield different values if a concurrent transaction commits in between. It also allows phantom reads: a re-executed range query can return different rows.
Repeatable Read takes a single snapshot at transaction start and uses it for all statements in the transaction. The transaction sees a completely stable view of the database — no matter how many other transactions commit during its lifetime, it sees only the state at its start time.
Repeatable read eliminates non-repeatable reads and phantoms (in PostgreSQL’s implementation; the SQL standard only requires eliminating non-repeatable reads at this level). PostgreSQL implements both levels with the same MVCC mechanism — the only difference is snapshot timing. InnoDB’s repeatable read also uses a consistent snapshot but with gap locks to prevent phantoms.
The practical implication: read committed is better for OLTP workloads where you want fresh data and can tolerate some inconsistency within a transaction. Repeatable read is better for reporting queries that need a consistent view across multiple statements.
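The distinction is purely about when the snapshot timestamp is captured, which a toy versioned row makes concrete (logical timestamps stand in for transaction IDs):

```python
# Toy illustration of snapshot timing. committed[] holds (commit_ts, value)
# versions of one row; read(ts) returns the value visible at timestamp ts.
committed = [(1, "old")]

def read(ts):
    return max((v for v in committed if v[0] <= ts), key=lambda v: v[0])[1]

clock = 1
txn_start = clock                 # repeatable read: snapshot fixed at txn start

# Statement 1: both isolation levels see the same data.
rc1 = read(clock)                 # read committed: fresh per-statement snapshot
rr1 = read(txn_start)

clock += 1
committed.append((clock, "new"))  # another transaction commits in between

# Statement 2.
rc2 = read(clock)                 # read committed sees the new commit...
rr2 = read(txn_start)             # ...repeatable read still sees txn-start state
print(rc1, rc2)   # old new   <- non-repeatable read
print(rr1, rr2)   # old old   <- stable view
```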
Write Conflicts and First-Committer-Wins
MVCC eliminates read-write conflicts but cannot eliminate write-write conflicts — two transactions trying to update the same row simultaneously require arbitration.
When a transaction tries to update a row, it checks the row’s xmax:
- If xmax is null or from an aborted transaction: safe to proceed, write the new version
- If xmax is from a committed transaction: the row was already updated; behavior depends on isolation level
- If xmax is from an in-progress transaction: block and wait until the other transaction commits or aborts
Under First-Committer-Wins (the rule under snapshot isolation), if the first writer commits, the second writer sees a committed xmax and must either abort (PostgreSQL’s behavior at repeatable read/serializable) or re-read and re-apply the update on the new version (MySQL InnoDB’s behavior at repeatable read).
PostgreSQL raises ERROR: could not serialize access due to concurrent update and forces the application to retry the transaction. This is correct behavior under snapshot isolation — the second transaction’s snapshot is now stale relative to the committed data.
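The standard application-side response is a retry loop with backoff. A sketch, with a hypothetical `SerializationFailure` exception standing in for the driver error raised on SQLSTATE 40001 (the transaction body itself is a stand-in):

```python
# Generic retry loop for first-committer-wins aborts.
import random
import time

class SerializationFailure(Exception):
    """Stand-in for the driver exception raised on a serialization error."""

def with_retry(run_txn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return run_txn()
        except SerializationFailure:
            # Back off with jitter so both retrying txns don't collide again.
            time.sleep(random.uniform(0, 0.01 * 2 ** attempt))
    raise SerializationFailure(f"gave up after {max_attempts} attempts")

attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure()   # first two tries lose the race
    return "committed"

print(with_retry(flaky_txn))  # committed (on the third attempt)
```

The key design point is that the whole transaction is re-run, not just the failing statement: the stale snapshot must be discarded and a fresh one taken.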
Write-write conflicts are relatively rare in well-designed applications that shard work by entity ID, but they are the dominant source of transaction aborts in high-contention scenarios like inventory systems or seat booking.
Vacuum and Garbage Collection
The fundamental maintenance challenge in MVCC is dead tuple accumulation. Every UPDATE creates a new tuple and leaves the old one in place (in PostgreSQL’s heap model). Every DELETE leaves a dead tuple. These dead tuples consume disk space and slow down sequential scans and index lookups.
PostgreSQL’s VACUUM process reclaims dead tuples. A tuple is safe to remove when its xmax is from a committed transaction AND no active transaction has a snapshot old enough to see the dead version. VACUUM scans heap pages, removes dead tuples, and updates the free space map. It does not return space to the OS by default — VACUUM FULL rewrites the table to reclaim disk space but requires an exclusive lock.
autovacuum runs automatically based on configurable thresholds (e.g., vacuum when 20% of rows are dead). Falling behind on autovacuum causes table bloat — a serious operational issue for high-write tables.
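The reclaim test itself is small; the operational difficulty is that one long-running transaction pins `oldest_active_xmin` and blocks cleanup table-wide. A sketch (sets of txids stand in for the real commit-status machinery):

```python
# Sketch of VACUUM's reclaim test: a dead version can be removed only when
# no remaining snapshot is old enough to see it. oldest_active_xmin is the
# snapshot horizon of the oldest still-running transaction.
def can_reclaim(xmax, committed_xids, oldest_active_xmin):
    """Dead tuple is reclaimable iff its deleter committed AND every
    active snapshot was taken after that deletion."""
    return (xmax is not None
            and xmax in committed_xids
            and xmax < oldest_active_xmin)

committed = {100, 105}
# Deleted by committed txid 100; the oldest running txn started at 103:
print(can_reclaim(100, committed, oldest_active_xmin=103))  # True
# Same tuple, but a long-running txn from before the delete pins it:
print(can_reclaim(100, committed, oldest_active_xmin=98))   # False
```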
Transaction ID wraparound is a critical PostgreSQL-specific concern. PostgreSQL uses 32-bit transaction IDs, which wrap around every ~2.1 billion transactions. If a transaction ID wraps around to an ID that looks "in the future" to old tuples, those tuples would become invisible — catastrophic data loss. VACUUM FREEZE marks old tuples with a special frozen transaction ID that is always considered committed and visible, preventing wraparound. PostgreSQL will forcibly stop accepting new transactions if wraparound becomes imminent (within ~10 million transactions), requiring emergency VACUUM.
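Why wraparound flips visibility can be seen from the modular comparison itself: each xid effectively sees about 2 billion xids "behind" it and 2 billion "ahead". A sketch of the modulo-2^32 ordering (simplified; the real comparison also special-cases reserved and frozen xids):

```python
# Sketch of modulo-2^32 transaction ID comparison, the root cause of the
# wraparound hazard: a sufficiently old, unfrozen xid eventually lands in
# the "future" half of the circle and its tuples vanish from view.
def xid_precedes(a, b):
    """True if xid a is logically older than xid b, modulo 2^32."""
    return ((b - a) % 2**32) < 2**31

print(xid_precedes(100, 2_000_000_000))   # True: 100 is still in the past
print(xid_precedes(100, 2_200_000_000))   # False: 100 now looks "in the future"
```

Freezing sidesteps the comparison entirely: a frozen tuple is treated as committed and visible regardless of where the xid counter currently sits.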
InnoDB’s purge thread removes old undo records instead of heap tuples. Since InnoDB always stores the latest version in-place, purging is cleaner — old undo records are deleted from the undo log tablespace once no active transaction needs them. InnoDB is less susceptible to table bloat but undo log growth under long-running transactions is a common operational problem.
PostgreSQL vs MySQL InnoDB Implementation Details
The architectural difference between PostgreSQL and InnoDB has significant operational implications:
PostgreSQL (heap-based MVCC):
- Dead tuples accumulate in heap pages alongside live tuples
- Sequential scans must skip dead tuples, increasing I/O
- Index entries for dead tuples remain until VACUUM cleans the heap (index-only scans check the visibility map first)
- Table bloat is a real operational concern; aggressive autovacuum tuning is required for high-write tables
- VACUUM can run concurrently with reads and most writes (only VACUUM FULL requires exclusive lock)
- Hot Standby replicas can hold back VACUUM via hot_standby_feedback or unconsumed replication slots — a common cause of runaway bloat
MySQL InnoDB (undo log MVCC):
- Heap always contains latest version; no table bloat from dead tuples
- Long-running transactions cause undo log growth (ibdata1 or undo tablespace bloat)
- Undo logs are not reclaimable until the purge thread processes them
- Rollback of large transactions is expensive — must apply all undo records in reverse
- Purge lag under high write load causes undo log to grow unboundedly
For read-heavy workloads, InnoDB’s model is generally more efficient. For write-heavy workloads with many updates, PostgreSQL’s HOT (Heap-Only Tuple) optimization can update rows without touching indexes when the row fits in the same page.
Serializable Snapshot Isolation (SSI)
Standard snapshot isolation prevents dirty reads, non-repeatable reads, and phantoms, but it does not provide full serializability. The canonical counterexample is the write skew anomaly: two transactions each read a set of rows, make a decision based on what they see, and write to non-overlapping rows in a way that violates an application invariant — yet neither transaction’s writes conflict with the other’s reads as far as MVCC is concerned.
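The classic concrete instance is the on-call doctors example, easy to reproduce with dict copies standing in for snapshots. The invariant is "at least one doctor on call"; each transaction checks it against its own snapshot, then writes a different row:

```python
# Write-skew demo: both txns read consistent snapshots, write disjoint
# rows, and no write-write conflict occurs -- yet the invariant breaks.
db = {"alice": "on_call", "bob": "on_call"}

snap_t1 = dict(db)   # T1's snapshot
snap_t2 = dict(db)   # T2's snapshot (taken before either writes)

if sum(v == "on_call" for v in snap_t1.values()) >= 2:
    db["alice"] = "off_call"   # T1 writes only alice's row
if sum(v == "on_call" for v in snap_t2.values()) >= 2:
    db["bob"] = "off_call"     # T2 writes only bob's row

# Different rows were written, so first-committer-wins never fires:
print(db)  # {'alice': 'off_call', 'bob': 'off_call'} -- nobody on call
```

No serial order of T1 and T2 could produce this outcome: whichever ran second would have seen only one doctor on call and declined to leave.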
Serializable Snapshot Isolation (SSI), introduced by Cahill et al. (2008) and implemented in PostgreSQL since version 9.1, extends MVCC to provide true serializability without the overhead of two-phase locking.
SSI tracks rw-anti-dependencies: cases where transaction T1 reads a version that T2 subsequently modifies (T1 reads old, T2 writes new). The dangerous structure is two consecutive rw-anti-dependencies: T1 rw-depends on T2, and T2 rw-depends on T3 (where T1 and T3 may be the same transaction). Every non-serializable execution contains such a structure, so aborting on it conservatively catches all serializability violations, at the cost of some false-positive aborts.
When SSI detects a dangerous rw-anti-dependency cycle, it aborts one of the transactions (the pivot) with ERROR: could not serialize access due to read/write dependencies among transactions. The application must retry.
The overhead of SSI is tracking read sets and rw-dependencies. PostgreSQL uses SIREAD locks (predicate locks) to track what each transaction has read without actually blocking. The abort rate under SSI depends heavily on the workload — for most OLTP workloads, it is low enough that SSI is practical.
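The bookkeeping reduces to an edge graph plus a pivot check. A toy sketch (real SSI attaches SIREAD locks to pages and predicates; here each rw-anti-dependency is recorded as an explicit edge):

```python
# Toy SSI bookkeeping: record rw-anti-dependency edges (reader -> writer)
# and flag a "pivot" -- a transaction with both an incoming and an
# outgoing rw edge -- which is the one aborted to break the structure.
from collections import defaultdict

rw_in = defaultdict(set)    # txn -> txns that rw-depend on it
rw_out = defaultdict(set)   # txn -> txns it rw-depends on

def record_rw(reader, writer):
    """reader saw a version that writer later replaced: reader -rw-> writer."""
    rw_out[reader].add(writer)
    rw_in[writer].add(reader)

def find_pivots():
    """Txns with both incoming and outgoing rw edges (dangerous pivots)."""
    return [t for t in set(rw_in) | set(rw_out) if rw_in[t] and rw_out[t]]

record_rw("T1", "T2")   # T1 read something T2 overwrote
record_rw("T2", "T3")   # T2 read something T3 overwrote
print(find_pivots())    # ['T2'] -- aborting T2 breaks the dangerous structure
```

This check is conservative: it fires on the dangerous structure without proving a full dependency cycle exists, which is why some aborted transactions would actually have been serializable.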