MVCC (Multi-Version Concurrency Control) is the foundation of concurrent transaction processing in PostgreSQL, MySQL InnoDB, Oracle, and most modern distributed databases. Instead of overwriting data in place, every write creates a new row version. Readers access old versions without blocking writers, and writers do not block readers. This post covers the complete low-level design.
MVCC Fundamentals
The core invariant: reads never block writes, writes never block reads. This is achieved by retaining multiple versions of each row simultaneously. A reader picks the version that was current at its snapshot time. A writer creates a new version without touching what readers are currently accessing.
Two storage approaches dominate:
- Heap-based (PostgreSQL): new row versions are written to the table heap; old versions remain in place as dead tuples until vacuumed
- Delta-based / undo log (MySQL InnoDB): the current version is stored in the clustered index; old versions are reconstructed by following an undo log chain
Version Chain — Heap-Based (PostgreSQL)
Each row in PostgreSQL has hidden system columns:
- xmin: transaction ID that inserted this row version (created it)
- xmax: transaction ID that deleted or updated this row version (0/null if current)
On UPDATE: a new tuple is inserted with xmin = current_xid; the old tuple gets xmax = current_xid. On DELETE: only xmax is set on the existing tuple — no physical removal.
Multiple versions of the same logical row coexist in the heap. The visibility rule (derived from the snapshot) determines which version a transaction sees.
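To make the bookkeeping concrete, here is a minimal sketch in Python. `HeapTuple` is a hypothetical stand-in for PostgreSQL's tuple header (which is really a C struct); the list plays the role of the heap:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HeapTuple:
    xmin: int            # xid that created this version
    xmax: Optional[int]  # xid that deleted/superseded it (None = current)
    data: dict

heap: list[HeapTuple] = []

# INSERT by xid 100
heap.append(HeapTuple(xmin=100, xmax=None, data={"balance": 50}))

# UPDATE by xid 200: old version gets xmax, new version is appended
heap[0].xmax = 200
heap.append(HeapTuple(xmin=200, xmax=None, data={"balance": 75}))

# Both versions now coexist in the heap; visibility decides which one a reader sees
assert [(t.xmin, t.xmax) for t in heap] == [(100, 200), (200, None)]
```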
Version Chain — Delta-Based (MySQL InnoDB)
InnoDB stores the current row version in the clustered index (B-tree by primary key). Each row has a DB_TRX_ID (creating transaction) and a DB_ROLL_PTR pointer into the undo log. Older versions are not stored in the main table — they are stored as undo records and reconstructed by following the DB_ROLL_PTR chain backwards.
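A rough sketch of old-version reconstruction, assuming a hypothetical `UndoRecord` type and a deliberately simplified visibility check (real InnoDB compares `DB_TRX_ID` against a read view, not a single cutoff value):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UndoRecord:
    trx_id: int                   # transaction that created the *previous* version
    old_values: dict              # column values as they were before the change
    prev: Optional["UndoRecord"]  # older undo record (DB_ROLL_PTR analogue)

def reconstruct(current: dict, trx_id: int, undo: Optional[UndoRecord],
                visible_to: int) -> Optional[dict]:
    """Walk the undo chain backwards, applying inverse deltas, until we reach
    a version whose creator is visible (trx_id <= visible_to stands in for
    the real read-view check)."""
    row, creator = dict(current), trx_id
    while creator > visible_to:
        if undo is None:
            return None               # no sufficiently old version exists
        row.update(undo.old_values)   # apply inverse delta
        creator, undo = undo.trx_id, undo.prev
    return row

# History: trx 10 wrote balance=50, trx 20 updated to 75, trx 30 to 90
u1 = UndoRecord(trx_id=10, old_values={"balance": 50}, prev=None)
u2 = UndoRecord(trx_id=20, old_values={"balance": 75}, prev=u1)
assert reconstruct({"balance": 90}, 30, u2, visible_to=20) == {"balance": 75}
```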
Trade-off: the heap-based approach keeps every version directly readable from the heap, but dead versions bloat the table until vacuum removes them; the delta-based approach keeps the main table compact, but every old-version read pays for undo log traversal.
Visibility Rule
A row version created by transaction x is visible to a transaction with snapshot S if:
- x is committed (not just started)
- x committed before S was taken (i.e., x < S.snapshot_xid and x not in S.active_xids)
- The version has not been deleted by a transaction visible to S (i.e., xmax is null or xmax is not visible to S)
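The rule can be written as a standalone predicate. This is a sketch that ignores subtransactions, the reader's own uncommitted writes, and xid wraparound:

```python
from typing import Optional

def visible(xmin: int, xmax: Optional[int], committed: set[int],
            snapshot_xid: int, active_at_snapshot: frozenset[int]) -> bool:
    """Snapshot-visibility check for one row version (simplified)."""
    def committed_before_snapshot(xid: int) -> bool:
        return (xid < snapshot_xid
                and xid not in active_at_snapshot
                and xid in committed)

    if not committed_before_snapshot(xmin):
        return False   # creating transaction not visible to this snapshot
    if xmax is not None and committed_before_snapshot(xmax):
        return False   # the deletion is already visible
    return True

committed = {100, 200}
# Snapshot taken at xid 150 while xid 140 was still in flight
assert visible(100, None, committed, 150, frozenset({140})) is True
assert visible(100, 200, committed, 150, frozenset({140})) is True  # delete by 200 happened after the snapshot
assert visible(140, None, committed | {140}, 150, frozenset({140})) is False  # active at snapshot time
```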
Write Path in Detail
INSERT: create a new heap tuple with xmin = current_xid, xmax = 0. Update any affected indexes to point to the new tuple.
UPDATE: mark the existing tuple with xmax = current_xid; insert a new tuple with xmin = current_xid and updated data. Index entries for unchanged columns can reuse the old index tuple (PostgreSQL HOT — Heap-Only Tuple optimization).
DELETE: set xmax = current_xid on the existing tuple. No immediate physical removal.
Hot Row Problem
A “hot row” is a row updated at high frequency — e.g., a counter, balance, or queue size. Each update creates a new version. Readers of older snapshots must traverse the version chain to find their visible version. With thousands of updates per second, the chain can grow to millions of versions, making reads O(chain_length) expensive.
Mitigations:
- Counter aggregation tables: instead of updating one row, insert increment rows and aggregate on read (append-only design)
- Mini-transactions: batch multiple updates into fewer transactions to reduce chain growth
- Sharded counters: distribute the hot row across multiple rows; aggregate on read
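The sharded-counter idea can be sketched in a few lines; here an in-memory list stands in for N database rows, each of which would accumulate its own (much shorter) version chain:

```python
import random
import threading

class ShardedCounter:
    """Spread a hot counter across N shards; each increment touches one
    shard, so per-row version chains grow roughly N times slower (sketch:
    in a real database the shards would be N rows keyed by shard id)."""
    def __init__(self, num_shards: int = 8):
        self._shards = [0] * num_shards
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def increment(self, amount: int = 1) -> None:
        i = random.randrange(len(self._shards))  # pick a shard at random
        with self._locks[i]:
            self._shards[i] += amount

    def value(self) -> int:
        return sum(self._shards)  # aggregate on read

c = ShardedCounter(num_shards=4)
for _ in range(1000):
    c.increment()
assert c.value() == 1000
```

The read becomes slightly more expensive (N rows instead of one), which is the usual trade for removing a single-row write hotspot.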
Index Updates and Bloat
Every new row version may require new index entries. In PostgreSQL, each UPDATE creates a new heap tuple and potentially new index tuples in every index on the table. Dead index entries accumulate alongside dead heap tuples. Vacuum removes dead heap tuples and the corresponding dead index entries. High update rates without adequate vacuum frequency cause index bloat — large, sparse B-tree pages that waste disk space and slow index scans.
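A toy model of the effect: each update of an indexed row adds an index entry, and only the newest entry points at a live heap tuple. `IndexEntry` is a hypothetical stand-in for a B-tree leaf item:

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    key: int
    heap_tid: int  # points at a heap tuple

# 1000 updates of one logical row: each adds an index entry,
# and every heap tuple except the newest (tid 999) is dead
live_tids = {999}
index = [IndexEntry(key=42, heap_tid=tid) for tid in range(1000)]

bloat_fraction = sum(e.heap_tid not in live_tids for e in index) / len(index)
assert bloat_fraction == 0.999

# A vacuum pass drops entries whose heap tuples are dead
index = [e for e in index if e.heap_tid in live_tids]
assert len(index) == 1
```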
SQL Schema
-- Row version chain (for design illustration; columns are named xmin_xid/xmax_xid
-- because "xmin"/"xmax" collide with PostgreSQL's system column names)
CREATE TABLE RowVersionChain (
    id BIGSERIAL PRIMARY KEY,
    row_id BIGINT NOT NULL,
    xmin_xid BIGINT NOT NULL, -- creating transaction
    xmax_xid BIGINT,          -- deleting/updating transaction (NULL = current)
    data JSONB NOT NULL,
    undo_pointer BIGINT REFERENCES RowVersionChain(id) -- previous version
);

CREATE INDEX idx_rvc_row_xmin ON RowVersionChain(row_id, xmin_xid DESC);
CREATE INDEX idx_rvc_xmax ON RowVersionChain(xmax_xid) WHERE xmax_xid IS NOT NULL;

-- Vacuum state tracking per table
CREATE TABLE VacuumState (
    table_name TEXT PRIMARY KEY,
    oldest_horizon_xid BIGINT, -- oldest snapshot still needing old versions
    last_vacuum_at TIMESTAMPTZ,
    dead_tuples_estimate BIGINT NOT NULL DEFAULT 0
);
Python Implementation
from dataclasses import dataclass
from typing import Optional
import threading

@dataclass
class RowTuple:
    row_id: int
    xmin: int
    xmax: Optional[int]
    data: dict
    undo_pointer: Optional["RowTuple"] = None  # delta-based chain link

class MVCCTable:
    """Simplified heap-based MVCC table storage (RowTuple.undo_pointer
    sketches the delta-based chain link but is unused here)."""

    def __init__(self, name: str):
        self.name = name
        # heap: row_id -> list of versions, oldest first (heap-based)
        self._heap: dict[int, list[RowTuple]] = {}
        self._committed: set[int] = set()
        self._active: set[int] = set()
        self._lock = threading.Lock()

    def read_visible_version(
        self,
        row_id: int,
        snapshot_xid: int,
        active_at_snapshot: frozenset[int],
    ) -> Optional[RowTuple]:
        """Return the row version visible to the given snapshot."""
        with self._lock:  # copy under the lock so writers can't mutate mid-scan
            versions = list(self._heap.get(row_id, []))
        # Traverse from newest to oldest (heap-based)
        for v in reversed(versions):
            if self._version_visible(v, snapshot_xid, active_at_snapshot):
                return v
        return None

    def _version_visible(
        self,
        v: RowTuple,
        snap_xid: int,
        active_at_snap: frozenset[int],
    ) -> bool:
        # xmin must be committed and predate the snapshot
        if not self._committed_before(v.xmin, snap_xid, active_at_snap):
            return False
        # xmax must not be committed before the snapshot
        if v.xmax is not None and self._committed_before(v.xmax, snap_xid, active_at_snap):
            return False
        return True

    def _committed_before(self, xid: int, snap_xid: int, active_at_snap: frozenset[int]) -> bool:
        if xid >= snap_xid:
            return False
        if xid in active_at_snap:
            return False
        return xid in self._committed

    def write_new_version(self, row_id: int, data: dict, xid: int) -> None:
        """INSERT or UPDATE: create new version, mark old current version with xmax."""
        with self._lock:
            versions = self._heap.setdefault(row_id, [])
            for v in versions:
                if v.xmax is None:
                    v.xmax = xid  # mark old version as superseded
            versions.append(RowTuple(row_id=row_id, xmin=xid, xmax=None, data=data))

    def delete_version(self, row_id: int, xid: int) -> bool:
        """DELETE: set xmax on the current version."""
        with self._lock:
            for v in self._heap.get(row_id, []):
                if v.xmax is None:
                    v.xmax = xid
                    return True
            return False

    def vacuum_table(self, horizon_xid: int) -> dict:
        """Remove dead tuples whose xmax is committed and below the vacuum horizon."""
        removed_tuples = 0
        with self._lock:
            for row_id, versions in list(self._heap.items()):
                keep = [v for v in versions
                        if not (v.xmax is not None
                                and v.xmax in self._committed
                                and v.xmax < horizon_xid)]
                removed_tuples += len(versions) - len(keep)
                if keep:
                    self._heap[row_id] = keep
                else:
                    del self._heap[row_id]
        return {"removed_tuples": removed_tuples}

    def dead_tuple_count(self) -> int:
        """Count tuples with xmax set (dead or pending vacuum)."""
        with self._lock:
            return sum(
                1 for versions in self._heap.values()
                for v in versions
                if v.xmax is not None
            )

    def chain_length(self, row_id: int) -> int:
        """Return version chain length for a row (hot row diagnostic)."""
        with self._lock:
            return len(self._heap.get(row_id, []))
Heap vs Delta Version Storage Trade-offs
Heap-based (PostgreSQL): writes are fast (append new tuple); reads of old versions may require scanning multiple heap pages; vacuum is expensive (it must scan for dead tuples, though the visibility map lets it skip all-visible pages); table bloat is visible and measurable.
Delta-based (InnoDB): the current version is always the primary B-tree entry (fast current reads); old version reads require undo log traversal; undo log is separate from the main table (no table bloat); undo log can become a bottleneck under high update rates.
Vacuum tuning: PostgreSQL autovacuum thresholds (autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor) control when vacuum runs per table. For high-update tables, lower the scale factor (e.g., 0.01 instead of 0.2) so vacuum runs after 1% of rows are dead rather than 20%.
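The trigger condition follows the documented formula: autovacuum runs once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples. A quick check of what that means at scale:

```python
def autovacuum_trigger_point(reltuples: int,
                             vacuum_threshold: int = 50,
                             scale_factor: float = 0.2) -> int:
    """Dead-tuple count at which autovacuum triggers, per PostgreSQL's
    documented formula (defaults shown are the shipped defaults)."""
    return int(vacuum_threshold + scale_factor * reltuples)

# On a 10M-row table, the defaults wait for ~2M dead tuples...
assert autovacuum_trigger_point(10_000_000) == 2_000_050
# ...while scale_factor=0.01 triggers after ~100k
assert autovacuum_trigger_point(10_000_000, scale_factor=0.01) == 100_050
```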