Multi-Version Concurrency Control (MVCC) is the concurrency mechanism used by PostgreSQL, MySQL InnoDB, Oracle, and most modern databases to achieve high concurrency without read locks. Instead of blocking readers while a writer updates a row, MVCC maintains multiple versions of each row simultaneously. Readers see a consistent snapshot from the past; writers create new versions. This allows reads and writes to proceed in parallel without blocking each other.
How MVCC Works
When a transaction begins, it receives a transaction ID (txid, monotonically increasing). Every row version has two hidden system columns: xmin (the txid that created this version) and xmax (the txid that deleted/updated this version, or zero if still current). A transaction sees only row versions where xmin belongs to a transaction that committed before its snapshot, and xmax is either unset, uncommitted, or belongs to a transaction that committed after its snapshot. When a row is updated, the old version is not overwritten: a new row version is created (new xmin = current txid), and the old version is marked deleted (xmax = current txid). The old version remains in the table until VACUUM removes it, once every transaction that could still see it has ended.
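The visibility rule above can be sketched in a few lines of Python. This is a deliberate simplification of PostgreSQL's actual snapshot logic: commit status is modeled as a plain set of committed txids, and subtleties like in-progress transaction lists and subtransactions are omitted.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RowVersion:
    xmin: int            # txid that created this version
    xmax: Optional[int]  # txid that deleted/updated it; None if still current
    data: str

def visible(row: RowVersion, my_txid: int, committed: set) -> bool:
    """Simplified MVCC visibility: a version is visible if its creator
    committed before my snapshot, and no deleter committed before it."""
    created = row.xmin in committed and row.xmin < my_txid
    deleted = (row.xmax is not None and row.xmax in committed
               and row.xmax < my_txid)
    return created and not deleted

# txid 5 updated the row: old version gets xmax=5, new version gets xmin=5
old = RowVersion(xmin=2, xmax=5, data="v1")
new = RowVersion(xmin=5, xmax=None, data="v2")

assert not visible(old, my_txid=7, committed={2, 5})  # txid 7 sees the update
assert visible(new, my_txid=7, committed={2, 5})
assert visible(old, my_txid=4, committed={2})         # snapshot predates txid 5
assert not visible(new, my_txid=4, committed={2})
```

Note how both versions coexist in the table: which one a reader sees depends only on its snapshot, not on any lock.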
Isolation Levels
PostgreSQL accepts all four SQL-standard isolation levels but implements three distinct behaviors. READ UNCOMMITTED: treated as READ COMMITTED, since PostgreSQL's MVCC never exposes uncommitted data. READ COMMITTED: each statement within a transaction sees the latest committed versions as of statement start — different statements in the same transaction may see different versions. REPEATABLE READ: the snapshot is taken at the start of the first statement; all statements in the transaction see the same consistent snapshot. SERIALIZABLE (SSI — Serializable Snapshot Isolation): detects read-write conflicts and aborts transactions that would violate serializability — no phantom reads, no write skew.
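The key difference between READ COMMITTED and REPEATABLE READ is *when* the snapshot is taken. A toy model, with commit history as a dict from commit txid to value (names and values are illustrative):

```python
# A snapshot taken at txid t sees the newest value committed at or before t.
def read_at(history: dict, snapshot_txid: int) -> str:
    return history[max(t for t in history if t <= snapshot_txid)]

history = {10: "balance=100"}

# READ COMMITTED: each statement snapshots at its own start time.
stmt1 = read_at(history, snapshot_txid=11)   # "balance=100"
history[12] = "balance=50"                   # a concurrent txn commits
stmt2 = read_at(history, snapshot_txid=13)   # sees the new commit

# REPEATABLE READ: every statement reuses the transaction's first snapshot.
snap = 11
rr1 = read_at(history, snap)                 # "balance=100"
rr2 = read_at(history, snap)                 # still "balance=100"
```

Under READ COMMITTED the two statements disagree; under REPEATABLE READ the transaction's view never changes, even though another transaction committed in between.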
Write Skew and Phantom Reads
Write skew is a concurrency anomaly that REPEATABLE READ does not prevent. Example: two concurrent transactions both read that there is at least one on-call doctor, and each marks itself "off call" — both see the original count, both proceed, leaving zero on-call doctors. Under REPEATABLE READ, both transactions succeed (each sees a snapshot where the constraint holds, and their write sets do not overlap). Under SERIALIZABLE (SSI), PostgreSQL detects the read-write dependency and aborts one transaction. Phantom reads: a transaction reads a set of rows (SELECT * FROM orders WHERE amount > 1000); another transaction inserts a new qualifying row and commits; the first transaction reads again and sees the new row (phantom). In PostgreSQL, phantoms are prevented at REPEATABLE READ and SERIALIZABLE — its snapshot-based REPEATABLE READ is stricter than the SQL standard requires.
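The on-call example can be simulated directly. Both transactions check the invariant against the same stale snapshot, and because each writes only its own row, snapshot isolation sees no conflict (a sketch with illustrative names, not PostgreSQL internals):

```python
# Two doctors, both on call. Each transaction validates the invariant
# ("at least one doctor remains on call") against its own snapshot,
# then updates only its own row.
snapshot = {"alice": "on_call", "bob": "on_call"}

def go_off_call(doctor: str, snap: dict):
    on_call = sum(1 for s in snap.values() if s == "on_call")
    if on_call >= 2:                 # invariant holds... in the stale snapshot
        return {doctor: "off_call"}  # write set touches only this doctor's row
    return None                      # would refuse instead

w1 = go_off_call("alice", snapshot)  # both read the SAME snapshot
w2 = go_off_call("bob", snapshot)

table = dict(snapshot)
for w in (w1, w2):
    table.update(w)                  # disjoint write sets: both commit cleanly

assert all(s == "off_call" for s in table.values())  # invariant violated!
```

SSI would have tracked that each transaction read a row the other wrote, detected the dangerous dependency cycle, and aborted one of them.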
VACUUM and Dead Tuple Bloat
MVCC accumulates dead row versions (old versions no longer visible to any transaction). Dead tuples waste storage and slow sequential scans (the scan must examine dead tuples to determine visibility). VACUUM reclaims dead tuple storage. Autovacuum runs VACUUM automatically when dead tuples exceed a threshold (default: 20% of the table plus 50 rows). For high-write tables, autovacuum may not keep up with dead tuple accumulation — symptoms: table bloat, query latency regression, xid wraparound warnings. Tune autovacuum: lower autovacuum_vacuum_scale_factor (trigger VACUUM at 5% bloat instead of 20%), raise autovacuum_max_workers (more parallel VACUUM jobs), and run manual VACUUM ANALYZE after bulk inserts or updates.
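The trigger condition is a simple formula: autovacuum fires when dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples`. A quick calculation showing why the default scale factor is often too lax for large tables:

```python
def autovacuum_trigger(reltuples: int, threshold: int = 50,
                       scale_factor: float = 0.2) -> float:
    """Dead-tuple count at which autovacuum kicks in for a table of
    reltuples rows (PostgreSQL defaults: threshold=50, scale_factor=0.2)."""
    return threshold + scale_factor * reltuples

# A 1M-row table tolerates 200k dead tuples before autovacuum runs:
assert autovacuum_trigger(1_000_000) == 200_050
# Tuned to 5%, it triggers four times sooner:
assert autovacuum_trigger(1_000_000, scale_factor=0.05) == 50_050
```

For very large tables, many operators lower the scale factor further (or set a flat per-table threshold) so bloat stays bounded in absolute rather than relative terms.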
Transaction ID Wraparound
PostgreSQL transaction IDs are 32-bit integers, and comparisons between them are circular. A row version more than 2^31 transactions in the past suddenly appears to be in the future — and therefore invisible. If a database accumulates ~2 billion unfrozen transactions, it approaches transaction ID wraparound, a catastrophic condition where old data appears to vanish. VACUUM FREEZE prevents wraparound by replacing txids older than vacuum_freeze_min_age with a special frozen state that is always visible. PostgreSQL warns (and eventually refuses new transactions) as wraparound approaches. Monitor: SELECT max(age(datfrozenxid)) FROM pg_database — alert if this exceeds 1.5 billion. Emergency: run VACUUM FREEZE on affected tables immediately.
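The circular comparison can be sketched as signed 32-bit arithmetic on the difference, in the spirit of PostgreSQL's TransactionIdPrecedes (this is a simplified model; the real code also special-cases permanent xids):

```python
def xid_precedes(a: int, b: int) -> bool:
    """Circular 32-bit comparison: a precedes b iff the signed 32-bit
    interpretation of (a - b) is negative, i.e. b is less than 2^31
    steps 'ahead' of a on the xid circle."""
    diff = (a - b) & 0xFFFFFFFF
    return diff >= 2**31   # top bit set => negative as signed int32

assert xid_precedes(100, 200)                  # normal ordering
assert xid_precedes(2**32 - 10, 5)             # 5 is NEWER despite the smaller value
assert not xid_precedes(100, 100 + 2**31 + 1)  # >2^31 ahead: "flips" into the past
```

The last assertion is the wraparound hazard in miniature: once the distance exceeds 2^31, an old xid stops preceding new ones, so rows it created look like they come from the future. Freezing removes old xids from this comparison entirely.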
Optimistic vs Pessimistic Locking with MVCC
MVCC enables optimistic concurrency: readers never block writers; writers rarely block each other. This is why read-heavy workloads scale well on MVCC databases. However, concurrent writes to the same row still require conflict resolution. First-updater-wins: the second writer to update the same row blocks until the first commits; what happens next depends on the isolation level. Under READ COMMITTED, the second writer's snapshot is stale — it re-reads the row (which now has xmax set) and applies its update to the new version. Under REPEATABLE READ, the second writer instead aborts with a serialization error and must retry. Explicit locking (SELECT … FOR UPDATE): the transaction acquires a row-level lock that blocks other writers until it commits. Use SELECT … FOR UPDATE for operations that read and then update the same row to prevent lost updates — the read and write are atomic from a locking perspective.
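An application-level alternative to SELECT … FOR UPDATE is optimistic locking with a version column: read the row, then make the write conditional on the version being unchanged. A minimal sketch (the row layout and function names are illustrative, not a library API):

```python
# Lost-update prevention via compare-and-swap on a version column.
row = {"balance": 100, "version": 1}

def optimistic_withdraw(row: dict, amount: int, seen_version: int) -> bool:
    """Apply the update only if no one committed since our read;
    the SQL equivalent is UPDATE ... WHERE version = seen_version."""
    if row["version"] != seen_version:
        return False                 # stale read: caller must re-read and retry
    row["balance"] -= amount
    row["version"] += 1
    return True

v = row["version"]                   # read phase
assert optimistic_withdraw(row, 30, v)      # first writer wins
assert not optimistic_withdraw(row, 30, v)  # stale version: retry required
assert row["balance"] == 70                 # no lost update
```

The trade-off mirrors the section above: SELECT … FOR UPDATE blocks contenders up front (pessimistic), while the version check lets everyone proceed and forces losers to retry (optimistic) — better under low contention, worse under high contention.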
MVCC in Distributed Databases
Distributed MVCC (CockroachDB, Google Spanner, YugabyteDB) extends MVCC across nodes. The challenge: defining a "transaction timestamp" across machines with unsynchronized clocks. Spanner uses TrueTime — atomic clocks and GPS receivers bound clock uncertainty, and Spanner waits out the uncertainty interval before committing to ensure causal ordering. CockroachDB uses hybrid logical clocks (HLC), which pair each node's physical clock with a logical counter to track causality without specialized hardware. Distributed MVCC provides the same isolation guarantees as single-node MVCC, but cross-node conflict detection requires consensus (Paxos, Raft) for write ordering, typically adding tens of milliseconds of commit latency versus microseconds in single-node MVCC.
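The HLC update rules fit in a small class. This is a simplified sketch of the idea behind CockroachDB's clock, not its implementation: a timestamp is (wall, logical), and receiving a remote timestamp always advances the local clock past it, so causally-later events get larger timestamps even when physical clocks disagree.

```python
import time

class HLC:
    """Hybrid logical clock sketch: (wall, logical) pairs ordered
    lexicographically. 'wall' tracks the max physical time observed;
    'logical' breaks ties within one wall tick."""
    def __init__(self, now=lambda: int(time.time() * 1000)):
        self.now = now
        self.wall = 0
        self.logical = 0

    def tick(self):
        """Local event or message send."""
        pt = self.now()
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def update(self, remote):
        """Message receive: advance past max(local, remote, physical)."""
        pt, (rw, rl) = self.now(), remote
        if pt > self.wall and pt > rw:
            self.wall, self.logical = pt, 0
        elif rw > self.wall:
            self.wall, self.logical = rw, rl + 1
        elif rw == self.wall:
            self.logical = max(self.logical, rl) + 1
        else:
            self.logical += 1
        return (self.wall, self.logical)

# A node whose physical clock is stuck at 100 still orders correctly
# after receiving a timestamp from a faster clock:
node = HLC(now=lambda: 100)
t1 = node.tick()            # (100, 0)
t2 = node.update((150, 0))  # remote is ahead: clock jumps to (150, 1)
t3 = node.tick()            # (150, 2) — causality preserved
assert t1 < t2 < t3
```

Unlike TrueTime, the HLC never waits: it trades Spanner's external consistency guarantee for commodity hardware and zero commit-wait, which is exactly the design choice the paragraph above describes.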