Low Level Design: Database Replication

Database replication copies data from one database server to one or more replicas. Replication underpins high availability, horizontal read scaling, geographic distribution, and disaster recovery. Understanding its mechanics — WAL shipping, binlog formats, GTID, conflict resolution, and failover — is critical for any backend engineer working with production databases.

Why Replicate?

Replication serves several distinct purposes. High availability: if the primary fails, a replica can be promoted to continue serving traffic with minimal downtime. Read scaling: route read queries to replicas, reducing load on the primary — effective when the read/write ratio is high. Geo-distribution: place replicas in regions closer to users to reduce read latency. Backup: take backups from a replica without impacting primary performance. Each use case has different requirements for replication lag tolerance, consistency guarantees, and failover speed, which influences the replication mode and topology chosen.

Synchronous vs Asynchronous Replication

In synchronous replication, the primary waits for at least one replica to acknowledge receipt and application of each write before returning success to the client. This guarantees zero data loss on primary failure — the replica always has the latest committed data. The cost is increased write latency: every write incurs at least one round trip to the replica. Network hiccups or a slow replica directly degrade write performance. In asynchronous replication, the primary commits locally and returns success immediately; replication happens in the background. Write latency is unaffected by replica performance, but a primary failure can result in losing the most recent writes that hadn’t yet replicated. Most systems default to async and offer sync as an option for critical data.
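The tradeoff can be sketched in a few lines of Python. This is a toy model, not any real driver API: Primary, Replica, and the delay value are invented for illustration.

```python
import time

class Replica:
    """Toy replica that applies writes after a simulated network delay."""
    def __init__(self, delay_s: float):
        self.delay_s = delay_s
        self.log: list[str] = []

    def apply(self, record: str) -> None:
        time.sleep(self.delay_s)       # simulates round trip + apply time
        self.log.append(record)

class Primary:
    def __init__(self, replica: Replica, synchronous: bool):
        self.replica = replica
        self.synchronous = synchronous
        self.log: list[str] = []
        self.pending: list[str] = []   # not yet shipped (async mode)

    def commit(self, record: str) -> None:
        self.log.append(record)            # local commit
        if self.synchronous:
            self.replica.apply(record)     # block until the replica has it
        else:
            self.pending.append(record)    # ship in the background later

# Sync: once commit() returns, the replica is guaranteed to have the write.
sync_primary = Primary(Replica(delay_s=0.001), synchronous=True)
sync_primary.commit("tx1")
assert sync_primary.replica.log == ["tx1"]

# Async: commit() returns immediately; a primary crash now loses "tx2".
async_primary = Primary(Replica(delay_s=0.001), synchronous=False)
async_primary.commit("tx2")
assert async_primary.replica.log == []
```

The second assertion is the whole story of asynchronous replication: the client saw a successful commit, but the replica has nothing yet.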

WAL Shipping: PostgreSQL Streaming Replication

PostgreSQL’s streaming replication is physical replication via WAL (Write-Ahead Log) shipping. Every change to the database — inserts, updates, deletes, even vacuums — is first written to the WAL as binary records. The primary streams these WAL records to standbys in real time. Standbys apply the records, maintaining a byte-for-byte copy of the primary’s data files. Because replication is at the storage level, standbys must run the same major PostgreSQL version and the same hardware architecture. Physical replication is simple and highly reliable, but it replicates everything — you cannot replicate a subset of tables or filter by row. Standbys can serve read queries in hot standby mode while continuously applying WAL.
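Positions in the WAL stream are identified by LSNs (log sequence numbers), which PostgreSQL prints as two hex fields like 16/B374D848: the high and low 32 bits of a 64-bit byte offset. A sketch of how lag in bytes is derived from the positions that pg_stat_replication exposes (function names here are our own, not PostgreSQL's):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' to an absolute
    byte offset: high 32 bits, '/', low 32 bits, both in hex."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replication_lag_bytes(primary_lsn: str, standby_replay_lsn: str) -> int:
    """Byte distance between the primary's current WAL position and the
    position a standby has replayed."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_replay_lsn)

print(replication_lag_bytes("16/B374D848", "16/B3748000"))  # → 22600
```

A standby that has replayed up to the primary's current position has zero byte lag; monitoring systems typically alert on this distance growing over time.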

Logical Replication: Row-Level Operations

Logical replication decodes WAL records into row-level operations (INSERT, UPDATE, DELETE) and replicates those operations rather than raw bytes. PostgreSQL’s logical replication uses a publication/subscription model: the publisher defines which tables to publish; subscribers connect and receive the decoded changes. This enables partial replication (replicate only selected tables), cross-version replication (primary on PG 14, replica on PG 15), and even cross-database-type replication via logical decoding plugins. The tradeoff: logical replication requires primary keys or replica identity on tables and has more CPU overhead for decoding. It also does not replicate DDL changes (schema changes must be applied manually to subscribers).

MySQL Binlog: Row, Statement, and Mixed Formats

MySQL replication is driven by the binary log (binlog). Three formats exist. Statement-based replication logs the SQL statement itself — compact, but non-deterministic functions (NOW(), UUID(), RAND()) produce different results on the replica, causing divergence. Row-based replication logs the before and after image of each affected row — safe for non-deterministic statements, but verbose for large batch updates. Mixed format uses statement-based by default and switches to row-based for statements known to be unsafe. Production systems should use row-based replication for safety. Row-based binlog is also required for CDC tools like Debezium.
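The divergence risk can be shown with a toy model: statement-based replication re-executes the statement on each server, so a non-deterministic function evaluates independently, while row-based replication copies the value the primary actually wrote. The seeds stand in for each server's independent RNG state and are chosen purely for illustration.

```python
import random

def apply_statement_based(seed: int) -> int:
    """Re-executes `UPDATE t SET x = RAND()`; RAND() evaluates
    independently on each server (seed models per-server RNG state)."""
    random.seed(seed)
    return random.randint(0, 1_000_000)

def apply_row_based(row_image: int) -> int:
    """Replays the recorded after-image; identical result everywhere."""
    return row_image

# Statement-based: primary and replica each run the statement themselves.
primary_value = apply_statement_based(seed=1)
replica_value = apply_statement_based(seed=2)
assert primary_value != replica_value      # silent divergence

# Row-based: the primary logs what it wrote; the replica copies it.
assert apply_row_based(primary_value) == primary_value
```

The divergence is silent: both servers report replication as healthy while their data drifts apart, which is why row-based is the safe default.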

GTID: Global Transaction Identifiers

GTID (Global Transaction Identifier) assigns a unique ID to every committed transaction across the entire replication topology. Each GTID is a pair: source_uuid:transaction_seq. Replicas track which GTIDs they have applied. On failover, a new primary can be identified and replicas can automatically re-point to it — they simply request transactions they haven’t applied yet, identified by GTID rather than by binlog file name and position. Without GTID, promoting a new primary and re-pointing replicas requires manually finding the correct binlog position, which is error-prone. GTID makes automated failover (via tools like Orchestrator or ProxySQL) reliable and is strongly recommended for any production MySQL replication setup.
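The re-pointing logic reduces to set difference over GTID sets. A sketch, assuming the compact single-interval text form like uuid:1-7 (real MySQL GTID sets use full server UUIDs and allow multiple intervals per UUID; the UUID below is shortened for readability):

```python
def parse_gtid_set(gtid_set: str) -> dict[str, set[int]]:
    """Parse 'uuid:1-5,uuid2:3' into {uuid: {1..5}, uuid2: {3}}."""
    out: dict[str, set[int]] = {}
    for part in gtid_set.split(","):
        uuid, interval = part.strip().rsplit(":", 1)
        lo, _, hi = interval.partition("-")
        out.setdefault(uuid, set()).update(range(int(lo), int(hi or lo) + 1))
    return out

def gtids_to_send(primary_executed: str, replica_executed: str) -> dict[str, set[int]]:
    """What a re-pointed replica requests: everything the new primary has
    executed that the replica has not. No binlog file/position needed."""
    want = parse_gtid_set(primary_executed)
    have = parse_gtid_set(replica_executed)
    gaps = {u: s - have.get(u, set()) for u, s in want.items()}
    return {u: s for u, s in gaps.items() if s}

gap = gtids_to_send("3E11FA47-71CA:1-7", "3E11FA47-71CA:1-5")
assert gap == {"3E11FA47-71CA": {6, 7}}
```

This is exactly the exchange that makes automated failover safe: the replica states what it has, and the new primary streams only the gap, with no human finding file offsets.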

Replication Lag: Causes and Mitigation

Replication lag is the delay between a write on the primary and its application on a replica. Causes include: write bursts that overwhelm the replica’s single-threaded apply pipeline, heavy single transactions (large batch deletes), or replica hardware that can’t keep up with the primary. Monitor lag via SHOW REPLICA STATUS and Seconds_Behind_Source in MySQL (SHOW SLAVE STATUS and Seconds_Behind_Master before 8.0.22) or pg_stat_replication in PostgreSQL. Mitigation strategies: parallel replication, which applies independent transactions concurrently (MySQL offers per-schema parallelism and per-transaction writeset-based parallelism; PostgreSQL 16+ adds parallel apply for large transactions in logical replication); ensure replica hardware matches the primary; avoid excessively large single transactions that serialize the apply pipeline.
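Writeset-based parallelism hinges on one rule: transactions whose writesets (the row keys they touch) don't overlap can be applied concurrently, while conflicting transactions must keep commit order. A greedy scheduler sketching that rule (a simplification of what real replicas do; names are invented):

```python
def parallel_apply_batches(transactions: list[tuple[str, set[str]]]) -> list[list[str]]:
    """Greedy scheduler: keep adding transactions to the current batch while
    their writesets are disjoint; a conflict closes the batch, so conflicting
    writes still apply in commit order, batch by batch."""
    batches: list[list[str]] = []
    current: list[str] = []
    touched: set[str] = set()
    for name, writeset in transactions:
        if writeset & touched:        # conflicts with the current batch
            batches.append(current)
            current, touched = [], set()
        current.append(name)
        touched |= writeset
    if current:
        batches.append(current)
    return batches

txs = [("t1", {"a"}), ("t2", {"b"}), ("t3", {"a", "c"}), ("t4", {"d"})]
assert parallel_apply_batches(txs) == [["t1", "t2"], ["t3", "t4"]]
```

t1 and t2 touch disjoint rows and apply in parallel; t3 conflicts with t1 on row "a", so it waits for the first batch. This also shows why one huge transaction serializes the pipeline: its writeset conflicts with everything after it.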

Multi-Master Replication and Conflict Resolution

Multi-master (active-active) replication allows writes to multiple primaries simultaneously. The fundamental challenge is conflict resolution: two clients update the same row on different masters at the same time — which write wins? Common strategies: last-write-wins using a timestamp or version number (simple, but can silently discard writes if clocks skew); application-level conflict detection with manual resolution; operational transforms (complex, used in collaborative editing). The safest approach is to avoid multi-master entirely: use a single-writer primary and route all writes there. Multi-master is justified only for geo-distributed writes where cross-region latency makes a single primary unacceptable, and even then partitioning writes by key range to avoid conflicts is preferable.

Change Data Capture with Debezium

Change Data Capture (CDC) turns the database replication log into a stream of business events. Debezium is the most widely used CDC tool: it acts as a logical replication client (MySQL binlog consumer or PostgreSQL logical replication subscriber), reads every row-level change, and publishes it as a structured event to Kafka. Downstream consumers can build read models, sync search indexes, update caches, or trigger workflows — all from the authoritative change stream without polling the database. CDC enables event-driven architectures without dual-writes. Key operational concerns: ensure the binlog or WAL is retained long enough for Debezium to catch up after downtime; monitor consumer lag; handle schema changes carefully since Debezium tracks schema history.
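A consumer of such a stream is mostly a dispatch on the operation type. The event shape below is modeled on Debezium's change-event payload (op codes c/u/d/r for create, update, delete, and snapshot read, with before/after row images); the dict-based cache is a stand-in for Redis or a search index:

```python
# Stand-in for a cache or search index kept in sync from the change stream.
cache: dict[int, dict] = {}

def handle_event(event: dict) -> None:
    """Apply one Debezium-style change event to the read model."""
    op, before, after = event["op"], event.get("before"), event.get("after")
    if op in ("c", "u", "r"):            # create, update, snapshot read
        cache[after["id"]] = after
    elif op == "d":                      # delete: key comes from the old row
        cache.pop(before["id"], None)

events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "alice"}},
    {"op": "u", "before": {"id": 1, "name": "alice"},
                "after": {"id": 1, "name": "alicia"}},
    {"op": "d", "before": {"id": 1, "name": "alicia"}, "after": None},
]
for e in events:
    handle_event(e)
assert cache == {}   # create, rename, delete: the read model ends empty
```

Because every event comes from the authoritative log, the read model converges to the database state without dual-writes or polling, which is the point of CDC.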

Failover: Promotion, STONITH, and Coordinators

When the primary fails, a replica must be promoted. The critical risk is split-brain: both the old primary (which may have recovered) and the new primary accept writes simultaneously, causing data divergence. STONITH (Shoot The Other Node In The Head) is the principle: before promoting a replica, positively kill or fence the old primary — via IPMI power-off, cloud API instance stop, or disk fence — so it cannot accept writes. External coordinators automate this: Patroni (PostgreSQL) uses etcd or ZooKeeper for distributed consensus on primary identity and automates promotion with fencing. MHA (MySQL) identifies the most up-to-date replica and promotes it, applying any missing relay log events from other replicas before cutting over. Proper failover automation with fencing is non-negotiable for production HA databases.
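The ordering constraint (fence first, promote second) can be made explicit in a few lines. A sketch of the coordinator's core decision, with invented names; real tools like Patroni and MHA add consensus, health checks, and relay-log reconciliation on top:

```python
def failover(replicas: list[dict], fence_old_primary) -> dict:
    """Fence-then-promote: refuse to promote anything until the old
    primary is confirmed dead (STONITH), then pick the replica with the
    highest applied position to minimize lost transactions."""
    if not fence_old_primary():   # e.g. cloud API stop, IPMI power-off
        raise RuntimeError("fencing failed: aborting to avoid split-brain")
    return max(replicas, key=lambda r: r["applied_position"])

replicas = [
    {"name": "replica-1", "applied_position": 1042},
    {"name": "replica-2", "applied_position": 1057},  # most caught up
]
new_primary = failover(replicas, fence_old_primary=lambda: True)
assert new_primary["name"] == "replica-2"
```

Note that a fencing failure aborts the whole failover rather than proceeding: a cluster that stays down is recoverable, while split-brain data divergence often is not.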
