Database Replication: Low-Level Design

Database replication maintains copies of a database on multiple servers for high availability, read scaling, and disaster recovery. Understanding replication modes (synchronous vs. asynchronous), topologies (primary-replica, multi-primary), and failure handling is fundamental for system design. PostgreSQL streaming replication, MySQL binlog replication, and MongoDB replica sets each implement these concepts differently.

Synchronous vs. Asynchronous Replication

Synchronous replication: the primary sends each write to all synchronous replicas and waits for their acknowledgment before confirming the commit to the client. The transaction is not visible to readers until at least one synchronous replica has persisted the change. This gives zero data loss (RPO = 0): if the primary fails immediately after a commit, the synchronous replica already has the data. The downside is write latency of roughly max(primary_commit_time, network RTT + replica_commit_time). With a replica in another datacenter (20 ms RTT), every write pays at least 20 ms — not acceptable for high-throughput write paths. In PostgreSQL this is controlled by synchronous_commit = on together with synchronous_standby_names.

Asynchronous replication (the common default): the primary commits locally and ships changes to replicas in the background. Write latency is the primary's commit latency only. Replicas typically lag 0-500 ms behind the primary. If the primary fails, data written in the lag window is lost (RPO = replication lag, usually under 1 second). Most systems use asynchronous replication for performance and accept the small RPO.

Semi-synchronous: wait for at least one replica to acknowledge, not all of them. This prevents data loss from a primary failure (at least one replica has the data) without the latency of waiting for every replica. MySQL offers this approach via its semi-synchronous replication plugin.
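The latency trade-off between the three modes can be sketched as a small model. This is purely illustrative (not a real database client); the function name and the ack times are hypothetical:

```python
# Illustrative sketch: client-visible commit latency under full-sync,
# semi-sync, and async replication, given per-replica ack times (ms).

def commit_latency_ms(primary_ms: float, replica_ack_ms: list[float], mode: str) -> float:
    """Time until the client sees the commit acknowledged."""
    if mode == "sync":        # wait for ALL synchronous replicas
        return max([primary_ms] + replica_ack_ms)
    if mode == "semi-sync":   # wait for the FIRST replica ack only
        return max(primary_ms, min(replica_ack_ms))
    if mode == "async":       # acknowledge as soon as the primary commits
        return primary_ms
    raise ValueError(f"unknown mode: {mode}")

# Local commit takes 2 ms; one same-DC replica (3 ms), one remote (20 ms RTT).
acks = [3.0, 20.0]
print(commit_latency_ms(2.0, acks, "sync"))       # 20.0 — pays the remote RTT
print(commit_latency_ms(2.0, acks, "semi-sync"))  # 3.0 — nearest replica wins
print(commit_latency_ms(2.0, acks, "async"))      # 2.0 — primary only
```

The semi-sync line shows why it is a popular middle ground: the nearest replica bounds the added latency while still guaranteeing one durable copy off the primary.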

Replication Mechanics: WAL and Binlog

PostgreSQL streaming replication: the primary writes all changes to the Write-Ahead Log (WAL) before applying them. WAL segments are streamed to replicas in real time, and each replica applies the WAL records to its own storage — “replaying” the primary’s writes. A replica is therefore a consistent snapshot of the primary at a slightly earlier point in time (the replication lag). Hot standby: replicas can serve read-only queries while replaying WAL, which enables read scaling — direct read-heavy queries to replicas.

MySQL binlog replication: the primary records all changes in the binary log (binlog). The replica’s IO thread fetches binlog events from the primary and writes them to a local relay log; the SQL thread then replays relay log events against the replica’s storage. GTID (Global Transaction ID) replication: each transaction gets a globally unique ID, which lets replicas track exactly which transactions they have applied — enabling accurate failover and replication topology changes without manual binlog file/position tracking.

Logical replication (PostgreSQL): instead of raw WAL bytes, stream higher-level operations (INSERT, UPDATE, DELETE on specific tables). This enables replicating a subset of tables, or replicating between different PostgreSQL versions — used for zero-downtime major version upgrades.
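The GTID bookkeeping idea can be sketched as follows. This is a toy model, not MySQL's actual wire protocol: each transaction is identified by a (server_uuid, sequence) pair, and a replica records the highest contiguous sequence it has applied per originating server:

```python
# Sketch of GTID-style bookkeeping (hypothetical, simplified): given what a
# replica has applied and what the source has executed, compute the gap the
# replica still needs to fetch — no binlog file/position tracking required.

def missing_transactions(executed: dict[str, int], source: dict[str, int]) -> dict[str, range]:
    """For each source server, the sequence numbers the replica still lacks."""
    gaps = {}
    for server, latest in source.items():
        applied = executed.get(server, 0)
        if applied < latest:
            gaps[server] = range(applied + 1, latest + 1)
    return gaps

primary_state = {"uuid-a": 120, "uuid-b": 45}   # latest txn per originating server
replica_state = {"uuid-a": 117, "uuid-b": 45}   # what this replica has applied
print(missing_transactions(replica_state, primary_state))
# {'uuid-a': range(118, 121)} — fetch transactions 118..120 from uuid-a
```

This is why GTIDs make repointing a replica at a new primary safe: the replica's own state, not a file offset on a dead server, determines where replication resumes.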

Failover and Promotion

When the primary fails, a replica must be promoted to become the new primary. Automated failover: a high-availability manager (Patroni for PostgreSQL, Orchestrator for MySQL, AWS RDS Multi-AZ) monitors the primary with health checks and automatically promotes the most up-to-date replica. Patroni uses a distributed consensus store (etcd or ZooKeeper) for leader election — only one node can become leader, preventing split-brain.

Promotion steps: (1) detect primary failure (missed heartbeats, connection refused); (2) run leader election among the replicas using consensus; (3) promote the winner (pg_promote() in PostgreSQL); (4) update DNS or the load balancer to point at the new primary; (5) reconfigure the remaining replicas to follow the new primary. Typical automated failover takes 30-60 seconds (health-check timeout + election + DNS propagation).

During failover, writes are rejected (the primary is down); reads from replicas continue, unless a replica is itself being promoted. Data loss: with async replication, writes committed on the old primary but not yet replicated are lost. Prevent this by making one hot-standby replica synchronous (synchronous for that one replica, async for the others).
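The "promote the most up-to-date replica" decision reduces to comparing replay positions. A minimal sketch, assuming each replica's health check reports its replayed WAL position as an abstract integer (in practice a tool like Patroni does this under consensus; all names here are illustrative):

```python
# Sketch of choosing a promotion candidate: pick the replica that has
# replayed the most WAL (highest LSN), minimizing data loss on promotion.

def pick_promotion_candidate(replayed_lsn: dict[str, int]) -> str:
    """Return the replica name with the highest replayed LSN."""
    if not replayed_lsn:
        raise RuntimeError("no replicas available to promote")
    return max(replayed_lsn, key=replayed_lsn.get)

# Replayed positions reported by each replica's last health check.
lsns = {"replica-1": 10_482, "replica-2": 10_499, "replica-3": 10_470}
print(pick_promotion_candidate(lsns))  # replica-2
```

Promoting anything other than the furthest-ahead replica would discard transactions that another replica already holds, which is why real managers compare positions before electing a leader.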

Read Scaling with Replicas

Read replicas serve read queries, offloading the primary and scaling read throughput. Application-level routing: the application maintains two connection pools — a primary pool (for writes and for reads that must be immediately consistent) and a replica pool (for reads that tolerate slight staleness). Reads that must see recent writes (immediately after a user action) go to the primary; analytical queries, reports, and dashboard data go to replicas.

Replication lag awareness: a user who just created a post should see it immediately on refresh — but the replica may not have received that write yet (the read-your-writes problem). Common mitigations: route a user’s reads to the primary for a short window after they write, or only read from replicas whose measured lag is under N milliseconds.

Read replica scaling: add replicas horizontally to handle more read traffic; each replica adds roughly one node’s worth of read capacity (minus the overhead of applying replication). For extreme read scale, add a caching layer (Redis) in front of the replicas to reduce database load further.
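Lag-aware routing can be sketched as a small router. This is an illustrative design, not a library API: recent writers are pinned to the primary for a short window (read-your-writes), and everyone else is sent to the freshest replica under a lag budget. The thresholds and names are hypothetical:

```python
import time

# Sketch of application-level read routing with replication-lag awareness.

class ReadRouter:
    def __init__(self, pin_seconds: float = 1.0, max_lag_ms: float = 200.0):
        self.pin_seconds = pin_seconds          # read-your-writes window
        self.max_lag_ms = max_lag_ms            # staleness budget for replicas
        self.last_write: dict[str, float] = {}  # user_id -> time of last write

    def record_write(self, user_id: str) -> None:
        self.last_write[user_id] = time.monotonic()

    def route_read(self, user_id: str, replica_lag_ms: dict[str, float]) -> str:
        # Pin recent writers to the primary so they see their own writes.
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.pin_seconds:
            return "primary"
        # Otherwise pick the freshest replica within the lag budget.
        fresh = {r: lag for r, lag in replica_lag_ms.items() if lag <= self.max_lag_ms}
        if fresh:
            return min(fresh, key=fresh.get)
        return "primary"  # all replicas too stale — fall back to the primary

router = ReadRouter()
router.record_write("alice")
print(router.route_read("alice", {"replica-1": 50.0}))                       # primary
print(router.route_read("bob", {"replica-1": 50.0, "replica-2": 400.0}))     # replica-1
```

The fallback to the primary when every replica exceeds the budget is deliberate: correctness degrades gracefully to "read from the source of truth" rather than serving arbitrarily stale data.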

Multi-Primary Replication

Multi-primary (multi-master) replication allows writes to any node — all nodes are peers. Use case: active-active multi-region deployments, where each region writes to its nearest node without paying cross-region write latency.

Conflict resolution: two nodes can receive conflicting writes to the same row simultaneously. Resolution strategies: (1) Last-write-wins (LWW): the write with the highest timestamp wins; the other write is lost. Requires synchronized clocks (NTP, hybrid logical clocks). (2) Application-level resolution: both conflicting versions are preserved and the application reconciles them (e.g., merging shopping carts). (3) Operational transformation or CRDTs: data structures designed for conflict-free merging (counters, sets).

CockroachDB and YugabyteDB use multi-primary replication with strong consistency via consensus (Raft) — each write is confirmed by a majority of nodes before committing. There are no conflicts because only one node holds Raft leadership for each partition. This means higher write latency (a multi-node round trip) but linearizable consistency. Galera Cluster (MySQL): synchronous multi-primary with certification-based conflict detection — writes are certified against all other concurrent transactions, and conflicting ones are rolled back. Suitable for active-active setups where write conflicts are rare.
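Last-write-wins resolution (strategy 1) can be sketched in a few lines. The key detail is the deterministic tie-break on node ID: without it, two nodes comparing the same pair of versions with equal timestamps could each pick a different winner and never converge. This is purely illustrative; real systems use NTP-disciplined or hybrid logical clocks:

```python
# Sketch of last-write-wins (LWW) conflict resolution between two nodes.
# A version is (timestamp, node_id, value); the highest (timestamp, node_id)
# pair wins, so every node deterministically converges on the same value.

def lww_merge(a: tuple[float, str, str], b: tuple[float, str, str]) -> tuple[float, str, str]:
    return a if (a[0], a[1]) >= (b[0], b[1]) else b

v_east = (1700000000.120, "node-east", "shipped")
v_west = (1700000000.050, "node-west", "cancelled")
print(lww_merge(v_east, v_west))
# node-east's later write wins — the "cancelled" write is silently lost,
# which is exactly the LWW downside noted above.
```

The silent loss of the losing write is why LWW is only acceptable when overwrites are semantically safe; otherwise the application-level or CRDT strategies above are the better fit.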
