System Design Interview: Database Replication and High Availability

Why Database Replication?

A single database server is a single point of failure. Replication copies data from a primary (write) node to one or more replica (read) nodes, serving three goals: high availability (failover when the primary fails), read scalability (distribute SELECT queries across replicas), and geographic locality (place replicas close to users in other regions). Understanding replication internals — what can go wrong, how failover works, and what consistency tradeoffs you accept — is a core system design skill.

How MySQL Binary Log Replication Works

MySQL primary-replica replication has three threads:

  1. Dump thread (primary): reads the binary log (binlog) and sends events to replicas on demand
  2. I/O thread (replica): receives binlog events and writes them to the relay log (a local copy on the replica)
  3. SQL thread (replica): reads the relay log and applies events to the replica’s data

Two replication modes: statement-based (logs SQL statements — compact but non-deterministic for functions like NOW() or RAND()) and row-based (logs before/after row images — larger but deterministic). Mixed mode uses statement-based for safe statements and row-based for non-deterministic ones. Modern deployments use row-based for correctness.
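
The determinism difference can be illustrated with a toy Python sketch (an in-memory model, not actual MySQL internals): statement-based replay re-executes the non-deterministic function on the replica, while row-based replay copies the primary's resulting row images.

```python
import random

# Toy model: a "table" is a dict of id -> value. The primary runs an
# UPDATE that sets every value to RAND(); we replay it on replicas two ways.

def apply_statement(table, rng):
    """Statement-based: re-execute the SQL, so RAND() is re-evaluated locally."""
    for row_id in table:
        table[row_id] = rng.random()

def apply_rows(table, row_images):
    """Row-based: apply the after-images captured in the primary's binlog."""
    table.update(row_images)

primary = {1: 0.0, 2: 0.0}
apply_statement(primary, random.Random(42))      # primary executes the UPDATE
row_images = dict(primary)                       # binlog records resulting rows

replica_stmt = {1: 0.0, 2: 0.0}
apply_statement(replica_stmt, random.Random(7))  # replica re-runs RAND() itself
replica_row = {1: 0.0, 2: 0.0}
apply_rows(replica_row, row_images)

print(replica_row == primary)    # row-based replica matches the primary
print(replica_stmt == primary)   # statement-based replica silently diverged
```

The seeds stand in for the fact that the primary and replica have different runtime state, so re-executing `RAND()` (or `NOW()`, or `UUID()`) cannot be guaranteed to reproduce the same values.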

GTID: Global Transaction Identifiers

Traditional file-and-position replication (CHANGE MASTER TO MASTER_LOG_FILE=…, MASTER_LOG_POS=…) is brittle — changing the primary requires knowing the exact log file and position on the new primary. GTIDs (Global Transaction Identifiers) assign each transaction a unique ID: server_uuid:transaction_number (e.g., 3E11FA47-71CA-11E1-9E33-C80AA9429562:23). Every replica tracks which GTIDs it has applied. When a new primary is elected, each replica simply tells the new primary “I have applied up to GTID set X” — no log position arithmetic needed. This makes failover and re-pointing replicas dramatically simpler and less error-prone.
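
The "no log position arithmetic" claim comes down to set subtraction over GTID sets. A minimal Python sketch (a toy parser handling `uuid:lo-hi` interval syntax, not MySQL's full GTID-set grammar) shows how a new primary can compute exactly which transactions a replica is missing:

```python
def parse_gtid_set(s):
    """Parse 'uuid:1-5:7-9,uuid2:1-3' into {uuid: set of txn numbers}."""
    out = {}
    for part in s.split(","):
        fields = part.strip().split(":")
        uuid, intervals = fields[0], fields[1:]
        txns = out.setdefault(uuid, set())
        for iv in intervals:
            lo, _, hi = iv.partition("-")
            txns.update(range(int(lo), int(hi or lo) + 1))
    return out

def missing_transactions(primary_gtids, replica_gtids):
    """Transactions the new primary has executed that the replica still needs."""
    p, r = parse_gtid_set(primary_gtids), parse_gtid_set(replica_gtids)
    return {u: sorted(p[u] - r.get(u, set()))
            for u in p if p[u] - r.get(u, set())}

print(missing_transactions(
    "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-23",
    "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-20"))
# {'3e11fa47-71ca-11e1-9e33-c80aa9429562': [21, 22, 23]}
```

With file-and-position replication the equivalent computation requires mapping byte offsets between different servers' binlog files, which is exactly the arithmetic GTIDs eliminate.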

Replication Lag: Causes and Mitigations

Replication lag is the delay between a write on the primary and its appearance on a replica. Common causes: (1) Single-threaded SQL thread: the replica applies one transaction at a time, while the primary runs many concurrently. (2) Long-running transactions on the primary block smaller ones. (3) Network latency between primary and replica data centers.

Mitigation: parallel replication — MySQL 5.6 can apply transactions from different schemas in parallel, MySQL 5.7 adds logical-clock parallelism (transactions that group-committed together on the primary), and MySQL 8.0's WRITESET-based dependency tracking applies any non-conflicting transactions concurrently, regardless of schema. Monitor lag with Seconds_Behind_Source in SHOW REPLICA STATUS (Seconds_Behind_Master in SHOW SLAVE STATUS before MySQL 8.0.22). For read-your-own-write consistency: after a write, read from the primary for 1-2 seconds, or pass the transaction's GTID with the read request and wait for a replica to catch up to that GTID before serving.
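
The GTID-wait strategy can be sketched in Python (a toy router with a single monotonically increasing transaction number per replica standing in for a real GTID set; the class and function names are illustrative):

```python
import time

class Replica:
    def __init__(self, name, executed=0):
        self.name = name
        self.executed = executed  # highest txn number applied (toy GTID model)

def read_after_write(replicas, required_gtid, primary_name,
                     timeout=2.0, poll=0.05):
    """Serve a read that must observe `required_gtid`: poll briefly for any
    replica that has caught up, else fall back to reading from the primary."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for r in replicas:
            if r.executed >= required_gtid:
                return r.name          # replica has the write; safe to read
        time.sleep(poll)
    return primary_name                # no replica caught up in time

pool = [Replica("replica-1", executed=5), Replica("replica-2", executed=3)]
print(read_after_write(pool, required_gtid=4, primary_name="primary"))
print(read_after_write(pool, required_gtid=9, primary_name="primary",
                       timeout=0.1))
```

In real MySQL deployments the polling loop is replaced by running WAIT_FOR_EXECUTED_GTID_SET() on a candidate replica; the fallback-to-primary behavior on timeout is the same idea.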

Semi-Synchronous Replication

Standard asynchronous replication: the primary commits and returns success before any replica acknowledges receipt. If the primary crashes immediately after commit, the transaction may be lost (replica didn’t receive it yet). Semi-synchronous replication: the primary waits for at least one replica to acknowledge that the transaction was written to its relay log before returning success. The primary does NOT wait for the replica to apply the transaction — only for receipt. This eliminates the data-loss window in exchange for slightly higher write latency (one network round-trip to the nearest replica). PostgreSQL calls this synchronous_commit = remote_write.
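
The commit protocol can be sketched in Python (a toy model using queues as relay logs; function names are illustrative, and the timeout fallback mirrors MySQL's behavior when rpl_semi_sync_master_timeout expires):

```python
import queue
import threading

def commit_semi_sync(relay_queues, txn, ack_timeout=1.0):
    """Toy semi-sync commit: ship txn to every replica's relay log and return
    once ONE replica acknowledges receipt (not application). If no ack arrives
    in time, degrade to asynchronous commit rather than block forever."""
    acks = queue.Queue()

    def ship(relay):
        relay.put(txn)      # replica's I/O thread writes event to relay log...
        acks.put(True)      # ...and sends an acknowledgement back

    for relay in relay_queues:
        threading.Thread(target=ship, args=(relay,), daemon=True).start()
    try:
        acks.get(timeout=ack_timeout)
        return "committed (semi-sync)"
    except queue.Empty:
        return "committed (degraded to async)"

relay = queue.Queue()
print(commit_semi_sync([relay], {"gtid": 24}))
```

Note that the ack is sent after the relay-log write, before the SQL thread applies anything — that is exactly the "receipt, not application" guarantee described above.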

Read Replica Routing

With multiple replicas, the application must route reads. Common patterns:

  • Application-level routing: a connection pool (ProxySQL for MySQL, PgBouncer for PostgreSQL) routes SELECT to replicas and write statements to the primary. The proxy tracks which replicas are healthy and their lag, and removes lagging replicas from the read pool.
  • Read preference in driver: MongoDB drivers support readPreference: secondary — reads go to replicas by default.
  • Sharding: for datasets too large for a single primary, each shard has its own primary-replica set. A routing layer (Vitess for MySQL, mongos for MongoDB) directs queries to the correct shard.
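
The proxy-routing rule in the first bullet can be sketched in Python (a toy dispatcher — the dict fields and `max_lag` threshold are illustrative, not ProxySQL configuration):

```python
def route(statement, primary_name, replicas, max_lag=1.0):
    """Toy proxy rule: writes go to the primary; SELECTs go to the
    least-lagged healthy replica, falling back to the primary if every
    replica is unhealthy or lagging past the threshold."""
    sql = statement.lstrip().upper()
    if not sql.startswith("SELECT"):
        return primary_name
    healthy = [r for r in replicas if r["healthy"] and r["lag"] <= max_lag]
    if not healthy:
        return primary_name
    return min(healthy, key=lambda r: r["lag"])["name"]

replicas = [
    {"name": "replica-1", "healthy": True,  "lag": 0.2},
    {"name": "replica-2", "healthy": True,  "lag": 4.0},   # lagging: excluded
    {"name": "replica-3", "healthy": False, "lag": 0.1},   # failed health check
]
print(route("SELECT * FROM users", "primary", replicas))   # replica-1
print(route("UPDATE users SET name = 'x'", "primary", replicas))  # primary
```

Real proxies add nuance this sketch omits — transactions pin to the primary, and prepared statements and hint comments can override the rule — but the health-and-lag filter is the core of the pattern.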

Failover: Automatic Promotion

When the primary fails, a replica must be promoted. Manual failover takes minutes and requires human intervention. Automated failover tools:

  • MySQL Orchestrator / MHA: monitors primary health, elects the most up-to-date replica (most advanced GTID), promotes it, and re-points all other replicas to the new primary — typically in 10-30 seconds.
  • PostgreSQL Patroni: uses etcd or Consul to hold a distributed leader lock. The replica that acquires the lock promotes itself, and the other replicas reconfigure to follow the new leader. HAProxy or PgBouncer re-routes connections based on the leader key in etcd.
  • AWS RDS Multi-AZ: synchronous standby in another Availability Zone. Automatic failover in 60-120 seconds with minimal data loss (synchronous replication means the standby has all committed data).
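
The election step common to these tools — promote the most caught-up replica and re-point the rest — can be sketched in Python (a toy planner; the dict fields and the scalar `executed_txns` stand in for a full GTID-set comparison):

```python
def elect_new_primary(replicas):
    """Pick the replica with the most advanced replication position and
    produce a failover plan: promote it, re-point everyone else to it."""
    candidate = max(replicas, key=lambda r: r["executed_txns"])
    return {
        "promote": candidate["name"],
        "repoint": [r["name"] for r in replicas if r is not candidate],
    }

replicas = [
    {"name": "replica-1", "executed_txns": 118},
    {"name": "replica-2", "executed_txns": 123},  # most caught up: promoted
]
print(elect_new_primary(replicas))
# {'promote': 'replica-2', 'repoint': ['replica-1']}
```

Choosing the most advanced replica minimizes lost transactions under async replication: any transaction the failed primary committed but no surviving replica received is gone regardless of which candidate wins.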

During failover, in-flight writes may be lost (async replication) or the application may see connection errors. Design applications to retry writes with idempotency keys.
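
The idempotency-key pattern can be sketched in Python (a toy in-memory server — the `PaymentsAPI` name and its fields are illustrative): the server remembers each key's result, so a client that retries after a connection error cannot apply the same write twice.

```python
import uuid

class PaymentsAPI:
    """Toy server that deduplicates writes by idempotency key, so a retried
    write that actually committed before the failover is not applied twice."""
    def __init__(self):
        self.seen = {}     # idempotency key -> original response
        self.balance = 0

    def charge(self, key, amount):
        if key in self.seen:
            return self.seen[key]       # duplicate: replay original result
        self.balance += amount
        self.seen[key] = {"status": "ok", "balance": self.balance}
        return self.seen[key]

api = PaymentsAPI()
key = str(uuid.uuid4())                 # client generates the key once
first = api.charge(key, 100)
retried = api.charge(key, 100)          # client saw a connection error, retried
print(api.balance)                      # 100, not 200: retry was deduplicated
```

In production the `seen` map lives in the database itself (a unique constraint on the key column), so the deduplication survives the failover along with the data.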

Multi-Master (Active-Active) Replication

Multi-master replication allows writes on multiple nodes simultaneously — useful for multi-region active-active setups where both regions need low-latency writes. The fundamental problem: if two regions write different values to the same row concurrently, there is a conflict that must be resolved. Strategies:

  • Last Write Wins (LWW): the write with the higher timestamp wins. Simple but loses data if clocks are skewed (NTP is not precise enough for tight conflicts).
  • Custom conflict handlers: application-defined resolution (e.g., “higher balance wins” for an account balance).
  • CRDT (Conflict-free Replicated Data Types): data structures (G-Counter, OR-Set) whose merge operation is commutative, associative, and idempotent — merging concurrent updates in any order produces the same result. Used in Riak data types and Redis Enterprise active-active databases.
  • Avoid conflicts by design: route all writes for a given user to a single region. This is the most reliable approach — only cross-region reads are served from local replicas.
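
The CRDT bullet can be made concrete with a minimal G-Counter in Python (a standard textbook construction, here with illustrative region names): each node increments only its own slot, and merge takes the per-node maximum, so replicas converge no matter the merge order.

```python
class GCounter:
    """Grow-only counter CRDT: each node increments its own slot; merge takes
    the per-node maximum, so concurrent updates merge to the same value
    regardless of delivery order."""
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.counts = {n: 0 for n in nodes}

    def increment(self, n=1):
        self.counts[self.node_id] += n

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

nodes = ["us-east", "eu-west"]
a, b = GCounter("us-east", nodes), GCounter("eu-west", nodes)
a.increment(3); b.increment(2)       # concurrent writes in both regions
a.merge(b); b.merge(a)               # replicate in either direction
print(a.value(), b.value())          # both converge to 5
```

Note what the G-Counter gives up: it can only grow, and reads are eventually consistent — which is why CRDTs suit counters, sets, and presence data rather than, say, account balances with invariants.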

PostgreSQL Streaming Replication vs Logical Replication

PostgreSQL streaming replication sends WAL (Write-Ahead Log) segments to standby servers — a physical byte-level copy. The standby is a binary replica of the primary; it cannot have different indexes or table structures. Logical replication decodes WAL into row-level change events (INSERT/UPDATE/DELETE) and streams them. The subscriber can have different schemas, indexes, or even be a different PostgreSQL version. Use logical replication for: zero-downtime major version upgrades, selective table replication, and publishing to non-PostgreSQL consumers.

Interview Checklist

  • State your replication model: async (possible data loss) vs semi-sync (no data loss, higher latency)
  • Use GTIDs to simplify failover; avoid file-and-position for new deployments
  • Add ProxySQL or Patroni for automated routing and failover
  • Handle replication lag: read-your-own-writes via primary read or GTID wait
  • Avoid multi-master unless you have a specific conflict resolution strategy — prefer routing writes to one region
  • For active-active multi-region, consider CockroachDB or Google Spanner, which handle distributed consensus natively

Frequently Asked Questions

What is the difference between synchronous and asynchronous replication?

In asynchronous replication (the default), the primary commits a transaction and returns success to the client before any replica has acknowledged receipt. This gives the lowest write latency but risks data loss: if the primary crashes immediately after commit, the transaction may not have reached any replica yet. In synchronous replication (semi-synchronous in MySQL, synchronous_commit = remote_write in PostgreSQL), the primary waits for at least one replica to confirm receipt of the WAL/binlog data before returning success. The transaction cannot be lost — at least one replica has it — but write latency increases by one network round-trip to the nearest replica (typically 1-5ms within a data center, 50-150ms cross-region). AWS RDS Multi-AZ uses synchronous replication to a standby in another Availability Zone, providing near-zero data loss at failover time.

How does automatic failover work in a primary-replica database setup?

Automatic failover requires three components: health monitoring, leader election, and connection re-routing. A monitoring agent (MySQL Orchestrator, PostgreSQL Patroni, or AWS RDS) continuously checks the primary with heartbeat queries. When the primary fails to respond within a configurable threshold (typically 3-10 seconds), the agent initiates failover. Leader election: the replica with the most advanced replication position (highest executed GTID set in MySQL, highest LSN in PostgreSQL) is promoted to primary. Other replicas are re-pointed to the new primary using CHANGE REPLICATION SOURCE TO (MySQL) or primary_conninfo with standby.signal (PostgreSQL 12+; recovery.conf in older versions). Connection re-routing: a virtual IP or DNS TTL flip redirects application connections to the new primary. HAProxy or ProxySQL can detect the new primary via health checks and re-route within seconds. Total failover time: 10-60 seconds depending on the tool and timeout settings. Applications should implement connection retry logic to handle the brief interruption.

How do you handle read-your-own-write consistency with read replicas?

When a user writes data to the primary and immediately reads from a replica, they may not see their own write, because replication lag leaves the replica slightly behind. Three strategies: (1) Read from the primary for 1-2 seconds after a write: track the write timestamp per user session and route reads to the primary if current_time - last_write_time < 2s. Simple, but it increases primary read load. (2) GTID-based consistency: the primary returns the GTID of the committed transaction, the client passes this GTID with subsequent read requests, and the read router waits for a replica to have executed that GTID before serving the read (MySQL's WAIT_FOR_EXECUTED_GTID_SET function). The read goes to the most caught-up replica that has the required GTID. (3) Sticky sessions: route all requests from a given user to the same replica for a session window. Simple, but it concentrates load if heavy users all pin to the same replica. Option 2 is the most sophisticated and is used by systems like Vitess (MySQL) and CockroachDB.
