Question 1

What is the difference between synchronous and asynchronous database replication?

Accepted Answer

Asynchronous replication: the primary commits a write and responds to the client immediately. The replica receives the change later, typically milliseconds to seconds afterward. Pros: low write latency, and the primary is not blocked by slow replicas. Cons: if the primary fails before a replica receives the change, that write is lost. Synchronous replication: the primary waits for its synchronous replica(s) to confirm that the change was received and persisted before responding. Pros: no acknowledged write is lost when failing over to a replica that confirmed it. Cons: write latency increases by the network round-trip to the replica (roughly 0.5-1 ms within a datacenter, 50-200 ms cross-region), and writes stall if the replica goes down. Semi-synchronous replication is the practical middle ground: with several replicas, wait for any one of them to acknowledge rather than all. This provides durability without being blocked by a single slow replica.
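The trade-off above can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not how any particular database implements replication: the names Primary, Replica, acks_required and simulate are hypothetical, the local commit is assumed to cost 1 ms, and the two replicas are assumed to sit 0.5 ms (same datacenter) and 100 ms (cross-region) away. Setting acks_required to 0 models async, 1 models waiting for a single acknowledgement, and 2 models waiting for every replica.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    rtt_ms: float                              # assumed round trip from the primary, in ms
    log: list = field(default_factory=list)    # changes this replica has persisted


@dataclass
class Primary:
    replicas: list
    local_commit_ms: float = 1.0               # assumed cost of the local commit
    log: list = field(default_factory=list)

    def write(self, change, acks_required):
        """Commit one change; return the latency the client observes, in ms."""
        self.log.append(change)
        # The fastest replicas acknowledge first, so the wait is the RTT of
        # the slowest replica among the acknowledgements we need.
        fastest_first = sorted(self.replicas, key=lambda r: r.rtt_ms)
        acked = fastest_first[:acks_required]
        for replica in acked:
            replica.log.append(change)
        wait_ms = acked[-1].rtt_ms if acked else 0.0
        return self.local_commit_ms + wait_ms

    def lost_on_failover(self):
        """Changes missing from the most up-to-date replica if the primary died now."""
        best = max(self.replicas, key=lambda r: len(r.log))
        return [c for c in self.log if c not in best.log]


def simulate(acks_required):
    """Run one write on a fresh primary and report (latency_ms, lost_changes)."""
    primary = Primary([Replica(rtt_ms=0.5), Replica(rtt_ms=100.0)])
    latency_ms = primary.write("INSERT comment 42", acks_required)
    return latency_ms, primary.lost_on_failover()


if __name__ == "__main__":
    for acks in (0, 1, 2):
        print(acks, simulate(acks))
```

With these assumed numbers, zero acknowledgements gives the lowest latency but leaves the write exposed, one acknowledgement adds only the nearby replica's round trip, and requiring both adds the full cross-region round trip to every write.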

Question 2

How does replication lag affect application behavior?

Accepted Answer

Replication lag is the delay between a write on the primary and its visibility on a replica. Common impacts: (1) Read-your-own-writes violation -- a user writes a comment, refreshes, and does not see it because the read went to a lagging replica. Fix: route reads for recently-written data to the primary. (2) Monotonic read violations -- a user sees newer data, refreshes, and sees older data because the second read went to a more lagged replica. Fix: pin a user session to a specific replica. (3) Stale search results -- search indexes fed from replicas trail the primary by at least the replication lag. Monitor lag as a key metric. In PostgreSQL, query the pg_stat_replication view on the primary for write_lag, flush_lag, and replay_lag. Alert when lag exceeds your threshold (e.g., 5 seconds).
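Fixes (1) and (2) can be combined in a small routing layer. The sketch below is one possible shape, not a standard API: ReadRouter, record_write, target_for_read and pin_window_s are hypothetical names, and it assumes that a fixed time window after a write (5 seconds here) comfortably exceeds the replication lag. A production system might instead compare log positions rather than rely on wall-clock time.

```python
import time
import zlib


class ReadRouter:
    """Send a user's reads to the primary shortly after they write,
    otherwise to a replica chosen deterministically per user."""

    def __init__(self, replicas, pin_window_s=5.0, clock=time.monotonic):
        self.replicas = list(replicas)
        self.pin_window_s = pin_window_s   # assumed to exceed typical lag
        self.clock = clock                 # injectable so tests can fake time
        self.last_write_at = {}            # user_id -> time of latest write

    def record_write(self, user_id):
        self.last_write_at[user_id] = self.clock()

    def target_for_read(self, user_id):
        wrote_at = self.last_write_at.get(user_id)
        if wrote_at is not None and self.clock() - wrote_at < self.pin_window_s:
            # Read-your-own-writes: the replica may not have the change yet.
            return "primary"
        # Monotonic reads: a stable hash keeps each user on the same replica.
        # crc32 is used because Python's built-in hash() of str varies per process.
        index = zlib.crc32(str(user_id).encode("utf-8")) % len(self.replicas)
        return self.replicas[index]


if __name__ == "__main__":
    router = ReadRouter(["replica-1", "replica-2"])
    print(router.target_for_read("alice"))   # one of the replicas
    router.record_write("alice")
    print(router.target_for_read("alice"))   # "primary" within the window
```

Pinning by user hash only preserves monotonic reads while the replica list is unchanged; adding or removing a replica can move a user to a different, possibly more lagged, node.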

Question 3

When should you use multi-master replication?

Accepted Answer

Multi-master accepts writes on any node, but concurrent writes to the same data on different nodes can conflict and must be resolved. Use multi-master only when: (1) Multi-region active-active is required -- users in each region write to the local database for low latency. (2) High write availability is critical -- no single node failure may block writes. Conflict resolution strategies: last-writer-wins (simple but silently discards data), application-level merge (custom logic per data type), or CRDTs (data structures that merge automatically). Production systems: DynamoDB Global Tables (LWW), CockroachDB (Raft consensus avoids conflicts), Cassandra (tunable consistency, with last-writer-wins by timestamp). For most applications, primary-replica with automatic failover is sufficient and much simpler. Only choose multi-master when the use case explicitly requires concurrent writes from multiple regions.
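To see why last-writer-wins "silently discards data" while a CRDT does not, consider two regions that each add one like to the same post at nearly the same time. The sketch below is a minimal illustration: lww_merge and GCounter are hypothetical names written for this example, and the G-Counter (grow-only counter) is just one simple CRDT, not the mechanism used by any of the systems named above.

```python
def lww_merge(a, b):
    """Last-writer-wins over (timestamp, value) pairs: keep the later write."""
    return a if a[0] >= b[0] else b


class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by per-node maximum."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, node_id, amount=1):
        self.counts[node_id] = self.counts.get(node_id, 0) + amount

    def merge(self, other):
        # Taking the maximum per node makes merge commutative, associative
        # and idempotent, so replicas converge regardless of delivery order.
        nodes = self.counts.keys() | other.counts.keys()
        return GCounter(
            {n: max(self.counts.get(n, 0), other.counts.get(n, 0)) for n in nodes}
        )

    def value(self):
        return sum(self.counts.values())


if __name__ == "__main__":
    # LWW: both regions read likes=10 and write 11; one increment vanishes.
    us_write = (100, 11)   # (timestamp, new like count) from region "us"
    eu_write = (101, 11)   # concurrent write from region "eu"
    print(lww_merge(us_write, eu_write))   # (101, 11): only one like survived

    # G-Counter: each region increments its own slot, then the states merge.
    us, eu = GCounter(), GCounter()
    us.increment("us")
    eu.increment("eu")
    print(us.merge(eu).value())            # 2: both likes are kept
```

The counter works because each node only ever changes its own slot, so there is never a true conflict to resolve; data types without such a structure still need LWW or application-level merge logic.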

Question 4

How do you choose a database replication strategy?

Accepted Answer

Match the strategy to the requirements: (1) Read scaling, moderate writes -- primary-replica with async replication and 2-3 read replicas. Handle lag at the application level. (2) High availability, single region -- primary-replica with semi-synchronous replication and automatic failover. Managed services such as AWS RDS Multi-AZ automate the standby and the failover. (3) High availability across regions -- async replication to a cross-region read replica. On primary region failure, promote the replica. Accept an RPO equal to the replication lag at the moment of failure. (4) Active-active multi-region writes -- multi-master with conflict resolution. Accept the complexity. (5) Analytics isolation -- dedicated read replica for heavy reporting queries. In interviews, state the replication mode and justify it, for example: "We use async primary-replica with 3 read replicas. The application routes each user's reads of their own recent writes to the primary for read-after-write consistency."
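The five cases above can be written down as a small decision helper, which is a convenient way to rehearse the reasoning. This is only a sketch of the list as stated: Requirements and choose_strategy are hypothetical names, the boolean flags are a simplification, and the precedence (multi-region writes first, then cross-region HA, then single-region HA) is an assumption about how overlapping requirements would be ranked.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    multi_region_writes: bool = False   # active-active writes in several regions
    cross_region_ha: bool = False       # must survive losing a whole region
    single_region_ha: bool = False      # must survive losing the primary node
    heavy_analytics: bool = False       # long reporting queries on production data


def choose_strategy(req):
    """Return the replication plan as a list of components, per the cases above."""
    if req.multi_region_writes:
        plan = ["multi-master with conflict resolution"]
    elif req.cross_region_ha:
        plan = ["async replication to a cross-region replica, promoted on region failure"]
    elif req.single_region_ha:
        plan = ["primary-replica, semi-synchronous, automatic failover"]
    else:
        plan = ["primary-replica, async, 2-3 read replicas"]
    if req.heavy_analytics:
        # Analytics isolation is additive: it applies on top of any base choice.
        plan.append("dedicated read replica for reporting queries")
    return plan


if __name__ == "__main__":
    print(choose_strategy(Requirements()))
    print(choose_strategy(Requirements(single_region_ha=True, heavy_analytics=True)))
```

Real decisions rarely reduce to four flags; the value of the exercise is that each branch forces a stated justification, which is what the interview answer should contain.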

System Design: Database Replication — Primary-Replica, Synchronous vs Async, Multi-Master, Read Replicas, Lag

Primary-Replica (Master-Slave) Replication

Synchronous vs Asynchronous Replication

Replication Lag and Its Impact

Multi-Master (Multi-Primary) Replication

Replication Topologies

Choosing a Replication Strategy