Question 1

What is replication lag and what causes it to increase?

Accepted Answer

Replication lag is the delay between a write being committed on the primary and that same write being applied on a replica. In Postgres streaming replication, the primary writes to its WAL (write-ahead log), the replica's WAL receiver fetches it over the network, and the replica's startup (recovery) process replays it. Lag increases when: (1) the replica is CPU- or I/O-bound applying changes (WAL replay on a physical standby runs in a single startup process, so one process on the replica has to keep up with writes produced by many backends on the primary); (2) the network between primary and replica is slow or congested; (3) a long-running query on the replica conflicts with incoming WAL, for example a vacuum cleanup record that removes row versions the query can still see, so replay pauses until the query finishes or is cancelled after max_standby_streaming_delay; (4) the replica is underprovisioned relative to the primary's write rate. Monitor pg_stat_replication on the primary: SELECT client_addr, write_lag, flush_lag, replay_lag FROM pg_stat_replication. Alert when replay_lag exceeds your SLA.
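The alerting rule above can be sketched as a small filter over the rows returned by that query. This is a minimal illustration, not a monitoring tool: the function name lagging_replicas and the sample addresses are hypothetical, and the rows are hard-coded so the example runs without a database. It assumes the driver returns the lag columns as datetime.timedelta values (as psycopg2 does for interval columns) and that a NULL lag on an idle, caught-up standby should not trigger an alert.

```python
from datetime import timedelta

# Query from the answer above; run it on the primary.
LAG_QUERY = (
    "SELECT client_addr, write_lag, flush_lag, replay_lag "
    "FROM pg_stat_replication"
)

def lagging_replicas(rows, max_replay_lag):
    """Return the client_addr of every replica whose replay_lag exceeds the threshold.

    rows: iterable of (client_addr, write_lag, flush_lag, replay_lag) tuples.
    A replay_lag of None (SQL NULL) is treated as "no measurable lag".
    """
    lagging = []
    for client_addr, _write_lag, _flush_lag, replay_lag in rows:
        if replay_lag is not None and replay_lag > max_replay_lag:
            lagging.append(client_addr)
    return lagging

# Hard-coded sample rows standing in for cursor.fetchall().
rows = [
    ("10.0.0.11", timedelta(milliseconds=5), timedelta(milliseconds=8), timedelta(milliseconds=40)),
    ("10.0.0.12", timedelta(seconds=1), timedelta(seconds=2), timedelta(seconds=7)),
    ("10.0.0.13", None, None, None),
]

print(lagging_replicas(rows, timedelta(seconds=5)))  # ['10.0.0.12']
```

The same list can feed the router directly, so that replicas over the threshold are skipped until they catch up.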

Question 2

How does the read-your-writes problem manifest and when does it matter?

Accepted Answer

Read-your-writes consistency means that after a user performs a write, any subsequent read by that same user sees the write. Without it, a user updates their profile photo, refreshes the page, and sees the old photo, because the read was routed to a replica that had not yet replayed the write. The user perceives this as a bug. It matters most: (1) immediately after user-initiated writes (profile updates, settings changes, form submissions); (2) after payment or checkout (the user should immediately see their order confirmed); (3) after sending a message in a chat app. It matters least for: (1) reads of other users' data (nobody notices if a global leaderboard is 2 seconds stale); (2) analytics dashboards (minutes of lag is fine); (3) content feeds (a slightly stale feed is invisible to the user). Route accordingly: apply the timestamp freshness check only to requests from a user session that has written recently, and let all other reads go to replicas without it.
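One way to picture the per-session freshness check is the sketch below. It assumes the check compares the age of the session's last write with the replica's measured lag; the class name RecentWriteTracker, the safety margin, and the in-memory dictionary are all hypothetical choices for illustration. A real deployment with several application servers would need to keep the last-write timestamp somewhere shared, such as the session store.

```python
import time

class RecentWriteTracker:
    """Remembers when each session last wrote, to decide primary vs. replica reads."""

    def __init__(self, safety_margin_seconds=0.5, clock=time.monotonic):
        # The margin covers the fact that measured lag is only an estimate.
        self._safety_margin = safety_margin_seconds
        self._clock = clock
        self._last_write = {}  # session_id -> time of last write

    def record_write(self, session_id):
        self._last_write[session_id] = self._clock()

    def needs_primary(self, session_id, replica_lag_seconds):
        """True if the replica may not yet contain this session's latest write."""
        last = self._last_write.get(session_id)
        if last is None:
            return False  # session never wrote: any replica is fine
        age = self._clock() - last
        return age <= replica_lag_seconds + self._safety_margin
```

A session that wrote 0.5 seconds ago is sent to the primary when the replica reports 0.2 seconds of lag, while the same session two seconds later can read from the replica again. Sessions with no recorded write never pay the cost of the check.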

Question 3

How do connection pools interact with read replica routing?

Accepted Answer

The application keeps a separate connection pool for each database node (ThreadedConnectionPool in psycopg2 on the application side, or PgBouncer as an external pooler). When routing to a replica, the router calls replica.pool.getconn() to obtain a connection from that node's pool. Connection pool sizing: a primary pool of 20 connections plus 2 replica pools of 20 connections each is 60 connections per application server, 20 to each Postgres node. Postgres defaults to max_connections=100, so a fleet of 20 application servers would open 20 × 20 = 400 connections to each node and exceed this limit. Use PgBouncer as a connection multiplexer in front of each Postgres node: the application connects to PgBouncer (which listens on port 6432 by default, though deployments often remap it) instead of connecting to Postgres directly; PgBouncer maintains a small pool (10–20) of actual Postgres connections and queues requests when all of them are busy. Application-side pools can be large; Postgres only sees PgBouncer's small pool. This pattern is common in production deployments.
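The sizing arithmetic can be checked with a few lines. The helper names below are hypothetical, and the numbers are the ones from the answer: 20 application servers, 20 connections per node per server, and the Postgres default max_connections of 100. The sketch also subtracts superuser_reserved_connections (default 3), which is an added assumption about how much headroom ordinary clients really have.

```python
MAX_CONNECTIONS = 100          # Postgres default
SUPERUSER_RESERVED = 3         # Postgres default for superuser_reserved_connections

def direct_connections_per_node(app_servers, pool_size_per_node):
    """Connections one Postgres node sees when every app server pools directly."""
    return app_servers * pool_size_per_node

def fits(connections, max_connections=MAX_CONNECTIONS, reserved=SUPERUSER_RESERVED):
    """True if the connection count fits within what ordinary clients may use."""
    return connections <= max_connections - reserved

direct = direct_connections_per_node(app_servers=20, pool_size_per_node=20)
print(direct, fits(direct))          # 400 False

# With PgBouncer in front of the node, Postgres only sees PgBouncer's server pool.
pgbouncer_server_pool = 20
print(pgbouncer_server_pool, fits(pgbouncer_server_pool))  # 20 True
```

The application-side pool size then only affects how many clients PgBouncer has to queue, not how many backends Postgres has to run.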

Question 4

When should you route a "read" query to the primary instead of a replica?

Accepted Answer

Route to the primary when: (1) the query is part of a transaction that also writes (SELECT ... FOR UPDATE, or a BEGIN...COMMIT block that mixes reads and writes); a replica is read-only and cannot take row locks at all; (2) the query must satisfy read-your-writes (there was a recent write in this session); (3) the query reads data that must be absolutely current (a real-time inventory check before purchase, a live balance display before a debit); (4) all replicas are lagging beyond the acceptable threshold (the router falls back to the primary automatically). Route to a replica for everything else: analytics queries, reporting, search, list views, and any read where a few seconds of staleness is imperceptible. A common mistake is routing ALL reads to the primary "to be safe," which eliminates the capacity benefit of replicas entirely. Profile your query mix: in read-heavy applications it is common for 80–95% of queries to be safe to route to replicas.
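The four rules above translate into a short decision function. This is a sketch under assumptions: QueryContext and choose_target are hypothetical names rather than part of the ReadReplicaRouter described elsewhere, replica lag is passed in as a plain dictionary of seconds, and picking the least-lagged healthy replica is one possible policy (round-robin among healthy replicas would be equally valid).

```python
from dataclasses import dataclass

@dataclass
class QueryContext:
    in_write_transaction: bool = False     # rule 1: transaction also writes
    recent_write_in_session: bool = False  # rule 2: read-your-writes
    requires_current_data: bool = False    # rule 3: must be absolutely current

def choose_target(ctx, replica_lags, max_lag_seconds):
    """Return "primary" or the name of the replica that should serve the read.

    replica_lags: dict mapping replica name -> measured lag in seconds.
    """
    if ctx.in_write_transaction or ctx.recent_write_in_session or ctx.requires_current_data:
        return "primary"
    healthy = {name: lag for name, lag in replica_lags.items() if lag <= max_lag_seconds}
    if not healthy:
        return "primary"  # rule 4: every replica is beyond the threshold
    # Assumed policy: prefer the replica with the smallest lag.
    return min(healthy, key=healthy.get)

lags = {"replica-1": 0.4, "replica-2": 0.1}
print(choose_target(QueryContext(), lags, max_lag_seconds=2.0))  # replica-2
print(choose_target(QueryContext(recent_write_in_session=True), lags, 2.0))  # primary
```

Keeping the flags explicit on the context makes the "route everything to the primary" mistake visible in code review, because a caller has to state which rule forces the primary.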

Question 5

How do you handle the scenario where the primary fails and a replica is promoted?

Accepted Answer

Primary failure requires: (1) detecting the failure (the health check fails 3 times within 15 seconds); (2) selecting the replica with the lowest replication lag as the new primary (to minimize data loss); (3) promoting the replica (pg_promote() in Postgres, or via Patroni/pg_auto_failover); (4) updating the application's routing configuration to send writes to the new primary; (5) fencing the old primary so it cannot accept writes, and rejoining it as a replica if it recovers. In the application layer, the ReadReplicaRouter must be notified of the promotion, either via database configuration (update DbNode.role) or via ZooKeeper/etcd, where the HA manager writes the current primary endpoint. Application connections to the old primary will fail; connection retry logic (with exponential backoff) reconnects to the new primary. Data loss window: the replication lag of the promoted replica at the time of failure. With synchronous replication (synchronous_standby_names configured and synchronous_commit=on), a commit is acknowledged only after a standby has it, so promoting that standby loses no committed data, at the cost of higher write latency.
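Steps (1) and (2) and the retry schedule can be sketched in plain Python. Everything named here (FailureDetector, pick_promotion_candidate, backoff_delays) is hypothetical and stands in for what an HA manager such as Patroni does in practice. The sketch assumes that a successful health check resets the failure count, and it omits the random jitter that retry loops usually add so that the delays stay deterministic.

```python
from collections import deque

class FailureDetector:
    """Declares a node down after `threshold` failed checks within `window_seconds`."""

    def __init__(self, threshold=3, window_seconds=15.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of recent failed checks

    def record(self, ok, now):
        """Record one health-check result; return True if the node counts as down."""
        if ok:
            self._failures.clear()  # assumption: one success resets the count
            return False
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.threshold

def pick_promotion_candidate(replay_lag_by_replica):
    """Choose the replica with the lowest replay lag (seconds) to minimize data loss."""
    return min(replay_lag_by_replica, key=replay_lag_by_replica.get)

def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=6):
    """Yield capped exponential delays for reconnect attempts."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor

detector = FailureDetector()
results = [detector.record(False, t) for t in (0.0, 5.0, 10.0)]
print(results)  # [False, False, True]
print(pick_promotion_candidate({"replica-1": 1.8, "replica-2": 0.3}))  # replica-2
print(list(backoff_delays(base=1.0, attempts=5)))  # [1.0, 2.0, 4.0, 5.0, 5.0]
```

In a real system the promotion itself and the fencing of the old primary should be left to the HA manager; the application only needs to learn the new primary endpoint and retry.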

Read Replica Routing System Low-Level Design: Lag-Aware Routing, Read-Your-Writes, and Failover


Core Data Model

Router Implementation

Read-Your-Writes Pattern

Replica Promotion on Primary Failure

Key Design Decisions