Read Replica Routing — Low-Level Design
Read replica routing directs read queries to replica databases and write queries to the primary, scaling read throughput without adding write capacity. The key challenges are replication lag, connection management, and graceful failover. This design comes up in interviews at Amazon, Netflix, and any high-traffic service with a read-heavy workload.
Replication Architecture
Primary DB (single writer)
↓ async streaming replication (PostgreSQL WAL shipping)
↓
Replica 1 (read-only) ← lag: typically 10-100ms
Replica 2 (read-only) ← lag: typically 10-100ms
Replica 3 (read-only) ← configured as hot standby for failover
Writes: always → Primary
Reads: most queries → Replica 1 or 2 (load balanced)
lag-sensitive reads → Primary
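Proxy-level routers (ProxySQL, Pgpool-II) make this routing decision by inspecting each statement. A naive sketch of that classification, with hypothetical helper names and deliberately simplified rules (real proxies parse SQL properly and also track transaction state):

```python
# Naive read/write classifier, as a proxy-style router might apply.
# Illustrative only: a CTE like "WITH ... INSERT" would be misrouted here.
READ_KEYWORDS = {"select", "show", "explain", "with"}

def route_for(sql: str) -> str:
    """Return 'replica' for plain reads, 'primary' for everything else."""
    first = sql.lstrip().split(None, 1)[0].lower().rstrip(";")
    if first == "select" and "for update" in sql.lower():
        return "primary"  # Locking reads must run on the writer
    return "replica" if first in READ_KEYWORDS else "primary"

print(route_for("SELECT * FROM users"))                 # replica
print(route_for("UPDATE users SET name='x'"))           # primary
print(route_for("SELECT 1 FROM t FOR UPDATE"))          # primary
```

Application-level routing (below) avoids this guesswork entirely: the caller states its intent by choosing a reader or writer connection.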
Application-Level Routing
from sqlalchemy import create_engine

class DatabaseRouter:
    def __init__(self):
        self.primary = create_engine(PRIMARY_DB_URL, pool_size=20)
        self.replicas = [
            create_engine(REPLICA_1_URL, pool_size=20),
            create_engine(REPLICA_2_URL, pool_size=20),
        ]
        self._replica_idx = 0

    def writer(self):
        return self.primary

    def reader(self, require_fresh=False):
        if require_fresh:
            return self.primary  # Bypass replicas for lag-sensitive reads
        # Round-robin across replicas
        engine = self.replicas[self._replica_idx % len(self.replicas)]
        self._replica_idx += 1
        return engine

db = DatabaseRouter()

# Usage:
def get_user_profile(user_id):
    return db.reader().execute("SELECT * FROM User WHERE id=%(id)s", {'id': user_id})

def get_user_after_update(user_id):
    # Must read our own write; use the primary
    return db.reader(require_fresh=True).execute("SELECT * FROM User WHERE id=%(id)s", {'id': user_id})

def update_user(user_id, data):
    return db.writer().execute("UPDATE User SET ...", data)
Read-Your-Writes Consistency
Problem: a user updates their profile (a write to the primary), then immediately fetches it (a read from a replica). Because of replication lag, the replica may not have the update yet, so the user sees their old profile. Confusing.

Solution 1: read from the primary for a short window after a write.

def update_and_read(user_id, updates):
    db.writer().execute("UPDATE User SET ... WHERE id=%(id)s", updates)
    # Flag: "this user wrote recently; route their reads to the primary"
    redis.setex(f'wrote_recently:{user_id}', 10, '1')  # 10-second window

def get_user(user_id):
    if redis.exists(f'wrote_recently:{user_id}'):
        return db.reader(require_fresh=True).execute(...)
    return db.reader().execute(...)

Solution 2: return the primary's LSN (log sequence number) with the write response, and have the reader wait until the replica has replayed past that LSN before querying. In PostgreSQL, call pg_current_wal_lsn() on the primary at write time and compare it against pg_last_wal_replay_lsn() on the replica.
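The LSN comparison can be done client-side. PostgreSQL prints an LSN as two hex words separated by a slash (e.g. 0/16B3748). A minimal sketch with hypothetical helper names, assuming the two LSN strings were fetched via the SQL functions above:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL text LSN like '0/16B3748' to a byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def replica_caught_up(write_lsn: str, replay_lsn: str) -> bool:
    """True once the replica has replayed up to the write's LSN."""
    return parse_lsn(replay_lsn) >= parse_lsn(write_lsn)

# After a write, store pg_current_wal_lsn(); before a read, poll the
# replica's pg_last_wal_replay_lsn() and fall back to the primary on timeout.
print(replica_caught_up("0/16B3748", "0/16B3747"))  # False, still behind
print(replica_caught_up("0/16B3748", "1/0"))        # True
```

In practice you would bound the wait (e.g. 100ms) and route to the primary if the replica has not caught up, rather than blocking the request indefinitely.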
Lag Monitoring and Replica Health
def check_replica_lag():
    """Monitor replication lag; remove lagging replicas from the pool."""
    primary_lsn = db.writer().execute(
        "SELECT pg_current_wal_lsn()"
    ).scalar()
    for i, replica in enumerate(db.replicas):
        try:
            # pg_wal_lsn_diff returns the byte distance between two LSNs
            lag_bytes = replica.execute(
                "SELECT pg_wal_lsn_diff(%(p)s::pg_lsn, pg_last_wal_replay_lsn())",
                {'p': primary_lsn},
            ).scalar()
            lag_mb = lag_bytes / (1024 * 1024)
            if lag_mb > 100:  # 100MB lag threshold
                log.warning(f'Replica {i} lag: {lag_mb:.1f}MB, marking unhealthy')
                mark_replica_unhealthy(i)
            else:
                mark_replica_healthy(i)
        except Exception:
            mark_replica_unhealthy(i)

The health-aware reader replaces the simple round-robin version above:

def reader(self, require_fresh=False):
    if require_fresh:
        return self.primary
    healthy = [r for i, r in enumerate(self.replicas) if is_healthy(i)]
    if not healthy:
        return self.primary  # Fallback: all replicas unhealthy, use primary
    self._replica_idx += 1
    return healthy[self._replica_idx % len(healthy)]
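The fallback path is worth unit-testing in isolation. A dependency-free sketch of the same selection rules, with hypothetical names and health flags passed in explicitly:

```python
def pick_reader(replicas, healthy, idx, require_fresh=False):
    """Mirror the router's rules: fresh reads and an all-unhealthy
    pool both go to the primary; otherwise round-robin the healthy set."""
    if require_fresh:
        return "primary"
    alive = [r for r, ok in zip(replicas, healthy) if ok]
    if not alive:
        return "primary"  # Graceful degradation: never fail reads outright
    return alive[idx % len(alive)]

replicas = ["replica-1", "replica-2"]
print(pick_reader(replicas, [True, True], 0))                       # replica-1
print(pick_reader(replicas, [False, True], 0))                      # replica-2
print(pick_reader(replicas, [False, False], 0))                     # primary
print(pick_reader(replicas, [True, True], 0, require_fresh=True))   # primary
```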
Connection Pooling Strategy
Each app server opens connections to both the primary and the replicas. Without pooling, that is N app servers × (primary + 2 replicas) × pool_size connections: 100 servers × 3 databases × 20 = 6,000 connections, while PostgreSQL's practical limit is roughly 500-1,000.

Solution: run PgBouncer as a connection pooler in front of each database. App servers connect to PgBouncer (cheap); PgBouncer maintains a small pool of real PostgreSQL connections (~50).

Primary: app_servers → PgBouncer (primary) → PostgreSQL primary (50 conns)
Replica: app_servers → PgBouncer (replica) → PostgreSQL replica (50 conns)

Transaction-mode pooling: PgBouncer assigns a database connection only for the duration of a transaction, then returns it to the pool. This reduces required connections by 10-50×.
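As a sketch, a PgBouncer configuration for this topology might look like the following. Hostnames, database names, and pool sizes are illustrative, not tuned values:

```ini
; pgbouncer.ini (illustrative fragment)
[databases]
app_primary = host=pg-primary.internal port=5432 dbname=app
app_replica1 = host=pg-replica-1.internal port=5432 dbname=app

[pgbouncer]
listen_port = 6432
; Release the server connection back to the pool at transaction end
pool_mode = transaction
; Real PostgreSQL connections per database
default_pool_size = 50
; Cheap client-side connections from app servers
max_client_conn = 5000
```

The application then connects to pgbouncer:6432 and selects app_primary or app_replica1 as its target database, keeping the routing decision in the application while PgBouncer handles multiplexing.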
Key Interview Points
- Replication lag is the fundamental tradeoff: Async replication is fast but introduces eventual consistency. Any read that must see the latest write (read-your-writes, payment confirmation) must go to the primary. Route all other reads to replicas.
- Fallback to primary when all replicas are unhealthy: The router must gracefully degrade. If all replicas are lagging badly or down, route reads to primary rather than returning errors. This trades performance for availability.
- PgBouncer is essential at scale: PostgreSQL has a hard limit on connections (~500 for typical configs). Without a pooler, N app servers × pool_size quickly exhausts this limit. PgBouncer multiplexes thousands of application connections over ~50 database connections.
- Analytics queries must never run on the primary: A full-table-scan analytics query on the primary can spike CPU and slow down production writes. Dedicated analytics replicas (or a data warehouse copy) are the right solution for batch/BI queries.