Read Scaling Overview
Read scaling addresses the most common bottleneck in production systems: reads vastly outnumber writes (often 10:1 to 100:1 or more), and a single database primary cannot serve that volume without degradation. The solution is to distribute reads across multiple independent components, each optimized for a different access pattern and consistency requirement. The four primary techniques are: read replicas (physically separate read databases), caching (Redis or Memcached in front of replicas), CQRS read models (purpose-built query stores), and CDN (static and semi-static content). A mature system combines all four.
Read Replica Scaling
Each read replica is a full copy of the primary database, receiving WAL or binlog changes continuously. Reads are distributed across replicas via a router that knows each replica's capacity weight and current lag. Adding a new replica scales read throughput linearly — a cluster with four replicas can serve roughly four times the read QPS of a single replica. The router must health-check replicas continuously: a lagging or unreachable replica is removed from the pool until it recovers.
Replica Lag Budget
Every application has an implicit or explicit staleness tolerance. For a social feed, 5-second-old data is acceptable. For a payment balance, stale data may cause overdrafts. The lag budget defines the maximum acceptable replication lag. The router measures each replica's apply_lsn against the primary's current LSN; replicas with lag exceeding the budget are excluded from the read pool. If all replicas exceed the budget (e.g., during a replication storm), reads fall back to the primary.
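The lag-budget filter above can be sketched as a pure function. This is a minimal illustration, not the router's actual code: it assumes LSNs are already converted to byte offsets (on PostgreSQL, `pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)` yields the byte lag directly), and the function and parameter names are hypothetical.

```python
from typing import Dict, List

# Hypothetical lag-budget filter: given the primary's current LSN and each
# replica's last-applied LSN (both as byte offsets), keep only replicas
# whose unapplied-WAL backlog is within the budget.
def within_lag_budget(primary_lsn: int,
                      replica_lsns: Dict[str, int],
                      max_lag_bytes: int) -> List[str]:
    eligible = []
    for replica_id, apply_lsn in replica_lsns.items():
        lag = primary_lsn - apply_lsn  # bytes of WAL not yet applied
        if lag <= max_lag_bytes:
            eligible.append(replica_id)
    return eligible

# Example: replica-b is 8 MB behind and is excluded under a 1 MB budget.
pool = within_lag_budget(
    primary_lsn=100_000_000,
    replica_lsns={'replica-a': 99_900_000, 'replica-b': 92_000_000},
    max_lag_bytes=1_000_000,
)
print(pool)  # ['replica-a']
```

A real router would refresh these LSNs from health checks every few hundred milliseconds and fall back to the primary when the returned pool is empty.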
Cache Layer Integration
A read-through cache (Redis, Memcached) in front of the replica pool avoids replica I/O for frequently accessed data. On a cache hit, the database is not touched at all. On a cache miss, the query is routed to a replica, the result is cached with a TTL, and subsequent reads within the TTL are served from cache. TTL must be tuned per entity type: user profile (60s), product catalog (300s), real-time price (0 — do not cache). Cache invalidation on write is handled by the application (delete or update the cache key on primary write) or by a change-data-capture (CDC) stream that invalidates keys as writes are observed.
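The per-entity TTL policy can be made concrete with a small sketch. This is an in-process illustration under assumed names (`ReadThroughCache`, `TTL_BY_ENTITY`, the `loader` callback standing in for a replica query), not a production cache client:

```python
import time
from typing import Any, Callable, Dict, Tuple

# Per-entity TTLs from the text; a TTL of 0 means "never cache".
TTL_BY_ENTITY = {'user_profile': 60, 'product_catalog': 300, 'realtime_price': 0}

class ReadThroughCache:
    def __init__(self, loader: Callable[[str, str], Any]):
        self._loader = loader  # invoked on a miss, e.g. a replica query
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, entity_type: str, key: str) -> Any:
        ttl = TTL_BY_ENTITY.get(entity_type, 0)
        cache_key = f"{entity_type}:{key}"
        if ttl > 0 and cache_key in self._store:
            cached_at, value = self._store[cache_key]
            if time.time() - cached_at < ttl:
                return value                    # hit: replica never touched
        value = self._loader(entity_type, key)  # miss: go to a replica
        if ttl > 0:
            self._store[cache_key] = (time.time(), value)
        return value

    def invalidate(self, entity_type: str, key: str) -> None:
        # Called by the application after a primary write.
        self._store.pop(f"{entity_type}:{key}", None)
```

With `realtime_price` mapped to TTL 0, every price read bypasses the cache and hits a replica, while repeated `user_profile` reads within 60 seconds cost zero database I/O.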
CQRS Read Models
CQRS (Command Query Responsibility Segregation) separates the write model (commands, normalized schema, primary DB) from the read model (queries, denormalized, purpose-built store). The read model is maintained by consuming an event stream from the primary (Kafka, Kinesis, Debezium CDC). Common read store technologies:
- Elasticsearch: full-text search, faceted aggregations, nested JSON queries.
- DynamoDB: key-value lookups with predictable single-digit-millisecond latency at any scale.
- ClickHouse / BigQuery: analytical queries over billions of rows.
- Redis: pre-computed leaderboards, session data, small hot datasets.
The CQRS read model is eventually consistent with the primary by design. The staleness window equals the event stream processing lag — typically milliseconds to seconds for streaming pipelines.
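The projection side of CQRS can be sketched as an event consumer maintaining a denormalized structure. Everything here is illustrative — the event shapes and function name are invented, and a real pipeline would consume from Kafka/Debezium and write to a store like Elasticsearch or Redis rather than a dict:

```python
from typing import Dict, List

# Minimal CQRS projection sketch: apply an ordered event stream to a
# read model keyed for one query pattern — product names by category.
def apply_events(events: List[Dict], read_model: Dict[str, List[str]]) -> None:
    for event in events:
        if event['type'] == 'product_created':
            read_model.setdefault(event['category'], []).append(event['name'])
        elif event['type'] == 'product_deleted':
            items = read_model.get(event['category'], [])
            if event['name'] in items:
                items.remove(event['name'])

model: Dict[str, List[str]] = {}
apply_events([
    {'type': 'product_created', 'category': 'books', 'name': 'SQL 101'},
    {'type': 'product_created', 'category': 'books', 'name': 'CQRS in Action'},
    {'type': 'product_deleted', 'category': 'books', 'name': 'SQL 101'},
], model)
print(model)  # {'books': ['CQRS in Action']}
```

The read model lags the primary by however long events sit in the stream before this consumer applies them — exactly the staleness window described above.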
Query Routing Logic
The routing tier selects the data source based on the query's consistency requirement and type:
- Cache: check first; serve if hit and TTL not expired.
- Read replica: serve if the consistency level is eventual and at least one replica is within lag budget.
- Primary: serve if the consistency level is strong, if all replicas are lagging, or if the query type requires it (read-after-write, i.e. read-your-writes).
A CQRS read model is routed to when the query type matches the read store (e.g., full-text search routes to Elasticsearch regardless of replica availability).
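The selection rules above reduce to a small decision function. This sketch uses hypothetical names, and it assumes the inputs (`cache_hit`, `replica_within_budget`, `query_type`) have already been derived from the cache lookup, health checks, and query classification:

```python
# Hypothetical routing decision following the priority described above:
# cache, then specialized CQRS store, then replica, then primary.
def choose_source(consistency_level: str,
                  query_type: str,
                  cache_hit: bool,
                  replica_within_budget: bool) -> str:
    if cache_hit:
        return 'cache'
    if query_type == 'fulltext_search':
        return 'cqrs_read_model'  # e.g. Elasticsearch, regardless of replicas
    if consistency_level == 'eventual' and replica_within_budget:
        return 'replica'
    return 'primary'              # strong consistency or degraded pool
```

Note that the CQRS branch ignores replica availability entirely, matching the behavior described above for query types the primary schema cannot serve well.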
Read Amplification Monitoring
Key metrics to track for read scaling health:
- Cache hit rate: target 90%+ for heavily cached entities; below 70% indicates TTL misconfiguration or cache capacity issues.
- Replica QPS distribution: all replicas should receive roughly equal load (within capacity-weighted ranges); imbalance indicates router misconfiguration.
- Primary read QPS: should trend toward zero for cacheable workloads; rising primary read QPS indicates cache or replica pool degradation.
- p99 query latency: per tier (cache, replica, primary); cache should be sub-millisecond, replica under 10ms, primary varies by query complexity.
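The first and third metrics can be computed directly from routing-decision records. A minimal sketch, assuming records shaped like the ReadRoutingDecision rows in the SQL schema below (the function name is hypothetical):

```python
from typing import Dict, List

# Derive read-scaling health metrics from routing decision records.
def read_health(decisions: List[Dict]) -> Dict[str, float]:
    total = len(decisions)
    hits = sum(1 for d in decisions if d['cache_hit'])
    primary = sum(1 for d in decisions if d['routed_to'] == 'primary')
    return {
        'cache_hit_rate': hits / total if total else 0.0,
        'primary_read_share': primary / total if total else 0.0,
    }

sample = [
    {'routed_to': 'cache',   'cache_hit': True},
    {'routed_to': 'cache',   'cache_hit': True},
    {'routed_to': 'replica', 'cache_hit': False},
    {'routed_to': 'primary', 'cache_hit': False},
]
print(read_health(sample))  # {'cache_hit_rate': 0.5, 'primary_read_share': 0.25}
```

In production these would be computed over a sliding window (e.g. the last 5 minutes of decided_at timestamps) and alerted on: hit rate below 70%, or primary_read_share trending up, per the thresholds above.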
SQL Schema
CREATE TABLE ReadReplicaConfig (
id BIGSERIAL PRIMARY KEY,
replica_id VARCHAR(64) NOT NULL UNIQUE,
region VARCHAR(32) NOT NULL,
weight INT NOT NULL DEFAULT 1,
max_lag_ms INT NOT NULL DEFAULT 5000,
status VARCHAR(16) NOT NULL DEFAULT 'active'
);
CREATE TABLE ReadRoutingDecision (
id BIGSERIAL PRIMARY KEY,
query_hash VARCHAR(64) NOT NULL,
routed_to VARCHAR(32) NOT NULL, -- 'cache', 'replica', 'primary'
cache_hit BOOLEAN NOT NULL,
replica_id VARCHAR(64),
latency_ms INT NOT NULL,
decided_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_rrd_hash_ts ON ReadRoutingDecision (query_hash, decided_at DESC);
CREATE INDEX idx_rrd_routed ON ReadRoutingDecision (routed_to, decided_at DESC);
Python Implementation
import time
import hashlib
import random
from typing import Any, Dict, List, Optional

class Replica:
    def __init__(self, replica_id: str, weight: int = 1, max_lag_ms: int = 5000):
        self.replica_id = replica_id
        self.weight = weight
        self.max_lag_ms = max_lag_ms
        self.current_lag_ms: int = 0
        self.healthy: bool = True

class ReadScalingRouter:
    def __init__(self, replicas: List[Replica], cache_ttl: int = 60):
        self.replicas = replicas
        self.cache_ttl = cache_ttl
        self._cache: Dict[str, Any] = {}
        self._cache_ts: Dict[str, float] = {}

    def _cache_key(self, sql: str) -> str:
        return hashlib.md5(sql.encode()).hexdigest()

    def _cache_get(self, key: str) -> Optional[Any]:
        if key in self._cache:
            if time.time() - self._cache_ts[key] < self.cache_ttl:
                return self._cache[key]
            # TTL expired: evict so the next read repopulates the entry.
            del self._cache[key]
            del self._cache_ts[key]
        return None

    def update_cache_on_read(self, key: str, result: Any) -> None:
        self._cache[key] = result
        self._cache_ts[key] = time.time()

    def select_replica(self, lag_threshold_ms: int) -> Optional[Replica]:
        eligible = [
            r for r in self.replicas
            if r.healthy and r.current_lag_ms <= lag_threshold_ms
        ]
        if not eligible:
            return None
        # Weighted random selection
        total = sum(r.weight for r in eligible)
        pick = random.uniform(0, total)
        cumul = 0
        for r in eligible:
            cumul += r.weight
            if pick <= cumul:
                return r
        return eligible[-1]  # guard against float rounding at the boundary

    def route(self, sql: str, consistency_level: str = 'eventual') -> Dict:
        """
        Route a read query.
        consistency_level: 'eventual' (replica ok) or 'strong' (primary only).
        Returns a dict with routing decision metadata.
        """
        key = self._cache_key(sql)
        # Step 1: try cache
        cached = self._cache_get(key)
        if cached is not None:
            return {'source': 'cache', 'cache_hit': True, 'result': cached}
        # Step 2: try replica for eventual reads
        if consistency_level == 'eventual':
            replica = self.select_replica(lag_threshold_ms=5000)
            if replica:
                result = self._execute_on_replica(sql, replica)
                self.update_cache_on_read(key, result)
                return {'source': 'replica', 'replica_id': replica.replica_id,
                        'cache_hit': False, 'result': result}
        # Step 3: fall back to primary
        result = self._execute_on_primary(sql)
        self.update_cache_on_read(key, result)
        return {'source': 'primary', 'cache_hit': False, 'result': result}

    def _execute_on_replica(self, sql: str, replica: Replica) -> Any:
        # Placeholder: replace with a real DB driver call
        return f"result_from_{replica.replica_id}"

    def _execute_on_primary(self, sql: str) -> Any:
        # Placeholder: replace with a real DB driver call
        return "result_from_primary"
Frequently Asked Questions
How many read replicas do I need?
A rough formula: target_read_qps / (single_replica_qps / safety_factor), i.e. target_read_qps x safety_factor / single_replica_qps. If a single replica handles 5,000 QPS and you need 40,000 QPS with a 2x safety buffer, you need 40,000 / (5,000 / 2) = 16 replicas. In practice, cache hit rate dramatically reduces the replica count needed — at a 95% cache hit rate, replicas only need to handle 5% of the query volume. Instrument cache hit rate before sizing replicas.
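The formula, including the cache-hit-rate adjustment, can be captured in a few lines. A sketch with an invented function name; the inputs are the same quantities named above:

```python
import math

# Sizing sketch: replicas must absorb the read volume that misses the
# cache, with a safety multiplier for failover and traffic headroom.
def replicas_needed(target_read_qps: float,
                    single_replica_qps: float,
                    safety_factor: float = 2.0,
                    cache_hit_rate: float = 0.0) -> int:
    effective_qps = target_read_qps * (1.0 - cache_hit_rate)
    return math.ceil(effective_qps * safety_factor / single_replica_qps)

print(replicas_needed(40_000, 5_000, safety_factor=2.0))         # 16
print(replicas_needed(40_000, 5_000, 2.0, cache_hit_rate=0.95))  # 1
```

The second call shows why measuring hit rate first matters: the same workload drops from 16 replicas to 1 once 95% of reads are absorbed by the cache.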
Should I use cache or read replicas for read scaling?
Use cache first — it is cheaper and faster. A Redis cache serves 100,000+ QPS from a single node at sub-millisecond latency. A read replica handles 5,000-20,000 QPS at 1-10ms latency. Replicas are needed for queries that cannot be cached (low-hit-rate queries, highly dynamic data, queries requiring freshness guarantees). CQRS read models handle queries that require a fundamentally different data model than the primary schema (full-text search, aggregations).
How stale can CQRS read model data be?
Staleness depends on the event stream processing pipeline. With a Kafka consumer and a well-tuned consumer lag, staleness is typically 100ms to 2 seconds. Batch-based pipelines (hourly ETL) produce much higher staleness. The acceptable staleness window is a business decision: for a product catalog read model, minutes are fine; for an inventory read model used to prevent overselling, staleness must be below the checkout transaction latency (milliseconds).
What is the correct priority order for read routing?
The standard priority is: (1) cache — fastest, cheapest, no DB load; (2) CQRS read model — when the query type requires a specialized store; (3) read replica — for general SQL queries with eventual consistency tolerance; (4) primary — only for strong consistency requirements or when all replicas are unhealthy or lagging beyond the budget. Routing directly to the primary for cacheable queries is the most common read scaling anti-pattern.
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How many read replicas do I need?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Estimate: target_read_qps / (single_replica_qps / safety_factor). At 95% cache hit rate you only need replicas for 5% of query volume. Always instrument cache hit rate before sizing replicas — it dramatically reduces the number needed."
      }
    },
    {
      "@type": "Question",
      "name": "Should I use cache or read replicas for read scaling?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Use cache first — Redis serves 100,000+ QPS at sub-millisecond latency from a single node. Replicas handle queries that cannot be cached. CQRS read models handle queries requiring a different data model (full-text search, aggregations) than the primary schema."
      }
    },
    {
      "@type": "Question",
      "name": "How stale can CQRS read model data be?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "With Kafka streaming, staleness is typically 100ms to 2 seconds. Batch ETL pipelines produce minutes to hours of staleness. Acceptable staleness is a business decision — inventory read models used to prevent overselling need sub-second freshness; product catalog models can tolerate minutes."
      }
    },
    {
      "@type": "Question",
      "name": "What is the correct priority order for read routing?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Priority: (1) cache, (2) CQRS read model for specialized queries, (3) read replica for eventual consistency, (4) primary for strong consistency. Routing cacheable queries directly to the primary is the most common read scaling anti-pattern."
      }
    }
  ]
}
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do read replicas scale read throughput?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Read replicas receive a stream of changes from the primary via replication and serve SELECT queries independently, so adding replicas multiplies the cluster's total read capacity without touching the primary. The application or a proxy (e.g., ProxySQL, RDS Proxy) routes read queries across replicas using round-robin or least-connections load balancing."
      }
    },
    {
      "@type": "Question",
      "name": "How does a read-through cache reduce replica load?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "In a read-through cache, the application always reads from the cache (e.g., Redis or Memcached); on a miss the cache fetches the data from the database replica, stores it, and returns it — subsequent reads for the same key never reach the database. This absorbs hot-spot read traffic and reduces replica CPU and I/O, allowing fewer replicas to handle the same workload."
      }
    },
    {
      "@type": "Question",
      "name": "How does CQRS separate read and write scaling concerns?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "CQRS (Command Query Responsibility Segregation) maintains separate models for writes (commands) and reads (queries), allowing each to be optimized and scaled independently. The write side can use a normalized OLTP store, while the read side can use denormalized projections, search indexes, or caches that are rebuilt from events, so read traffic never competes with write throughput."
      }
    },
    {
      "@type": "Question",
      "name": "How is replica lag handled when reading from follower replicas?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Applications that need read-your-own-writes consistency can use sticky routing to direct a user's reads to the primary for a short window after a write, or they can send reads to a replica only after confirming that replica's applied LSN is at least as high as the write's LSN. For truly stale-tolerant reads, the application simply accepts eventual consistency and communicates lag bounds to end users."
      }
    }
  ]
}