Change Data Capture (CDC) Low-Level Design: Real-Time Database Streaming

Change Data Capture (CDC) streams every insert, update, and delete from a database as an ordered event log. Instead of polling for changes or adding triggers to every table, CDC taps the database’s replication log — the same mechanism used for database replicas. This gives you a real-time, complete, ordered stream of mutations with near-zero overhead on the source database.

How CDC Works at the Database Level

PostgreSQL writes every committed change to its Write-Ahead Log (WAL) before applying it. CDC tools like Debezium connect as a PostgreSQL replication client, read the WAL stream, decode row-level changes, and publish them to Kafka. The WAL records: the operation (INSERT/UPDATE/DELETE), the table name, the full before and after state of the row, the LSN (log sequence number — a monotonic position in the log), and the transaction ID.

-- Enable logical replication in PostgreSQL
-- postgresql.conf:
wal_level = logical
max_replication_slots = 4

-- Create a replication slot for Debezium
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

-- Create publication for tables you want to stream
CREATE PUBLICATION app_publication FOR TABLE orders, payments, users;

Debezium CDC Event Schema

// Kafka message produced by Debezium for an UPDATE to orders table
{
  "before": {
    "order_id": "abc123",
    "status": "pending",
    "updated_at": 1700000000
  },
  "after": {
    "order_id": "abc123",
    "status": "shipped",
    "updated_at": 1700001000
  },
  "source": {
    "db": "appdb",
    "table": "orders",
    "lsn": 12345678,       // log sequence number — position in WAL
    "txId": 9876,
    "ts_ms": 1700001000000
  },
  "op": "u",               // i=insert, u=update, d=delete, r=read (snapshot)
  "ts_ms": 1700001001234   // Debezium processing timestamp
}

Common CDC Use Cases

Cache invalidation: Consume CDC events and delete or update the corresponding Redis key. Eliminates cache staleness from direct DB writes that bypass the application cache layer.

def handle_order_change(event):
    order_id = event['after']['order_id'] if event['op'] != 'd' else event['before']['order_id']
    redis.delete(f"order:{order_id}")
    if event['op'] != 'd':
        redis.setex(f"order:{order_id}", 300, json.dumps(event['after']))

Search index sync: Consume CDC events and update Elasticsearch. Avoids dual-write race conditions where DB write succeeds but ES update fails.

def handle_product_change(event):
    if event['op'] == 'd':
        es.delete(index='products', id=event['before']['product_id'])
    else:
        es.index(index='products', id=event['after']['product_id'],
                 document=map_to_search_doc(event['after']))

Event sourcing / audit log: Every CDC event is a timestamped audit record. Route CDC events to an append-only store (S3, BigQuery) for compliance and historical queries.

Replication to read replicas or data warehouses: Stream CDC events to a read-optimized store (Redshift, BigQuery, ClickHouse) for analytics without loading the OLTP database.

Handling Ordering and Exactly-Once Semantics

CDC events within a single table are ordered by LSN. Events across tables in the same transaction arrive in the same Kafka partition (Debezium uses transaction ID as the partition key within a table). Consumers must track the last processed LSN to resume correctly after a restart.

CREATE TABLE cdc_consumer_offset (
    consumer_group  VARCHAR(100) PRIMARY KEY,
    last_lsn        BIGINT NOT NULL,
    updated_at      TIMESTAMPTZ DEFAULT NOW()
);

def consume_with_checkpoint(consumer_group):
    last_lsn = db.fetchone("SELECT last_lsn FROM cdc_consumer_offset WHERE consumer_group=%s",
                            [consumer_group])['last_lsn'] or 0
    for event in kafka.consume('orders.changes'):
        if event['source']['lsn'] <= last_lsn:
            continue  # already processed — skip
        process_event(event)
        # Update checkpoint atomically with the business operation
        db.execute("""
            INSERT INTO cdc_consumer_offset (consumer_group, last_lsn)
            VALUES (%s, %s)
            ON CONFLICT (consumer_group) DO UPDATE SET last_lsn=EXCLUDED.last_lsn
        """, [consumer_group, event['source']['lsn']])

Key Interview Points

  • CDC reads the replication log — overhead on the source DB is minimal (no triggers, no polling queries). The WAL is written anyway for replication; CDC just reads it.
  • The before-image is critical for cache invalidation and downstream deletes — always configure Debezium with REPLICA IDENTITY FULL on tables where you need the pre-change state.
  • CDC is at-least-once delivery — consumers must be idempotent. Use the LSN as an idempotency key.
  • WAL retention: PostgreSQL retains WAL until all replication slots have consumed it. A stalled CDC consumer will cause WAL accumulation and disk exhaustion. Set a replication slot lag limit and alert on it.
  • Initial snapshot: When first enabling CDC, Debezium does a consistent snapshot of existing rows (emitting "op": "r" read events) before streaming live changes. This ensures downstream systems start with complete data.
  • Schema evolution: CDC events embed the schema at the time of the change. Use Confluent Schema Registry with Avro to manage schema evolution across producers and consumers.

Change data capture and real-time data pipeline design is discussed in Databricks system design interview questions.

Change data capture and data pipeline reliability design is covered in LinkedIn system design interview preparation.

Change data capture and event streaming design is discussed in Uber system design interview guide.

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

Scroll to Top