What Is a Data Replication Service?
A data replication service continuously synchronizes data from one or more source databases to one or more targets, enabling use cases like read replicas, analytics offload, disaster recovery, and multi-region active-active deployments. Change Data Capture (CDC) extracts a stream of database changes in real time, avoiding full-table scans. The service must handle conflicts when multiple sources write to the same key, monitor replication lag, and catch up after failures without data loss.
Requirements
Functional Requirements
- Capture row-level changes (INSERT, UPDATE, DELETE) from source databases via CDC (MySQL binlog, PostgreSQL WAL, or a connector such as Debezium).
- Apply changes to one or more target databases with configurable transformation and filtering.
- Configurable conflict resolution strategies: last-write-wins, source-priority, custom merge function.
- Monitor and expose replication lag per source-target pair.
- Catch-up mode: replay accumulated changes efficiently after a target has been offline.
Non-Functional Requirements
- End-to-end replication lag under 1 second for real-time use cases under normal load.
- Exactly-once delivery guarantees or idempotent apply semantics to prevent duplicate rows.
- Horizontal scalability: partition change stream by table or key range across multiple workers.
- Alerting when lag exceeds configurable thresholds or when error rate spikes.
Data Model
- ReplicationConfig — id, source_db_id, target_db_id, tables (list), conflict_strategy, filter_expression, transform_class.
- ChangeEvent — event_id (monotonic), source_db_id, table_name, operation (INSERT/UPDATE/DELETE), pk_value, before_image (JSON), after_image (JSON), source_timestamp, captured_at.
- ReplicationOffset — config_id (PK), last_applied_event_id, last_applied_at, lag_ms.
- ConflictRecord — event_id, config_id, conflict_type, resolution, resolved_at, resolver_class.
ChangeEvents flow through a durable message queue (Kafka or Kinesis) partitioned by (table, pk_value) to preserve per-key ordering. ReplicationOffset is the consumer group committed offset, persisted to allow resumption after failure.
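To make the data model concrete, here is a minimal sketch of the ChangeEvent record and the partitioning rule described above. The field names follow the data model; the types and the SHA-256 hashing choice are illustrative assumptions (Kafka's default partitioner would normally do this hashing).

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    """In-memory shape of the ChangeEvent record from the data model."""
    event_id: int
    source_db_id: str
    table_name: str
    operation: str                 # "INSERT" | "UPDATE" | "DELETE"
    pk_value: str
    before_image: Optional[dict]
    after_image: Optional[dict]
    source_timestamp: float
    captured_at: float

def partition_for(event: ChangeEvent, num_partitions: int) -> int:
    """Hash (table_name, pk_value) so every change to a given row lands
    on the same partition, preserving per-key ordering."""
    key = f"{event.table_name}:{event.pk_value}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the partition is a pure function of (table, primary key), two updates to the same row can never race across consumers.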
Core Algorithms
CDC-Based Change Capture
The capture component connects to the source database replication stream (MySQL binlog, PostgreSQL logical replication slot, or via Debezium). It deserializes each binlog event into a ChangeEvent with before and after images, enriches it with a monotonic event_id, and publishes to the Kafka topic partitioned by (table_name, pk_value). Partitioning by primary key guarantees that all changes to a given row are processed in order by the same consumer.
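The capture step can be sketched as follows. The raw-entry dicts stand in for deserialized binlog/WAL records and their field names are assumptions, not a real connector API; a production system would use Debezium or a binlog client library here.

```python
import itertools
import time

class CaptureLoop:
    """Sketch of the capture component: takes already-deserialized raw
    log entries, enriches them with a monotonic event_id and capture
    time, and hands them to a publish callback (e.g. a Kafka producer)."""

    def __init__(self, source_db_id, publish):
        self.source_db_id = source_db_id
        self.publish = publish
        self._next_id = itertools.count(1)

    def handle(self, raw):
        event = {
            "event_id": next(self._next_id),   # monotonic per capture process
            "source_db_id": self.source_db_id,
            "table_name": raw["table"],
            "operation": raw["op"],
            "pk_value": raw["pk"],
            "before_image": raw.get("before"),  # None for INSERT
            "after_image": raw.get("after"),    # None for DELETE
            "source_timestamp": raw["ts"],
            "captured_at": time.time(),
        }
        self.publish(event)
        return event
```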
Conflict Resolution
In multi-source scenarios the same row may be updated concurrently on two sources. When the applier detects that an incoming ChangeEvent has a source_timestamp earlier than the last applied event for the same key (a conflict), it invokes the configured conflict strategy. Last-write-wins compares source_timestamps and applies whichever is newer. Source-priority always prefers events from a designated primary source. Custom merge functions receive both before and after images and return a merged after image for deterministic resolution. All conflicts are recorded in ConflictRecord for auditing.
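The first two strategies can be sketched in a few lines; the event dicts and the tie-breaking rule (ties go to the incoming event) are assumptions for illustration.

```python
def last_write_wins(incoming, current):
    """Keep whichever version has the newer source_timestamp;
    ties go to the incoming event."""
    if incoming["source_timestamp"] >= current["source_timestamp"]:
        return incoming
    return current

def make_source_priority(primary_source_id):
    """Prefer events from the designated primary source; fall back to
    last-write-wins when neither (or both) sides are from the primary."""
    def resolve(incoming, current):
        inc_primary = incoming["source_db_id"] == primary_source_id
        cur_primary = current["source_db_id"] == primary_source_id
        if inc_primary and not cur_primary:
            return incoming
        if cur_primary and not inc_primary:
            return current
        return last_write_wins(incoming, current)
    return resolve
```

A custom merge function would have the same resolver signature but receive the before and after images of both versions.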
Catch-Up After Lag
When a target comes back online after an outage its ReplicationOffset lags behind the current head of the Kafka topic. The applier reads from the committed offset and applies events as fast as possible, limited only by the target database write throughput. Catch-up mode disables per-event lag alerting and uses a bulk apply path (batched upserts) rather than single-row applies to maximize throughput. Once lag drops below the threshold, the applier transitions back to normal streaming mode.
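The bulk apply path amounts to simple batching. In this sketch, `apply_batch` is an assumed callback (e.g. a batched upsert against the target); the batch size of 500 is illustrative.

```python
def drain_in_batches(events, apply_batch, batch_size=500):
    """Catch-up apply path: group accumulated events into batches so the
    target sees multi-row upserts instead of single-row writes.
    Returns the number of events applied."""
    applied = 0
    batch = []
    for ev in events:
        batch.append(ev)
        if len(batch) == batch_size:
            apply_batch(batch)
            applied += batch_size
            batch = []
    if batch:                      # flush the final partial batch
        apply_batch(batch)
        applied += len(batch)
    return applied
```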
API Design
- POST /replication/configs — Create a replication config specifying source, target, tables, conflict strategy, and filters.
- GET /replication/configs/{id}/lag — Returns current lag in milliseconds and last applied event timestamp.
- POST /replication/configs/{id}/pause — Pause applying changes to the target (capture continues; events accumulate in Kafka).
- POST /replication/configs/{id}/resume — Resume applying; triggers catch-up if lagged.
- GET /replication/conflicts — List recent conflict events with resolution details.
- POST /replication/configs/{id}/resync — Full resync: snapshot the source table and reset the offset to head.
Scalability and Reliability
Partitioned Apply Workers
Each Kafka partition is consumed by one apply worker. Scaling the number of partitions (and workers) scales apply throughput linearly. Within a partition events are applied sequentially to preserve ordering. Across partitions (different tables or key ranges) events are applied concurrently. This gives fine-grained parallelism without sacrificing per-key ordering guarantees.
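The routing invariant can be illustrated in a few lines. Here each event carries its Kafka partition number; the dict-of-queues representation is a stand-in for per-worker consumer assignments.

```python
from collections import defaultdict

def route_to_workers(events, num_workers):
    """Illustrative routing: all events from one partition go to one
    worker queue, so per-partition (and therefore per-key) order is
    preserved while distinct partitions can be applied concurrently."""
    queues = defaultdict(list)
    for ev in events:
        queues[ev["partition"] % num_workers].append(ev)
    return dict(queues)
```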
Exactly-Once Semantics
The applier uses idempotent upsert operations: INSERT … ON CONFLICT UPDATE using the primary key. If a worker crashes and restarts, it re-reads from the last committed ReplicationOffset. Re-applied events produce the same result as the original apply because upserts are idempotent. Kafka offset commits happen after the target database transaction commits, ensuring the offset is never advanced past an event that was not successfully applied.
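A minimal sketch of the apply-then-commit ordering, using sqlite3 as a stand-in for the target database; the `users` table and `commit_offset` callback are illustrative assumptions.

```python
import json
import sqlite3

def apply_and_commit(conn, event, commit_offset):
    """Idempotent apply: upsert by primary key inside a target-DB
    transaction, then commit the Kafka offset. A crash between the two
    steps means the event is re-applied on restart, producing the same
    row, so no duplicates appear."""
    with conn:  # opens a transaction; commits on success
        conn.execute(
            "INSERT INTO users (id, data) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET data = excluded.data",
            (event["pk_value"], json.dumps(event["after_image"])),
        )
    commit_offset(event["event_id"])  # only after the DB commit succeeded
```

Replaying the same event twice leaves exactly one row, which is what makes at-least-once delivery plus idempotent apply behave like exactly-once.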
Lag Monitoring and Alerting
Lag is computed as the difference between the current wall clock time and the source_timestamp of the last applied ChangeEvent. The service publishes lag_ms to a metrics system (Prometheus or CloudWatch) per config_id. Alerts fire when lag exceeds a configurable threshold (e.g., 5 seconds for real-time configs, 60 seconds for analytics configs). A heartbeat event is injected into the CDC stream every 10 seconds so lag can be measured even when there are no real changes.
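The lag computation can be sketched as a small per-config tracker; the class shape and threshold values are illustrative, not a specific metrics-library API.

```python
class LagMonitor:
    """Tracks per-config lag from the last applied source_timestamp.
    Both real ChangeEvents and injected heartbeats call record_applied,
    so lag stays measurable even when the source is idle."""

    def __init__(self, threshold_ms):
        self.threshold_ms = threshold_ms
        self.last_source_ts_ms = None

    def record_applied(self, source_ts_ms):
        self.last_source_ts_ms = source_ts_ms

    def lag_ms(self, now_ms):
        if self.last_source_ts_ms is None:
            return None                      # nothing applied yet
        return max(0, now_ms - self.last_source_ts_ms)

    def breached(self, now_ms):
        lag = self.lag_ms(now_ms)
        return lag is not None and lag > self.threshold_ms
```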
Interview Tips
Key discussion points:
- How do you handle schema changes on the source during replication? Use a schema registry with versioned event schemas; the applier must handle schema evolution gracefully.
- How do you resync a target that has diverged due to a bug in conflict resolution? Snapshot the source at a consistent point, load it into the target, and reset the offset to the snapshot LSN.
- What is the difference between replication lag and apply lag? Replication lag is end-to-end from source commit to target apply; apply lag is the applier's processing queue depth. Separating them helps diagnose whether the bottleneck is capture, transport, or apply.