Low Level Design: Cross-Datacenter Sync Service

What Is a Cross-Datacenter Sync Service?

A cross-datacenter sync service coordinates state between two or more physical data centers that may be owned by different teams, run different database engines, or operate under different compliance regimes. Unlike pure replication (which copies every write), a sync service is change-data-driven: it detects deltas, resolves semantic conflicts, and applies targeted patches so that each datacenter converges to a consistent view. Typical use cases include active-active database meshes, hybrid-cloud migrations, and federated identity systems.

Data Model and Schema

sync_checkpoint — tracks the last successfully synced position for each datacenter pair:

CREATE TABLE sync_checkpoint (
  source_dc    VARCHAR(32)  NOT NULL,
  target_dc    VARCHAR(32)  NOT NULL,
  entity_type  VARCHAR(64)  NOT NULL,
  last_seq     BIGINT       NOT NULL DEFAULT 0,
  last_sync_at DATETIME(6)  NOT NULL,
  PRIMARY KEY (source_dc, target_dc, entity_type)
);

change_event — normalized change record emitted by CDC (Change Data Capture) connectors:

CREATE TABLE change_event (
  event_id     CHAR(36)     NOT NULL,
  source_dc    VARCHAR(32)  NOT NULL,
  entity_type  VARCHAR(64)  NOT NULL,
  entity_id    VARCHAR(128) NOT NULL,
  operation    ENUM('INSERT','UPDATE','DELETE') NOT NULL,
  before_image JSON,
  after_image  JSON         NOT NULL,
  seq          BIGINT       NOT NULL,
  emitted_at   DATETIME(6)  NOT NULL,
  PRIMARY KEY (event_id),
  INDEX idx_seq (source_dc, entity_type, seq)
);

conflict_policy — per-entity-type resolution rules:

CREATE TABLE conflict_policy (
  entity_type  VARCHAR(64)  NOT NULL,
  strategy     VARCHAR(32)  NOT NULL,
  priority_dc  VARCHAR(32),
  custom_fn    VARCHAR(128),
  PRIMARY KEY (entity_type)
);
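The strategies this table configures can be sketched as a small dispatcher. This is an illustrative sketch, not the service's actual code: the function name resolve_conflict, the dict shapes, and the strategy strings are assumptions chosen to mirror the columns above.

```python
from datetime import datetime

def resolve_conflict(policy, incoming, current):
    """Return the row image that should win a divergence.

    policy:   dict mirroring a conflict_policy row (strategy, priority_dc)
    incoming: dict with 'after_image', 'emitted_at', 'source_dc'
    current:  dict with 'image', 'updated_at'
    """
    strategy = policy["strategy"]
    if strategy == "last_writer_wins":
        # Newer timestamp wins; ties go to the incoming event.
        if incoming["emitted_at"] >= current["updated_at"]:
            return incoming["after_image"]
        return current["image"]
    if strategy == "source_dc_wins":
        # The configured priority DC always wins the divergence.
        if incoming["source_dc"] == policy["priority_dc"]:
            return incoming["after_image"]
        return current["image"]
    if strategy == "field_merge":
        # Field-level merge: incoming non-null fields overwrite current ones.
        merged = dict(current["image"])
        merged.update({k: v for k, v in incoming["after_image"].items()
                       if v is not None})
        return merged
    raise ValueError(f"unknown strategy: {strategy}")
```

A custom_fn strategy would dispatch to a registered callable by name instead of one of the built-in branches.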

Core Algorithm and Workflow

  1. CDC capture: Debezium (or an equivalent connector) tails the binlog/WAL of each source database and emits normalized change_event records onto a shared message bus.
  2. Deduplication and ordering: Each consumer reads events for a given (source_dc, entity_type) partition in strict seq order. The sync_checkpoint table persists the high-water mark so a restarted consumer resumes from its last committed position; any events replayed past that mark are filtered by target-side idempotency, giving at-least-once delivery with effectively-once application.
  3. Schema translation: A pluggable transformer maps source field names and types to the target schema, handling nullable columns and type widening automatically.
  4. Conflict detection: Before applying an UPDATE, the service compares the event's before_image with the current target row. Divergence triggers the policy from conflict_policy: last-writer-wins, source-DC-wins, field-level merge, or custom function invocation.
  5. Atomic apply: Events are applied inside a target-DB transaction. On success, the checkpoint advances. On failure, the event is sent to a dead-letter queue for operator review.
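Steps 2 and 5 above can be sketched as a single partition loop: strict seq ordering, checkpoint advance only on success, dead-lettering on failure. The apply_txn callable and the dead_letters list are stand-ins for the real transactional writer and queue.

```python
dead_letters = []  # stand-in for the real dead-letter queue

def sync_partition(events, checkpoint, apply_txn):
    """Apply events in seq order, advancing the high-water mark on success.

    events:     iterable of change events for one (source_dc, entity_type)
    checkpoint: dict holding 'last_seq', the persisted high-water mark
    apply_txn:  callable applying one event in a target-DB transaction
    """
    for event in sorted(events, key=lambda e: e["seq"]):
        if event["seq"] <= checkpoint["last_seq"]:
            continue  # replayed after a restart; already applied
        try:
            apply_txn(event)                  # step 5: transactional apply
            checkpoint["last_seq"] = event["seq"]  # advance only on success
        except Exception:
            dead_letters.append(event)        # park for operator review
            break                             # stop to preserve ordering
```

Stopping on first failure is a deliberate choice here: continuing past a failed event would apply later updates out of order for the same entity.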

Failure Handling

Split-brain across DCs: When connectivity between DCs drops, each continues accepting local writes. On reconnect, the service replays the backlog from the last checkpoint, applying conflict policies to every divergent record. A merge summary report is generated for audit.

Partial writes: Idempotency is enforced by storing event_id in a deduplication table on the target side. Re-delivered events are silently skipped if already applied.
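The dedup mechanism can be sketched with SQLite standing in for the target database; table and column names here are illustrative. The key point is that the event_id insert and the row change commit in the same transaction, so a crash between them can never record an event as applied without its effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applied_event (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")

def apply_once(event_id, entity_id, balance):
    """Apply the event unless event_id was already recorded."""
    with conn:  # one transaction: dedup insert + upsert commit together
        cur = conn.execute(
            "INSERT OR IGNORE INTO applied_event VALUES (?)", (event_id,))
        if cur.rowcount == 0:
            return False  # redelivery: already applied, skip silently
        conn.execute(
            "INSERT INTO account VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET balance = excluded.balance",
            (entity_id, balance))
        return True
```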

Network partitions: The message bus (Kafka) retains events for a configurable retention window (default 7 days). Consumers reconnect and resume from their last checkpoint. Lag alerts fire when a consumer falls more than N seconds behind the producer.

Schema drift: A schema registry enforces forward and backward compatibility. Incompatible changes are rejected at publish time, preventing silent data corruption at consumers.
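A backward-compatibility check in the spirit of such a registry can be sketched as follows; real registries (e.g. Confluent's, using Avro or Protobuf resolution rules) are considerably richer, and the field-spec dicts here are an assumption for illustration.

```python
def is_backward_compatible(old_fields, new_fields):
    """Check that data written under old_fields still decodes under new_fields.

    old_fields / new_fields: name -> {'required': bool} mappings.
    """
    # Every required old field must survive into the new schema.
    for name, spec in old_fields.items():
        if spec["required"] and name not in new_fields:
            return False
    # Any newly added field must be optional, so old records still decode.
    for name, spec in new_fields.items():
        if name not in old_fields and spec["required"]:
            return False
    return True
```

Publishing a schema version that fails this check would be rejected at publish time, before any producer can emit events in the new shape.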

Scalability Considerations

  • Parallelism: Each (entity_type, shard) combination maps to an independent Kafka partition and consumer thread, enabling linear throughput scaling.
  • Batching: The target writer accumulates up to 500 events per batch and issues a single bulk-upsert, reducing round-trip overhead by an order of magnitude.
  • Throttling: A token-bucket rate limiter on the target writer protects production databases from sync storms during catch-up bursts after an outage.
  • Observability: Per-DC lag, conflict rate, and dead-letter queue depth are exported to a central metrics store and visualized on a real-time dashboard.
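The token-bucket throttle from the bullets above can be sketched in a few lines; the rate and burst parameters are illustrative, not values from this design.

```python
class TokenBucket:
    """Token-bucket limiter for the target writer's apply rate."""

    def __init__(self, rate, burst):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum bucket size (allowed burst)
        self.tokens = burst   # start full
        self.last = 0.0       # timestamp of the last refill

    def allow(self, now, cost=1):
        """Refill by elapsed time, then spend `cost` tokens if available."""
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should back off before retrying the batch
```

During post-outage catch-up the consumer keeps draining the backlog, but each batch must first clear allow(), capping the write pressure on the production target.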

Summary

A cross-datacenter sync service turns raw change streams into reliable, conflict-aware state convergence. The key design choices are CDC-based capture for low-latency detection, per-entity conflict policies for business-aligned resolution, and checkpoint-based consumers combined with target-side idempotency for effectively-once processing. Scaling is achieved through partition-level parallelism and batched writes, while operational safety depends on dead-letter queues, schema registries, and comprehensive lag monitoring.
