Low Level Design: Global Database Design

What Is a Global Database Design?

A globally distributed database serves reads and writes from multiple geographic regions while providing a unified data interface to applications. The goal is to minimize read latency for every user on the planet, tolerate regional outages without data loss, and maintain predictable consistency guarantees. This design is foundational to products like global e-commerce platforms, social networks, and real-time collaboration tools. It draws on principles from Google Spanner, Amazon Aurora Global, and CockroachDB.

Data Model and Schema

global_kv_store — the core versioned key-value layer that all higher-level tables are built on:

CREATE TABLE global_kv_store (
  namespace     VARCHAR(64)   NOT NULL,
  `key`         VARCHAR(256)  NOT NULL,
  value         BLOB          NOT NULL,
  version       BIGINT        NOT NULL,
  region_origin VARCHAR(32)   NOT NULL,
  hybrid_ts     BIGINT        NOT NULL,
  is_tombstone  BOOLEAN       NOT NULL DEFAULT FALSE,
  PRIMARY KEY (namespace, `key`, version DESC)
);

The hybrid_ts column stores a Hybrid Logical Clock (HLC) timestamp that combines physical wall time with a monotonic counter, enabling causal ordering without requiring perfectly synchronized clocks.
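The HLC update rules can be sketched as follows. This is a minimal illustration of the send/receive logic, assuming a millisecond-resolution physical clock; class and method names are illustrative, and a production system would pack (wall, logical) into the single BIGINT hybrid_ts column.

```python
# Sketch of a Hybrid Logical Clock: wall tracks the highest physical time
# observed, logical breaks ties for events within the same millisecond.
import time

class HybridLogicalClock:
    def __init__(self, now=time.time):
        self._now = now      # injectable physical clock source (seconds)
        self.wall = 0        # highest physical time seen so far (ms)
        self.logical = 0     # monotonic counter for same-ms events

    def _physical_ms(self):
        return int(self._now() * 1000)

    def send(self):
        """Timestamp a local event or outgoing message."""
        pt = self._physical_ms()
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def receive(self, remote_wall, remote_logical):
        """Merge a remote timestamp so causal order is preserved."""
        pt = self._physical_ms()
        if pt > self.wall and pt > remote_wall:
            self.wall, self.logical = pt, 0
        elif remote_wall > self.wall:
            self.wall, self.logical = remote_wall, remote_logical + 1
        elif self.wall > remote_wall:
            self.logical += 1
        else:  # equal wall components: advance past both counters
            self.logical = max(self.logical, remote_logical) + 1
        return (self.wall, self.logical)
```

Because receive() always advances past the remote timestamp, an event that causally follows another is guaranteed a larger (wall, logical) pair even when the physical clocks disagree.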

region_topology — live map of regions, their roles, and replication lag:

CREATE TABLE region_topology (
  region_id    VARCHAR(32)   NOT NULL,
  role         ENUM('PRIMARY','SECONDARY','WITNESS') NOT NULL,
  endpoints    JSON          NOT NULL,
  current_lag_ms INT         NOT NULL DEFAULT 0,
  healthy      BOOLEAN       NOT NULL DEFAULT TRUE,
  last_seen    DATETIME(6)   NOT NULL,
  PRIMARY KEY (region_id)
);

distributed_txn_log — two-phase commit coordination record:

CREATE TABLE distributed_txn_log (
  txn_id       CHAR(36)      NOT NULL,
  coordinator  VARCHAR(32)   NOT NULL,
  participants JSON          NOT NULL,
  status       ENUM('PREPARE','COMMIT','ABORT') NOT NULL,
  started_at   DATETIME(6)   NOT NULL,
  resolved_at  DATETIME(6),
  PRIMARY KEY (txn_id)
);

Core Algorithm and Workflow

Read path: The client library routes each read to the nearest region that satisfies the requested consistency level. For stale reads, any healthy secondary suffices. For bounded-staleness reads, the router checks current_lag_ms in region_topology and selects the nearest region within the bound. For strong reads, the request is routed to the primary region, or a quorum read is performed across a majority of regions.
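The routing decision above can be sketched as a small selection function. Field names mirror the region_topology table; the rtt_ms field (a client-measured round-trip time used to pick the nearest candidate) is an assumption, and quorum reads are omitted for brevity.

```python
# Illustrative region selection for the tiered-consistency read path.
def pick_read_region(regions, consistency, max_staleness_ms=None):
    """regions: dicts with region_id, role, current_lag_ms, healthy,
    and an assumed client-measured rtt_ms. Returns a region_id."""
    healthy = [r for r in regions if r["healthy"]]
    if consistency == "strong":
        # Strong reads go to the primary (quorum path not shown).
        candidates = [r for r in healthy if r["role"] == "PRIMARY"]
    elif consistency == "bounded":
        candidates = [r for r in healthy
                      if r["role"] == "PRIMARY"
                      or r["current_lag_ms"] <= max_staleness_ms]
    else:  # "stale": any healthy region suffices
        candidates = healthy
    # Among valid candidates, prefer the lowest round-trip time.
    return min(candidates, key=lambda r: r["rtt_ms"])["region_id"]
```

Note that the primary is always a valid candidate for bounded-staleness reads, so a request never fails outright just because every secondary is lagging.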

Write path (single region): Writes go to the primary, increment the HLC, append to the replication stream, and fan out to secondaries asynchronously. The primary acknowledges the client after a local durable commit.

Write path (cross-region transactions): Multi-region transactions use a two-phase commit coordinated by the region closest to the majority of participating shards. The coordinator writes a PREPARE record to distributed_txn_log, collects votes from all participant regions, then broadcasts COMMIT or ABORT and updates the log. Paxos-based consensus (per shard) ensures no two coordinators commit conflicting transactions.
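The coordinator's side of the protocol can be sketched as below. This is a minimal single-process illustration, assuming each participant exposes prepare/commit/abort calls; the log dict stands in for the distributed_txn_log table, and its values match that table's status ENUM.

```python
# Minimal two-phase-commit coordinator loop (durability and retries omitted).
def run_2pc(txn_id, participants, log):
    log[txn_id] = "PREPARE"                       # durable intent before voting
    votes = [p.prepare(txn_id) for p in participants]
    decision = "COMMIT" if all(votes) else "ABORT"
    log[txn_id] = decision                        # the decision point
    for p in participants:                        # phase two: broadcast outcome
        p.commit(txn_id) if decision == "COMMIT" else p.abort(txn_id)
    return decision

class Participant:
    """Stub participant; a real one would persist its vote before replying."""
    def __init__(self, vote_yes=True):
        self.vote_yes = vote_yes
        self.state = None
    def prepare(self, txn_id):
        self.state = "PREPARED"
        return self.vote_yes
    def commit(self, txn_id):
        self.state = "COMMITTED"
    def abort(self, txn_id):
        self.state = "ABORTED"
```

The key invariant is that the decision is logged before any participant learns it, which is what lets participants recover the outcome from the log if the coordinator fails mid-broadcast.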

Clock synchronization: TrueTime or NTP with bounded uncertainty ensures HLC timestamps are globally comparable within a small epsilon. Transactions that touch data whose uncertainty windows overlap are serialized conservatively.
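The conservative serialization mentioned above is often implemented as a commit wait, sketched here under the assumption of a TrueTime-style clock API that reports a bounded uncertainty interval [earliest, latest]; the function and parameter names are illustrative.

```python
# Commit wait: block until the commit timestamp is definitely in the past
# at every region, so no later transaction can observe a smaller timestamp.
def commit_wait(commit_ts, now_interval, sleep):
    """now_interval() -> (earliest, latest) bounds on true time;
    sleep(d) blocks for roughly d time units."""
    earliest, _ = now_interval()
    while earliest <= commit_ts:
        # Wait out the remaining uncertainty before making the commit visible.
        sleep(commit_ts - earliest)
        earliest, _ = now_interval()
    return earliest
```

The wait is bounded by the clock uncertainty epsilon, which is why tight clock synchronization directly reduces write latency in this design.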

Failure Handling

Regional outage: If the primary region fails, a consensus election among the remaining healthy regions promotes a new primary. The region_topology table is updated via a separate highly available metadata service (e.g., a Raft group). In-flight two-phase-commit transactions whose coordinators were in the failed region are recovered by participants after a lease timeout.

Split-brain prevention: Each region holds a time-bounded lease from the metadata service. A region without a valid lease refuses to accept writes, preventing divergent primaries.
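A minimal sketch of that lease gate, assuming the metadata service grants time-bounded leases that the region renews well before expiry; the class name and renewal API are illustrative.

```python
# Lease gate for split-brain prevention: a region with no valid lease
# refuses writes rather than risk becoming a divergent primary.
import time

class LeaseGuard:
    def __init__(self, now=time.monotonic):
        self._now = now          # injectable monotonic clock
        self.expires_at = 0.0    # no lease held initially

    def renew(self, duration_s):
        """Called after a successful grant from the metadata service."""
        self.expires_at = self._now() + duration_s

    def may_accept_writes(self):
        # Strictly-less-than: an expired lease is treated as absent.
        return self._now() < self.expires_at
```

Using a monotonic clock here matters: a wall-clock step (e.g. an NTP correction) must not make an expired lease look valid again.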

Partial replication lag: If a secondary lags beyond a configurable threshold, the router stops sending bounded-staleness reads to it and marks it degraded in region_topology. Automated remediation attempts to catch the replica up; if it fails within SLA, an alert escalates to on-call engineers.

Network partitions: A partitioned secondary continues serving stale reads if configured for eventual consistency. It buffers incoming CDC events locally and replays them in order upon reconnection, applying HLC-based conflict resolution for any concurrent writes that occurred during the partition.
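The replay-and-resolve step can be sketched as last-writer-wins on the HLC timestamp, using the hybrid_ts and region_origin columns from global_kv_store; the tie-break on region id is an assumption, chosen only so every replica converges on the same winner.

```python
# HLC-based conflict resolution during partition recovery.
def resolve(local, remote):
    """Each version is a dict with hybrid_ts, region_origin, value.
    Deterministic on every replica: higher HLC wins, region id breaks ties."""
    lk = (local["hybrid_ts"], local["region_origin"])
    rk = (remote["hybrid_ts"], remote["region_origin"])
    return local if lk >= rk else remote

def replay(store, buffered_events):
    """Apply buffered CDC events in HLC order after reconnection."""
    for ev in sorted(buffered_events, key=lambda e: e["hybrid_ts"]):
        key = ev["key"]
        store[key] = resolve(store[key], ev) if key in store else ev
    return store
```

Because resolve() compares the same ordered pair everywhere, two replicas that exchange the same event sets always converge, regardless of arrival order.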

Scalability Considerations

  • Horizontal sharding: The namespace + key space is range-partitioned into tablets of ~10 GB each. Tablets split automatically when they exceed capacity and are rebalanced across nodes by a central tablet server manager.
  • Read scaling: Secondary regions absorb the majority of global read traffic. Adding a new region requires only provisioning nodes and pointing them at the replication stream; no application changes are needed.
  • Write scaling: Independent namespaces shard independently, so a high-write namespace does not contend with others. Within a namespace, hot-key mitigation uses write coalescing and micro-sharding.
  • Cost optimization: Witness regions store only metadata and participate in consensus votes without holding full data copies, reducing cross-region storage costs while maintaining quorum availability.
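The range-partitioning and split behavior from the first bullet can be sketched with a sorted list of split keys; the class is a toy (real systems store this map in the metadata service and split on measured tablet size, not on an explicit call).

```python
# Range-partitioned tablet lookup with splitting. Tablet i serves the
# half-open key range [split_keys[i-1], split_keys[i]).
import bisect

class TabletMap:
    def __init__(self, split_keys):
        self.split_keys = sorted(split_keys)

    def tablet_for(self, key):
        """Index of the tablet responsible for key (binary search)."""
        return bisect.bisect_right(self.split_keys, key)

    def split(self, at_key):
        """Split the covering tablet at at_key, e.g. when it nears ~10 GB."""
        bisect.insort(self.split_keys, at_key)
```

Lookups stay O(log n) in the number of tablets, and a split only inserts one boundary, so routing state changes are cheap to propagate.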

Summary

A global database design combines Hybrid Logical Clocks for causal ordering, Paxos/Raft-based consensus for durability, and a tiered consistency model so applications pay only for the guarantees they need. The routing layer, fed by live topology metadata, dynamically selects the best region for every request. Failure recovery relies on leased primaries, two-phase-commit recovery protocols, and automated replica remediation. Together these mechanisms deliver planet-scale availability with strong correctness guarantees.


