System Design: Sharding and Data Partitioning Explained


Sharding (horizontal partitioning) splits a large dataset across multiple database nodes so no single node holds all the data. It is the primary scaling strategy for write-heavy workloads that exceed single-node capacity.

Why Shard?

  • Write throughput: a single PostgreSQL master handles ~10K-50K writes/sec. Sharding multiplies this linearly.
  • Storage: a 10TB dataset across 10 shards = 1TB per node, fitting in fast SSD tiers.
  • Read parallelism: scatter-gather queries fan out to all shards and merge results.

Partitioning Strategies

Range Partitioning

Assign rows to shards based on value ranges of the shard key (e.g., user_id 1-1M → shard 0, 1M-2M → shard 1). Simple to implement and supports range scans efficiently. Risk: hot shards if data is skewed (e.g., new users all land on the latest shard).
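Routing for range partitioning can be sketched as a sorted list of boundary values plus a binary search; the boundary values below are illustrative, not prescriptive:

```python
import bisect

# Illustrative boundaries: shard i holds keys below SHARD_BOUNDARIES[i]
# (shard 0: [0, 1M), shard 1: [1M, 2M), shard 2: [2M, 3M), shard 3: the rest).
SHARD_BOUNDARIES = [1_000_000, 2_000_000, 3_000_000]

def range_shard(user_id: int) -> int:
    """Return the index of the shard whose range contains user_id."""
    return bisect.bisect_right(SHARD_BOUNDARIES, user_id)

def shards_for_range(lo: int, hi: int) -> range:
    """A range scan only touches the shards overlapping [lo, hi]."""
    return range(range_shard(lo), range_shard(hi) + 1)
```

Range scans stay cheap because adjacent keys share a shard; the tradeoff is skew when inserts cluster in one range.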

Hash Partitioning

shard = hash(key) % N. Distributes load uniformly when keys are random. Range scans require hitting all shards. Adding shards requires rehashing — mitigated by consistent hashing.
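A minimal consistent-hash ring with virtual nodes might look like the sketch below; the vnode count and the MD5-based hash are arbitrary choices for illustration:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of a consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node owns `vnodes` scattered points on the ring.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key):
        # Walk clockwise to the first ring point at or after hash(key).
        idx = bisect.bisect_left(self._points, self._hash(key)) % len(self._points)
        return self._ring[idx][1]
```

Because each node owns many scattered ring points, adding a node steals a roughly even fraction of keys from every existing node, instead of remapping nearly everything the way `hash(key) % N` does when N changes.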

Directory-Based Partitioning

A lookup service maps each key to its shard. Maximum flexibility (move individual keys), but the directory is a single point of failure and a bottleneck. Used by DynamoDB’s partition metadata layer.
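In miniature, a directory is a key-to-shard map with a default placement rule; this toy version (the class and method names are invented) shows why it can move individual keys, which pure hash partitioning cannot:

```python
class ShardDirectory:
    """Toy key-to-shard directory; a real one is replicated and heavily cached."""

    def __init__(self, num_shards):
        self._num_shards = num_shards
        self._overrides = {}  # explicit placements for individual keys

    def shard_for(self, key):
        # Default placement by hash; an override wins, so hot keys can be moved.
        return self._overrides.get(key, hash(key) % self._num_shards)

    def move(self, key, shard):
        # Relocate a single key without touching any other key's placement.
        self._overrides[key] = shard
```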

Choosing a Shard Key

The shard key determines everything: access patterns, hot spots, join complexity.

  • High cardinality: enough distinct values to distribute across shards (user_id: good; country_code: bad for 10+ shards).
  • Query alignment: most queries should include the shard key to avoid scatter-gather. For a social app, shard by user_id so “get user’s posts” hits one shard.
  • Write distribution: avoid monotonically increasing keys (timestamps, auto-increment IDs) with range partitioning; every new write lands on the trailing shard. Hash the key, add a random prefix, or use UUID v4.
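The random-prefix fix for time-ordered keys can be sketched as follows; the bucket count is an arbitrary assumption for the example:

```python
import random
import time

NUM_BUCKETS = 8  # arbitrary: writes spread over 8 key ranges instead of one

def prefixed_key(ts=None):
    """Prefix a monotonically increasing timestamp so that range-partitioned
    writes spread across NUM_BUCKETS key ranges rather than one trailing shard."""
    ts = time.time() if ts is None else ts
    return f"{random.randrange(NUM_BUCKETS)}#{ts}"
```

The cost is on the read side: a query for a time window must now fan out over all NUM_BUCKETS prefixes and merge the results.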

Cross-Shard Queries and Joins

Joins across shards require application-level aggregation. Strategies:

  • Denormalize: embed frequently joined fields into the primary entity (store username in every post row).
  • Scatter-gather: fan out query to all shards, merge results in the application layer. Works for aggregates (COUNT, SUM) but is expensive.
  • Secondary index shards: maintain a separate index shard that maps secondary keys (email → user_id). Two-hop reads: index shard → data shard.
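The scatter-gather pattern for an aggregate can be sketched like this, with each shard stood in for by a plain list of rows rather than a real connection:

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather_count(shards, predicate):
    """Fan a filtered COUNT out to every shard in parallel, then merge.
    `shards` stands in for per-shard connections; here each is a list of rows."""
    def count_on(shard):
        return sum(1 for row in shard if predicate(row))

    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        return sum(pool.map(count_on, shards))
```

Note that end-to-end latency is the maximum of the per-shard latencies, so one slow shard delays the whole query.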

Hotspot Mitigation


import random

# Problem: celebrity user with 10M followers causes hot shard
# Solution 1: Write splitting — distribute writes across N virtual shards
def shard_key_for_celebrity(user_id, num_splits=10):
    # Writes spread across: user_id#0, user_id#1, ..., user_id#9
    return f"{user_id}#{random.randint(0, num_splits - 1)}"

# Reads must aggregate across all splits
# (get_shard is a placeholder for the routing layer that returns a shard client)
def get_follower_count(user_id, num_splits=10):
    counts = [get_shard(f"{user_id}#{i}").follower_count for i in range(num_splits)]
    return sum(counts)

# Solution 2: Read replicas per shard — scale reads without touching the write path

Resharding

When a shard grows too large or traffic skews, you must reshard. Approaches:

  • Double-write: write to old and new shard layout simultaneously, backfill old data, cut over reads, stop writing to old layout.
  • Consistent hashing with virtual nodes: add a new node; it takes a fraction of keys from all existing nodes. Only O(K/N) keys move.
  • Logical shards: provision more logical shards than physical nodes (e.g., 1024 logical shards on 8 nodes = 128 per node). Resharding becomes rebalancing logical shards — no data movement for range-split, just metadata update.
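The logical-shard indirection above can be sketched as a stable key-to-logical-shard hash plus a mutable logical-to-physical map; 1024 shards and round-robin assignment are just the example's numbers:

```python
from zlib import crc32

NUM_LOGICAL = 1024  # fixed for the system's lifetime; only the mapping changes

def logical_shard(key):
    # Stable: a key's logical shard never changes, so data is never rehashed.
    return crc32(key.encode()) % NUM_LOGICAL

def build_mapping(num_nodes):
    """Assign logical shards to physical nodes round-robin."""
    return [i % num_nodes for i in range(NUM_LOGICAL)]

def node_for(key, mapping):
    return mapping[logical_shard(key)]
```

Resharding then means editing the mapping (and copying only the affected logical shards between nodes), never recomputing `logical_shard`.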

Sharding in Real Systems

  • MongoDB: hash or range partitioning on the shard key. Config servers store chunk metadata; mongos routes queries.
  • Cassandra: consistent hashing on a token ring, with 256 virtual nodes per physical node by default; replication factor typically 3.
  • MySQL (Vitess): hash on the primary key. Vitess adds a sharding proxy layer above MySQL and supports online resharding.
  • DynamoDB: hash on the partition key plus an optional sort key. A partition splits automatically when it exceeds 10GB or 3000 RCU / 1000 WCU.
  • Redis Cluster: 16384 fixed hash slots, assigned by CRC16(key) % 16384; slots migrate online via CLUSTER SETSLOT.

Interview Tips

  • Always ask about the primary query pattern before choosing a shard key.
  • State the tradeoff: hash = uniform distribution but no range scans; range = efficient scans but hotspot risk.
  • Mention consistent hashing when discussing adding/removing nodes.
  • For social platforms: shard users by user_id, posts by (user_id, created_at) to keep a user’s posts together.

Frequently Asked Questions

What is the difference between horizontal and vertical partitioning?

Vertical partitioning splits a table by columns, moving infrequently accessed columns to a separate table. Horizontal partitioning (sharding) splits a table by rows; each shard holds a disjoint subset of rows with the same schema. Sharding scales write throughput across nodes, while vertical partitioning reduces row size and improves cache hit rates but stays on a single node.

What is a hot shard and how do you fix it?

A hot shard receives disproportionate traffic, e.g., a celebrity user on a social platform or a viral product in an e-commerce system. Fixes: (1) write splitting: append a random suffix to the shard key, scatter writes across N virtual keys, and aggregate on read; (2) dedicated shard: move high-traffic entities to their own shard; (3) caching: serve reads from a cache tier (Redis) so they never touch the shard; (4) repartitioning: choose a finer-grained shard key.

How does DynamoDB handle automatic resharding?

DynamoDB splits a partition automatically when it exceeds 10GB or sustains more than 3,000 read capacity units or 1,000 write capacity units. The split is transparent: the partition key space is divided at a midpoint and data migrates to a new partition. This is why low-cardinality or time-based partition keys cause hot partitions: concurrent writes share the same key value and therefore land on the same partition. Solution: add a random prefix to distribute writes across partitions.

Why are cross-shard joins problematic?

In a sharded database, the data for a join (e.g., orders JOIN users) lives on different nodes, so the database cannot perform a local join. The application must query each shard independently and join in memory, i.e., scatter-gather. For large result sets this is expensive and latency-additive. Solutions: denormalize (embed user fields in orders), maintain a global secondary index shard, or align the schema with the primary access pattern so queries always hit a single shard.

What is a logical shard and how does it simplify resharding?

A logical shard is a fixed virtual partition; many logical shards map onto each physical node. For example, 1024 logical shards on 8 physical nodes gives 128 per node. When you add a ninth node, you rebalance logical shards, moving roughly 113 of them, without restructuring the keyspace. Cassandra's virtual nodes implement this idea: 256 vnode tokens per physical node, so adding a node takes tokens from all existing nodes proportionally.
