Low Level Design: Shard and Partition Key Design

Sharding (horizontal partitioning) splits a large dataset across multiple database nodes to scale beyond what a single machine can handle. The partition key (shard key) determines which node stores each record. A poor shard key creates hot spots — one shard receives disproportionate traffic while others sit idle. A good shard key distributes load evenly, minimizes cross-shard queries, and allows the system to scale. MongoDB sharding, Cassandra partition keys, DynamoDB partition keys, and Vitess MySQL sharding all require careful shard key design.

Hash Partitioning vs. Range Partitioning

Hash partitioning: apply a hash function to the shard key; the result modulo N determines the shard. This gives uniform distribution, with no hot spots for random access patterns. Problems: range queries must fan out to all shards (no locality for sequential scans), and adding or removing shards requires re-hashing a large fraction of the data (consistent hashing mitigates this).

Range partitioning: divide the key space into ordered ranges; each shard owns one range (shard 1: A-M, shard 2: N-Z; or by timestamp: shard 1: Jan-Mar, shard 2: Apr-Jun). Range scans within a shard are efficient. Problem: sequential inserts (timestamps, auto-increment IDs) all land on the last shard, creating a write hot spot. Google Bigtable and HBase use range partitioning with automatic shard splits.
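The resharding cost of naive hash-mod-N versus a consistent-hash ring can be demonstrated empirically. The sketch below (illustrative only; shard names, vnode count, and key set are made up) routes 10,000 keys first across 4 shards, then across 5, and counts how many keys change shards under each scheme:

```python
import hashlib
from bisect import bisect

def stable_hash(key: str) -> int:
    # md5 gives a stable, well-mixed hash (Python's built-in hash() is salted per process)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def mod_n_shard(key: str, n: int) -> int:
    # Naive hash partitioning: shard = hash(key) mod N
    return stable_hash(key) % n

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (hypothetical sketch)."""
    def __init__(self, shards, vnodes=100):
        self.ring = sorted(
            (stable_hash(f"{s}#{v}"), s) for s in shards for v in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def shard_for(self, key: str) -> str:
        # A key belongs to the first ring point at or after its hash (wrapping around)
        i = bisect(self.points, stable_hash(key)) % len(self.ring)
        return self.ring[i][1]

keys = [f"user-{i}" for i in range(10_000)]

# mod-N: growing from 4 to 5 shards remaps most keys (~80% here)
moved_mod = sum(mod_n_shard(k, 4) != mod_n_shard(k, 5) for k in keys)

# consistent hashing: only ~K/n keys (~20% here) move to the new shard
ring4 = ConsistentHashRing(["s0", "s1", "s2", "s3"])
ring5 = ConsistentHashRing(["s0", "s1", "s2", "s3", "s4"])
moved_ring = sum(ring4.shard_for(k) != ring5.shard_for(k) for k in keys)

print(moved_mod, moved_ring)  # roughly 8000 vs 2000
```

Virtual nodes (100 per shard here) smooth out the otherwise uneven ranges each shard would own on the ring.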

-- Bad shard key: timestamp (write hot spot — all new rows go to one shard)
-- Bad shard key: sequential user_id (hot spot on highest user IDs)
-- Bad shard key: low-cardinality field like status or country_code (too few shards)

-- Good shard key examples:

-- DynamoDB: composite key for multi-tenant SaaS
-- Partition key: tenant_id (distributes tenants across partitions)
-- Sort key: created_at (enables range scans within a tenant)
CREATE TABLE events (
    tenant_id VARCHAR(50),  -- partition key: high cardinality, even distribution
    created_at TIMESTAMP,   -- sort key: range queries within a tenant
    event_id UUID,
    data JSONB,
    PRIMARY KEY (tenant_id, created_at, event_id)
);

-- Cassandra: time-series with bucketed partition key to avoid large partitions
-- Problem: one partition per device_id grows unboundedly over time
-- Solution: bucket by time period to cap partition size
-- Partition key: (device_id, bucket) where bucket = YYYY-MM-DD
-- Sort key: timestamp (efficient time range queries within a day)
CREATE TABLE sensor_readings (
    device_id TEXT,
    bucket TEXT,   -- e.g. '2024-01-15': the composite partition key (device_id, bucket) caps each partition at one day
    ts TIMESTAMP,
    value DOUBLE,
    PRIMARY KEY ((device_id, bucket), ts)
);
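Computing the composite partition key is a one-line application-side concern. A minimal sketch (the `partition_key` helper and the day-granularity bucket are assumptions, not Cassandra API):

```python
from datetime import datetime, timezone

def partition_key(device_id: str, ts: datetime) -> tuple[str, str]:
    """Hypothetical helper: compute (device_id, bucket) for a sensor reading.

    Bucketing by UTC day caps each partition at one day of data per device,
    keeping partitions under Cassandra's practical size limits
    (~100 MB / ~100k rows as a common rule of thumb).
    """
    bucket = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return device_id, bucket

ts = datetime(2024, 1, 15, 13, 45, tzinfo=timezone.utc)
print(partition_key("device-123", ts))  # ('device-123', '2024-01-15')
```

The trade-off: a time-range query spanning multiple days must issue one query per bucket and merge the results, so the bucket granularity should match the dominant query window.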

Hot Spot Detection and Mitigation

A hot shard is one that receives significantly more reads or writes than the others. Detection: monitor per-shard CPU, IOPS, and request rate; a hot shard typically shows 5-10x the average load. Common causes: a low-cardinality shard key (few unique values) or a single high-traffic entity (a viral user, a trending product). Mitigation strategies:

  • Key salting: append a random suffix to the shard key (user_id + rand(0..9) → 10 sub-shards per user). Reads must fan out to all sub-shards and aggregate.
  • Compound shard key: combine two fields (user_id + content_type) to create more unique shard keys.
  • Application-level sharding: split a hot entity (a celebrity Twitter account) across multiple shards and merge in the application.
  • Dedicated shard: move the hot entity to its own shard with extra capacity.
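Key salting in miniature: the sketch below (an in-memory stand-in; the store, helper names, and salt count of 10 are all assumptions) shows the write path scattering a hot key across sub-shards and the read path paying the fan-out-and-merge cost:

```python
import random

N_SALTS = 10  # sub-shards per hot key (assumption; tune to the observed skew)

def salted_write_key(user_id: str) -> str:
    # Each write picks one sub-shard at random, spreading the hot key's load
    return f"{user_id}#{random.randrange(N_SALTS)}"

def all_read_keys(user_id: str) -> list[str]:
    # Reads must enumerate every sub-shard and merge the results
    return [f"{user_id}#{i}" for i in range(N_SALTS)]

# Hypothetical per-shard store keyed by the salted key
store: dict[str, list[dict]] = {}

def write_event(user_id: str, event: dict) -> None:
    store.setdefault(salted_write_key(user_id), []).append(event)

def read_events(user_id: str) -> list[dict]:
    merged: list[dict] = []
    for k in all_read_keys(user_id):
        merged.extend(store.get(k, []))
    return merged

for i in range(100):
    write_event("celebrity-42", {"seq": i})
print(len(read_events("celebrity-42")))  # 100
```

Salting is a write-optimized trade: writes stay O(1) while every read costs N sub-shard lookups, so it suits write-heavy hot keys.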

Cross-Shard Queries and Joins

Sharding makes cross-shard queries expensive: a query that touches data on multiple shards must fan out requests to all relevant shards and aggregate the results in the application layer (scatter-gather). This is why sharding trades query flexibility for scalability. Strategies to minimize cross-shard queries:

  • Co-location: store related data on the same shard. In a multi-tenant SaaS, shard by tenant_id so all of a tenant's data lives on one shard, eliminating cross-shard queries for tenant-scoped operations.
  • Denormalization: duplicate data that would otherwise require a cross-shard join, e.g. store the author's name alongside each post rather than joining to a users shard.
  • Global secondary indexes: DynamoDB global secondary indexes provide an alternative access pattern across all partitions (with eventually consistent reads).
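A scatter-gather query in miniature. This sketch (in-memory lists standing in for shards; the data and predicate are made up) fans a filter out to every shard in parallel, then merges and sorts in the application layer, since no single shard can produce the global ordering:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard contents: order rows partitioned by hash(customer_id)
SHARDS = [
    [{"customer_id": "c1", "total": 30}, {"customer_id": "c2", "total": 5}],
    [{"customer_id": "c3", "total": 12}],
    [{"customer_id": "c1", "total": 8}, {"customer_id": "c4", "total": 99}],
]

def query_shard(shard, min_total):
    # Stand-in for a network call to one shard
    return [row for row in shard if row["total"] >= min_total]

def scatter_gather(min_total):
    # Scatter: query every shard in parallel
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        results = pool.map(query_shard, SHARDS, [min_total] * len(SHARDS))
    # Gather: merge partial results, then apply the global sort
    merged = [row for part in results for row in part]
    return sorted(merged, key=lambda r: r["total"], reverse=True)

print(scatter_gather(10))  # totals 99, 30, 12 in descending order
```

Note that latency is governed by the slowest shard (tail latency), which is why scatter-gather fan-out is the pattern to design shard keys around, not rely on.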

Key Interview Discussion Points

  • Resharding: as data grows, adding shards requires redistributing data; consistent hashing minimizes resharding by moving only K/n keys (K = total keys, n = new shard count) instead of all keys; Vitess handles MySQL resharding online without downtime
  • Directory-based sharding: maintain a lookup table mapping shard keys to shards; more flexible than hash/range but adds a lookup hop on every request (the directory becomes a bottleneck/single point of failure)
  • Tenant isolation in SaaS: single-tenant-per-shard (full isolation, no noisy neighbors, easy per-tenant backup/compliance) vs. multi-tenant-per-shard (more efficient but noisy neighbor risk)
  • Write sharding for counters: a single counter (page view count) cannot be sharded by key; use write sharding (update one of N counter shards at random; read sums all N) to distribute counter writes
  • Shard exhaustion: if shard keys are assigned sequentially and the range is too small, all values cluster in a few shards; use UUID v4 or a hash prefix to spread new records randomly
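The write-sharded counter from the list above can be sketched in a few lines (in-memory buckets standing in for rows on separate shards; the shard count of 8 is an arbitrary assumption):

```python
import random

N_COUNTER_SHARDS = 8  # assumption; more shards = more write throughput

# One bucket per counter shard (stand-ins for rows living on different shards)
shards = [0] * N_COUNTER_SHARDS

def increment() -> None:
    # Each write lands on one randomly chosen counter shard,
    # so no single row absorbs every increment
    shards[random.randrange(N_COUNTER_SHARDS)] += 1

def read_count() -> int:
    # Reads pay the cost: sum all N shards to get the true total
    return sum(shards)

for _ in range(1000):
    increment()
print(read_count())  # 1000
```

This trades read cost (N lookups plus a sum) for write throughput, the right trade for high-write, low-read counters like page views.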