Database sharding is the technique of distributing data across multiple database instances to handle datasets and query volumes that exceed a single server's capacity. Sharding is one of the most critical topics in system design interviews because it touches on data modeling, query routing, consistency, and operational complexity. This guide covers shard key selection, resharding strategies, and production sharding tools like Vitess and Citus.
When to Shard
Sharding is a last resort, not a first optimization. Before sharding, exhaust these options: (1) Vertical scaling — upgrade to a larger server (more CPU, RAM, faster SSD). A single PostgreSQL instance can handle billions of rows and thousands of transactions per second with proper indexing. (2) Read replicas — offload read queries to replicas. Most applications are read-heavy (90%+ reads). Read replicas handle the load without sharding. (3) Caching — cache frequently accessed data in Redis. This eliminates database queries entirely for hot data. (4) Query optimization — missing indexes, N+1 queries, and full table scans cause more performance problems than data volume. (5) Table partitioning — partition large tables by date range (orders_2024_q1, orders_2024_q2). The database handles routing internally without application changes. Shard when: a single server cannot store the dataset (data exceeds available disk/RAM), write throughput exceeds what one server can handle, or regulatory requirements mandate data residency (EU data in EU shards, US data in US shards).
Shard Key Selection
The shard key determines which shard holds each row. It is the most important sharding decision and cannot easily be changed later. Criteria for a good shard key: (1) Even data distribution — the key should distribute data uniformly across shards. user_id with hash-based routing distributes evenly. country_code does not (the US shard would be 10x larger than others). (2) Query locality — queries should access a single shard whenever possible. If most queries include user_id in the WHERE clause, sharding by user_id means each query hits one shard. Scatter-gather queries (no shard key in the WHERE clause) hit all shards and are expensive. (3) Growth accommodation — the key should support future data growth without creating hot shards. An auto-incrementing order_id with range-based sharding creates a hot shard (all new orders go to the latest range). Hash-based sharding distributes new orders evenly. Common shard keys: user_id (social networks, SaaS), tenant_id (multi-tenant applications), order_id with hashing (e-commerce), geographic region (data residency requirements).
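The routing rule implied above can be sketched in a few lines. This is an illustrative sketch, not any particular library's API: it assumes an integer user_id and uses MD5 only because Python's built-in hash() is randomized per process and would route the same key differently across restarts.

```python
import hashlib

def shard_for(user_id: int, num_shards: int) -> int:
    """Map a user_id to a shard index with a stable hash."""
    # MD5 is used for stability, not security: the same user_id must
    # route to the same shard on every host and after every restart.
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Any query that includes user_id can be routed to exactly one shard;
# a query without it must be sent to all num_shards shards.
shard = shard_for(12345, num_shards=4)
assert 0 <= shard < 4
```

In a real system this function would live in the data-access layer (or in a proxy like vtgate), so application code never computes shard indexes directly.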
Sharding Strategies
Four approaches to mapping keys to shards: (1) Range-based sharding — assign key ranges to shards. user_id 1-1M -> shard 1, 1M-2M -> shard 2. Pros: range queries are efficient (all users in a range are co-located). Cons: hot spots if recent data is accessed more frequently (all new users are on the last shard). (2) Hash-based sharding — hash the shard key and modulo by the number of shards: shard = hash(user_id) % N. Pros: even distribution regardless of key distribution. Cons: range queries require scatter-gather (adjacent keys map to different shards), and adding a shard (changing N) requires reshuffling most data. (3) Directory-based sharding — a lookup table maps each key to a shard: user_id 12345 -> shard 3. Pros: flexible placement (move specific users to different shards), supports custom routing logic. Cons: the lookup table becomes a single point of failure and must be highly available. (4) Consistent hashing — a variation of hash-based sharding that minimizes data movement when shards are added or removed. Each shard owns a range on a hash ring; adding a shard only moves data from adjacent ranges.
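The consistent-hashing variant in point (4) is small enough to sketch. This is an illustrative toy (the class name, shard names, and vnode count are invented for the example), using virtual nodes so each physical shard owns many small arcs of the ring:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: a key maps to the first shard at or
    after the key's position on the ring, wrapping around the end."""

    def __init__(self, shards, vnodes=100):
        # Each shard is placed at `vnodes` pseudo-random positions;
        # virtual nodes smooth out the distribution between shards.
        self._ring = sorted(
            (self._hash(f"{shard}:{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        i = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[i % len(self._ring)][1]

# Adding a fourth shard moves only the keys on arcs the new shard
# takes over (roughly 1/4 of them), versus ~75% with hash(key) % N.
old = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
new = ConsistentHashRing(["shard-a", "shard-b", "shard-c", "shard-d"])
moved = sum(old.shard_for(str(k)) != new.shard_for(str(k)) for k in range(10000))
```

Production implementations (e.g. in Cassandra or DynamoDB) follow the same ring idea but add replication and explicit token management.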
Cross-Shard Queries and Joins
The biggest operational challenge of sharding: queries that span multiple shards. A query like “find all orders from the last 24 hours” (no shard key) must be sent to all shards, executed in parallel, and the results merged (scatter-gather). This is orders of magnitude slower than a single-shard query. Strategies to minimize cross-shard queries: (1) Co-locate related data — shard orders and order_items by the same key (user_id). A query for a user's orders and their items hits a single shard. (2) Denormalize — store frequently joined data in the same table. Instead of joining orders with products, store product_name and product_price directly in the order_items table (denormalized copy). (3) Global tables — small, read-heavy reference tables (countries, currencies, product categories) are replicated to all shards. Joins with these tables are local to each shard. (4) Application-level joins — for rare cross-shard queries, fetch data from each shard separately in the application layer and join in memory. This is acceptable for low-frequency administrative queries, not for user-facing traffic.
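The scatter-gather pattern in strategy (4) looks like this in outline. The sketch below substitutes in-memory dictionaries for real per-shard database connections; the table contents and helper names are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-ins for four shards; real code would hold a
# connection pool per shard and issue SQL instead of filtering lists.
SHARDS = [
    {"orders": [{"id": 1, "user_id": 10, "total": 40}]},
    {"orders": [{"id": 2, "user_id": 21, "total": 15}]},
    {"orders": [{"id": 3, "user_id": 32, "total": 99}]},
    {"orders": [{"id": 4, "user_id": 43, "total": 7}]},
]

def query_one(shard, predicate):
    """Run the per-shard part of the query (a SELECT in real life)."""
    return [row for row in shard["orders"] if predicate(row)]

def scatter_gather(predicate):
    """Fan the query out to every shard in parallel, then merge and
    re-sort the partial results in the application layer."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda s: query_one(s, predicate), SHARDS)
    merged = [row for part in partials for row in part]
    return sorted(merged, key=lambda r: r["total"], reverse=True)

# "All orders over 10" carries no shard key, so every shard is queried.
ids = [r["id"] for r in scatter_gather(lambda r: r["total"] > 10)]
# ids == [3, 1, 2], ordered by total descending
```

Note that latency is governed by the slowest shard, which is why scatter-gather is reserved for low-frequency queries.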
Resharding: Adding and Removing Shards
As data grows, you need more shards. Resharding is the process of redistributing data across a new shard count. With hash-based sharding (shard = hash(key) % N), changing N from 4 to 5 changes the shard assignment for roughly 80% of keys (even doubling from 4 to 8 moves 50%) — massive data movement. Strategies to minimize resharding cost: (1) Consistent hashing — adding a shard moves only data from neighboring ranges on the hash ring, approximately 1/N of total data. (2) Virtual shards — create 256 logical shards mapped to 4 physical shards (64 logical per physical). When adding a physical shard, move some logical shards to it. No rehashing required — just update the logical-to-physical mapping. (3) Double-write migration — write to both old and new shard assignments during migration. Read from the old assignment. Backfill old data to the new assignment. Switch reads to the new assignment. Stop writing to the old assignment. (4) Use Vitess — Vitess (YouTube sharding middleware, now CNCF) handles resharding automatically. Define the new sharding scheme, and Vitess copies data, verifies consistency, and switches traffic with minimal downtime.
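The virtual-shard scheme in point (2) is mostly bookkeeping, sketched here with invented numbers (256 logical shards, 4 then 5 physical):

```python
import hashlib

NUM_LOGICAL = 256  # fixed for the lifetime of the system

# Keys hash to one of 256 logical shards; only this mapping table
# changes during resharding, the hash function never does.
logical_to_physical = {l: l % 4 for l in range(NUM_LOGICAL)}

def logical_shard(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_LOGICAL

def physical_shard(key: str) -> int:
    return logical_to_physical[logical_shard(key)]

# Bring a fifth physical shard online by reassigning every fifth
# logical shard to it; keys on the other logical shards never move,
# and no key is ever rehashed.
for l in range(NUM_LOGICAL):
    if l % 5 == 0:
        logical_to_physical[l] = 4
```

In practice the mapping table lives in a config store or coordination service (e.g. etcd or ZooKeeper) so every query router sees the same assignment.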
Production Sharding Tools
Building custom sharding logic is error-prone. Production tools: (1) Vitess — a database clustering system for MySQL. Provides a SQL-aware proxy (vtgate) that routes queries to the correct shard. Supports horizontal resharding, vertical sharding (splitting tables across clusters), and schema migrations. Used by Slack, Square, GitHub, and PlanetScale. (2) Citus — a PostgreSQL extension for horizontal sharding. Distributes tables across worker nodes. Supports distributed queries, co-located joins, and reference tables. Available as open-source or as Azure Cosmos DB for PostgreSQL. (3) CockroachDB — a distributed SQL database that shards automatically. Data is split into ranges (64MB chunks) and distributed across nodes using Raft consensus. No manual shard key selection — CockroachDB handles distribution and rebalancing automatically. (4) Amazon Aurora — supports up to 128TB of storage per cluster, delaying the need for application-level sharding. Aurora Limitless (preview) adds automatic sharding for PostgreSQL beyond 128TB. Choose Vitess for MySQL shops needing control, Citus for PostgreSQL, and CockroachDB for new applications wanting automatic distribution.
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do you choose the right shard key for a database?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The shard key determines data distribution and query efficiency. Criteria: (1) High cardinality — the key must have enough distinct values to distribute data evenly. Sharding by country_code creates hot shards (US shard is 10x larger). Sharding by user_id distributes evenly across millions of users. (2) Query alignment — the shard key should appear in the WHERE clause of most queries. If 90% of queries filter by user_id, sharding by user_id means each query hits one shard. Queries without the shard key require scatter-gather across all shards. (3) Write distribution — the key should not create hot spots for writes. An auto-incrementing order_id with range sharding sends all new orders to the latest shard. Hash the order_id to distribute writes evenly. (4) Data co-location — related data should share the same shard key. If you query a user's orders with their order items, shard both tables by user_id so joins are local. Common choices: user_id for user-centric applications (social networks, SaaS), tenant_id for multi-tenant systems, hash(entity_id) for even distribution without locality. Avoid: timestamps (creates hot partitions), low-cardinality fields (country, status), and composite keys that make routing complex."
      }
    },
    {
      "@type": "Question",
      "name": "How do you handle cross-shard queries in a sharded database?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Cross-shard queries (queries without the shard key in the WHERE clause) must be sent to all shards, executed in parallel, and the results merged — called scatter-gather. This is orders of magnitude slower than a single-shard query. Strategies to minimize cross-shard queries: (1) Denormalization — store frequently joined data in the same row. Instead of joining orders with users to get user_name, store user_name directly in the orders table. This duplicates data but eliminates the cross-shard join. Update denormalized data asynchronously via events. (2) Global tables — small, read-heavy reference tables (countries, currencies, categories) are replicated to every shard. Joins with these tables are always local. The replication overhead is minimal because these tables rarely change. (3) Application-level joins — for rare administrative queries, fetch data from each shard in the application layer and join in memory. Acceptable for reporting dashboards, not for user-facing requests. (4) Materialized views — build pre-computed query results in a separate read-optimized store (Elasticsearch for search, ClickHouse for analytics). Feed it from change events. The materialized view handles cross-entity queries without touching the sharded database."
      }
    },
    {
      "@type": "Question",
      "name": "How does Vitess handle MySQL sharding at scale?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Vitess is a database clustering system for horizontal scaling of MySQL, originally developed at YouTube and now a CNCF graduated project. Architecture: vtgate (query router) sits between the application and MySQL. The application connects to vtgate as if it were a regular MySQL server. vtgate parses the SQL query, determines which shard(s) to route it to based on the sharding scheme (vschema), executes the query on the appropriate shard(s), and merges results. Sharding is transparent to the application. Key features: (1) Horizontal resharding — split a shard into two (or merge two into one) with minimal downtime. Vitess copies data, verifies consistency, and switches traffic atomically. (2) Connection pooling — vtgate multiplexes thousands of application connections onto a small number of MySQL connections per shard. MySQL connection overhead is high; vtgate reduces it dramatically. (3) Query rewriting — vtgate rewrites cross-shard queries into per-shard queries, executes them in parallel, and merges results. Simple aggregations (COUNT, SUM) are handled automatically. Complex queries may require application changes. Used by Slack, Square, GitHub, HubSpot, and PlanetScale (which offers Vitess as a managed service)."
      }
    },
    {
      "@type": "Question",
      "name": "When should you shard your database versus using other scaling strategies?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Sharding is a last resort because it adds significant operational complexity: cross-shard queries, distributed transactions, resharding, and application-level routing. Exhaust these alternatives first: (1) Vertical scaling — a single PostgreSQL instance on a 96-core server with 768GB RAM and NVMe SSDs can handle billions of rows and 50,000+ transactions per second. Cloud instances with 24TB RAM exist. Most applications never outgrow a single server. (2) Read replicas — if the bottleneck is read throughput, add read replicas. Most applications are 90%+ reads. Three read replicas triple your read capacity with no application changes. (3) Caching — Redis caching eliminates database reads for hot data. A cache hit rate of 95% reduces database load by 20x. (4) Connection pooling — PgBouncer or ProxySQL reduce connection overhead and allow more concurrent clients. (5) Query optimization — missing indexes, N+1 queries, and inefficient joins are the most common performance problems. EXPLAIN ANALYZE reveals these. (6) Table partitioning — partition large tables by date range. The database handles routing internally. Shard when: data exceeds single-server storage, write throughput exceeds single-server capacity, or regulatory requirements mandate data residency in specific regions."
      }
    }
  ]
}