Sharding & Partitioning
Splitting a database into smaller pieces either by rows (horizontal) or by columns (vertical)
Overview
Sharding (horizontal partitioning) splits a database into smaller, more manageable pieces called shards. Each shard contains a subset of the data and can be on a different server.
This is essential for scaling beyond what a single database server can handle, but introduces significant complexity.
Key Concepts
Shard Key
The field used to determine which shard data belongs to. Common choices: user_id, geographic region, hash of ID.
Horizontal Partitioning (Sharding)
Splitting rows across multiple databases. User 1-1000 on Shard1, 1001-2000 on Shard2.
Vertical Partitioning
Splitting columns across tables. User profile in Table1, user preferences in Table2.
Hash-Based Sharding
Use hash function on shard key to determine shard. Good distribution but hard to rebalance.
Range-Based Sharding
Split by ranges (IDs 1-1M on Shard1, 1M-2M on Shard2). Easy to add shards but can create hotspots.
How It Works
Example: Shard by user_id
Shard 1: Users 1-1,000,000 Shard 2: Users 1,000,001-2,000,000 Shard 3: Users 2,000,001-3,000,000
Query Flow:
- Get user 1,234,567
- Calculate shard: user_id / 1M = Shard 2
- Query only Shard 2
- Return result
Cross-Shard Query: SELECT * FROM users WHERE created_at > '2024-01-01' Problem: Must query ALL shards and merge results
- Much slower, avoid if possible
Use Cases
Applications with massive data (billions of rows)
Multi-tenant applications (shard by tenant_id)
Social media (shard by user_id)
Time-series data (shard by time range)
Geographic applications (shard by region)
Best Practices
Choose shard key carefully (even distribution, avoid hotspots)
Shard by entity that naturally partitions (user_id, tenant_id)
Avoid cross-shard queries and joins
Implement application-level routing
Plan for resharding from the start
Use consistent hashing for easier rebalancing
Monitor shard sizes and rebalance proactively
Consider managed sharding solutions (Vitess, Citus)
Interview Tips
What Interviewers Look For
- •
Explain horizontal vs vertical partitioning
- •
Discuss shard key selection: must distribute evenly, align with query patterns
- •
Mention challenges: cross-shard queries, joins, transactions
- •
Talk about sharding strategies: hash-based, range-based, geographic
- •
Explain consistent hashing for rebalancing
- •
Discuss when to shard: when single DB can't handle load (typically 100s of GBs or millions of QPS)
- •
Mention alternatives: scale vertically first, use read replicas, then shard as last resort