AWS S3 stores trillions of objects with 99.999999999% durability, serving millions of requests per second. While our earlier Object Storage guide covered S3 usage, this guide tackles designing the S3 service itself — the internal architecture of a hyperscale object storage system. This is a staff+ level system design question testing deep infrastructure knowledge.
Architecture: Metadata and Data Separation
S3 separates metadata (what objects exist and their properties) from data (the actual bytes).

Metadata service: a distributed key-value store mapping (bucket, key) -> object metadata (size, checksum, version, ACL, storage class, location pointers to data nodes). This service handles PutObject (create metadata entry), GetObject (look up metadata, return data node locations), ListObjects (enumerate keys with a prefix), HeadObject (return metadata without data), and DeleteObject (remove metadata, schedule data cleanup). The metadata service is the brain: it must be highly available, strongly consistent (a PutObject followed by a GetObject must return the new object), and able to handle billions of keys per bucket.

Data service: stores the actual object bytes across thousands of storage nodes. Objects are split into chunks, erasure-coded, and distributed across multiple nodes and availability zones. The data service is the muscle, optimized for high-throughput sequential I/O.

This separation allows independent scaling: metadata scales with request rate (IOPS), while data scales with storage capacity and bandwidth.
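The split above can be sketched in a few lines of Python. All class and method names here (MetadataService, DataService, put_object, and so on) are hypothetical illustrations of the pattern, not S3's internal API:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ObjectMetadata:
    size: int
    checksum: str
    chunk_locations: list  # pointers into the data service

class DataService:
    """The 'muscle': stores raw bytes, addressed by chunk id."""
    def __init__(self):
        self.chunks = {}

    def store(self, data: bytes) -> str:
        chunk_id = hashlib.sha256(data).hexdigest()
        self.chunks[chunk_id] = data
        return chunk_id

class MetadataService:
    """The 'brain': maps (bucket, key) -> metadata, never holds object bytes."""
    def __init__(self, data_service: DataService):
        self.index = {}
        self.data = data_service

    def put_object(self, bucket: str, key: str, body: bytes):
        chunk_id = self.data.store(body)            # write the data first
        self.index[(bucket, key)] = ObjectMetadata(  # then commit metadata
            size=len(body),
            checksum=hashlib.sha256(body).hexdigest(),
            chunk_locations=[chunk_id])

    def get_object(self, bucket: str, key: str) -> bytes:
        meta = self.index[(bucket, key)]  # metadata lookup, then data fetch
        return b"".join(self.data.chunks[c] for c in meta.chunk_locations)
```

Note the scaling consequence: the metadata index grows with object count and request rate, while the chunk store grows with bytes, so each tier can be sized independently.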
Key Partitioning and Request Routing
S3 handles millions of requests per second across trillions of objects; no single server can hold the metadata for them all. Partitioning: the key space is partitioned by a hash of the object key prefix, and each partition is served by a partition server that holds the metadata for its key range. When a request arrives (GET /bucket/photos/2026/img001.jpg): (1) the request router hashes the key prefix to determine the partition; (2) the router looks up the partition server in the partition map; (3) the request is forwarded to that partition server; (4) the partition server looks up the object metadata and returns the data node locations. Partition splitting: when a partition becomes too hot (too many requests or too many keys), it splits into two smaller partitions. This is automatic and transparent: the partition map is updated atomically. S3's famous “503 Slow Down” error occurs when a partition is overwhelmed before splitting completes. Mitigation: distribute keys across the key space. Avoid sequential key prefixes (e.g., timestamps) that concentrate all writes in one partition; use random prefixes or hash-based key naming. S3 has improved automatic partition management significantly in recent years, but understanding the partitioning model helps explain its performance characteristics.
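A minimal Python sketch of the routing step, and of why sequential timestamp prefixes create hot partitions. The partition count and prefix length are toy values invented for illustration, not S3 internals:

```python
import hashlib

NUM_PARTITIONS = 8  # toy static partition map; real S3 splits dynamically

def partition_for(key: str, prefix_len: int = 8) -> int:
    """Route a request by hashing the key's leading bytes."""
    prefix = key[:prefix_len]
    digest = hashlib.md5(prefix.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Sequential timestamp prefixes share the same leading bytes,
# so every write lands on one partition (a hot spot):
hot = {partition_for(f"2026-01-15/img{i}.jpg") for i in range(1000)}

# Hash-based prefixes spread the same workload across the partition map:
spread = {partition_for(hashlib.sha1(str(i).encode()).hexdigest()[:8] + "/img.jpg")
          for i in range(1000)}
```

Running this, `hot` contains a single partition id while `spread` covers many, which is exactly the key-naming mitigation described above.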
Data Durability: Erasure Coding Across AZs
S3's 11 nines of durability means: store 10 million objects and, on average, expect to lose a single object once every 10,000 years. How: (1) When an object is stored, it is split into data chunks and parity chunks using Reed-Solomon erasure coding. A typical scheme is 6 data + 3 parity = 9 total chunks; any 6 of the 9 can reconstruct the original. (2) The 9 chunks are distributed across 3 availability zones (3 chunks per AZ). Each AZ has independent power, networking, and cooling. (3) Each chunk is stored on a different storage node within its AZ. (4) Continuous background repair: when a storage node fails (a disk dies, a server crashes), the system detects the missing chunks (health checks every few minutes). Surviving chunks are read from other nodes, the missing chunks are recomputed using erasure decoding, and new chunks are written to healthy nodes. Repair completes within hours. (5) The probability of losing enough chunks simultaneously (before repair completes) to lose data is astronomically low: it would require the simultaneous failure of multiple disks across multiple AZs within the repair window. Storage overhead: 1.5x (9 chunks for 6 data units) vs 3x for triple replication. S3 saves 50% storage cost compared to naive replication while achieving higher durability (it tolerates 3 chunk losses vs 2 for triple replication).
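Reed-Solomon itself takes more than a few lines, but a single-parity XOR code is a toy instance of the same idea: store k data chunks plus parity, then recompute any one lost chunk from the survivors. Reed-Solomon generalizes this so that any k of the k+m chunks suffice, which is where the 1.5x overhead for a 6+3 scheme comes from. The functions below are a sketch, not S3's codec:

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(chunks):
    """k data chunks + 1 XOR parity chunk (k+1 stored total)."""
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]

def repair(encoded, missing_index):
    """Any one lost chunk is the XOR of all the survivors."""
    survivors = [c for i, c in enumerate(encoded) if i != missing_index]
    return reduce(xor_bytes, survivors)

data = [b"aaaa", b"bbbb", b"cccc"]   # k = 3 equal-sized chunks
stored = encode(data)                # 4 chunks stored, 1.33x overhead
assert repair(stored, 1) == b"bbbb"  # reconstruct the lost data chunk
```

The repair step here mirrors the background-repair loop described above: read the surviving chunks, decode the missing one, and write it to a healthy node.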
Consistency Model
S3 achieved strong read-after-write consistency in December 2020 (previously, it was eventually consistent for overwrites and deletes). Strong consistency means: after a successful PutObject, any subsequent GetObject returns the new version. After a successful DeleteObject, any subsequent GetObject returns 404. This is remarkable for a system handling millions of requests per second across multiple AZs. Implementation: the metadata service uses a strongly consistent replicated state machine. When PutObject succeeds, the metadata is committed across replicas before the response is returned. GetObject reads from the consistent metadata store. The data itself is also consistently replicated (all erasure-coded chunks are confirmed stored before the PutObject response). Versioning: when versioning is enabled, PutObject creates a new version instead of overwriting. Each version has a version_id. GetObject without a version_id returns the latest. GetObject with a version_id returns a specific version. DeleteObject with versioning adds a delete marker (soft delete) — the object is hidden but all versions remain. This enables point-in-time recovery.
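The versioning semantics (a new version per put, delete markers, access by version_id) can be sketched as follows. The class and its fields are hypothetical illustrations, not S3's implementation:

```python
import itertools

class VersionedBucket:
    """Sketch of versioning: puts append versions, deletes add a marker."""
    _ids = itertools.count(1)  # monotonically increasing version ids

    def __init__(self):
        self.versions = {}  # key -> list of (version_id, body or None)

    def put(self, key: str, body: bytes) -> int:
        vid = next(self._ids)
        self.versions.setdefault(key, []).append((vid, body))
        return vid

    def delete(self, key: str) -> int:
        # Delete marker (soft delete): hides the object, keeps all versions.
        vid = next(self._ids)
        self.versions.setdefault(key, []).append((vid, None))
        return vid

    def get(self, key: str, version_id=None) -> bytes:
        history = self.versions.get(key, [])
        if not history:
            raise KeyError("404: no such key")
        if version_id is None:
            _, body = history[-1]            # latest version wins
        else:
            body = dict(history)[version_id]  # point-in-time read
        if body is None:
            raise KeyError("404: delete marker")
        return body
```

A delete followed by a get without a version_id returns 404, but every earlier version stays readable by its version_id, which is what makes point-in-time recovery possible.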
Multi-Tenancy and Isolation
S3 serves millions of customers (tenants) on shared infrastructure. Isolation requirements: (1) Data isolation: one tenant cannot read another tenant's data. Enforced by IAM authentication on every request (signature verification using the tenant's secret key), bucket policies (who can access which buckets), and ACLs on individual objects. (2) Performance isolation: one tenant's heavy usage should not degrade another tenant's performance. Enforced by per-account rate limiting (S3 enforces per-prefix request rate limits: 5,500 GET/sec and 3,500 PUT/sec per prefix partition), automatic partition splitting for hot keys, and fair scheduling across tenants on shared storage nodes. (3) Blast-radius containment: a bug or failure affecting one tenant should not affect others. Cell-based architecture: S3 is divided into cells (independently deployable units), each handling a subset of the traffic. A failure in one cell affects only that cell's customers, not the entire service. New features are deployed cell by cell (canary deployment at the infrastructure level). Billing: S3 meters every request (GET, PUT, LIST, DELETE) and every byte stored, per storage class. Usage is aggregated per account and billed monthly. Metering must be highly accurate (it has financial impact) and low-overhead (it cannot add significant latency to each request).
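Per-prefix rate limiting is commonly implemented with a token bucket. The sketch below uses S3's published 5,500 GET/sec per-prefix figure, but the TokenBucket class and the allow_get helper are illustrative inventions, not S3 internals:

```python
import time

class TokenBucket:
    """Toy token-bucket rate limiter: refill at `rate` tokens/sec, cap at `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would respond with 503 Slow Down

# One bucket per (account, prefix), matching the per-prefix limit model.
get_limits = {}

def allow_get(account: str, prefix: str) -> bool:
    bucket = get_limits.setdefault((account, prefix), TokenBucket(5500, 5500))
    return bucket.allow()
```

Keying the buckets by (account, prefix) is what makes this a performance-isolation mechanism: one tenant hammering one prefix exhausts only its own bucket, leaving other tenants and other prefixes unaffected.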