An object storage system stores arbitrary binary data (files, images, videos, backups) addressed by a key that is unique within its bucket (bucket names themselves are globally unique), with near-infinite scalability and high durability. Unlike block storage (attached disks) or file systems (hierarchical directories), object storage has a flat namespace and is accessed via HTTP APIs. Amazon S3 is the reference implementation; understanding its architecture is essential for system design interviews about file storage, media platforms, or data lakes.
Object Model and API
An object store organizes data into buckets (top-level namespaces) containing objects (key-value pairs where the value is a binary blob). Object key: a string path within the bucket (user-uploads/profile/123/avatar.jpg). Object metadata: content-type, content-length, ETag (the MD5 hash of the body for single-part uploads; an opaque, differently formatted value for multipart or encrypted uploads), custom user-defined headers, and system metadata (last-modified, version-id). Core API operations: PUT (upload an object), GET (download by key), DELETE (remove), HEAD (fetch metadata without the body), LIST (enumerate keys with a prefix). Multipart upload: for large objects (required above 5GB, the single-PUT limit; recommended above roughly 100MB), S3 allows uploading in parts (5MB-5GB each; the final part may be smaller) and assembling them server-side. Benefits: parallel uploads (upload parts simultaneously from multiple threads), resumable uploads (a failed part can be retried without restarting), and streaming very large files without loading them entirely into memory. S3 supports up to 10,000 parts per object and objects up to 5TB.
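The part-splitting and ETag behavior above can be sketched in a few lines. This is an illustrative stand-alone snippet, not S3 client code: the 5MB part size is S3's minimum for non-final parts, and the multipart ETag format (MD5 of the concatenated per-part MD5 digests, suffixed with the part count) is an observed S3 convention rather than a formal specification.

```python
import hashlib

PART_SIZE = 5 * 1024 * 1024  # 5 MB: S3's minimum size for non-final parts

def split_parts(data: bytes, part_size: int = PART_SIZE) -> list[bytes]:
    """Split a blob into multipart-upload parts; the last part may be smaller."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

def multipart_etag(parts: list[bytes]) -> str:
    """S3-style multipart ETag: MD5 over the concatenated binary MD5 digests
    of each part, plus '-<part count>' (observed convention, not a spec)."""
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return hashlib.md5(digests).hexdigest() + f"-{len(parts)}"

blob = b"x" * (12 * 1024 * 1024)   # a 12 MB object -> 3 parts (5 + 5 + 2 MB)
parts = split_parts(blob)
print(len(parts), multipart_etag(parts))
```

This is why an ETag ending in "-3" signals a three-part multipart upload, and why it cannot be compared against a plain MD5 of the whole object.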
Storage Layer Architecture
S3’s storage architecture (inferred from Amazon’s public talks and comparable systems; the internal design is not fully published): objects are stored across a distributed storage cluster. A metadata service maps (bucket, key) → a list of storage nodes where the object’s chunks are stored. Upload flow: (1) Client sends PUT to an S3 frontend server. (2) Frontend generates an object ID and looks up or assigns storage nodes from the storage placement service (based on available capacity and fault domain: different racks and AZs). (3) Data is written to multiple storage nodes in parallel (e.g., 3-way replication, or erasure-coded fragments as described below). (4) When all writes acknowledge, the metadata service atomically records the mapping (bucket+key → object_id + storage nodes). (5) Return 200 to the client with the ETag. Chunk-based storage: large objects are split into fixed-size chunks (e.g., 64MB or 128MB). Each chunk is stored as a separate unit on storage nodes, and the metadata service tracks chunks per object. This enables efficient partial reads (an S3 Range GET for bytes 0-1023 of a 10GB object reads only the relevant chunk).
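The partial-read optimization comes down to mapping an HTTP byte range onto chunk indices. A minimal sketch, assuming fixed 64MB chunks (the function name and tuple layout are illustrative, not from any real S3 component):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumed fixed 64 MB chunks

def chunks_for_range(first_byte: int, last_byte: int,
                     chunk_size: int = CHUNK_SIZE) -> list[tuple[int, int, int]]:
    """Map an inclusive HTTP byte range [first_byte, last_byte] onto the
    chunks that must be read: (chunk_index, start_within_chunk, end_within_chunk)."""
    reads = []
    for idx in range(first_byte // chunk_size, last_byte // chunk_size + 1):
        start = max(first_byte, idx * chunk_size) - idx * chunk_size
        end = min(last_byte, (idx + 1) * chunk_size - 1) - idx * chunk_size
        reads.append((idx, start, end))
    return reads

# Range GET for bytes 0-1023 of a 10 GB object touches only chunk 0:
print(chunks_for_range(0, 1023))
```

A range that straddles a chunk boundary expands to two reads, one per chunk, which the frontend can issue to the two storage nodes in parallel.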
Durability: Erasure Coding
S3 achieves 99.999999999% (11 nines) durability. Simple 3x replication costs 3x storage overhead; at S3’s scale (exabytes), that is too expensive. Erasure coding: encode each object (or chunk) into N + K fragments using a Reed-Solomon or similar code, storing N data fragments plus K parity fragments across N + K storage nodes. The object can be reconstructed from any N fragments, tolerating up to K failures. S3 uses erasure coding for objects in the standard storage class: a 1MB object might be encoded as 6 data fragments + 4 parity fragments across 10 storage nodes in different availability zones, so any 4 fragments can be lost and the object is still recoverable. Storage overhead: (N+K)/N = 10/6 ≈ 1.67x, much better than 3x replication. Trade-off: reconstruction requires reading N fragments from N nodes and computing the decode (read amplification). For hot, frequently accessed data, 3x replication has lower read latency; for cold data (Glacier), wider erasure-coded stripes (larger N relative to K) are preferred for cost efficiency.
Metadata Service and Consistency
The metadata service is the heart of object storage: it maps keys to storage locations. Requirements: highly available (every read and write depends on it), strongly consistent (a PUT must be immediately visible to subsequent GETs; S3 moved to strong consistency in December 2020). Implementation: a distributed key-value store with strong consistency semantics (similar to etcd or HBase); S3 reportedly uses a custom metadata store backed by a Paxos-style replicated system. Key operations: atomic compare-and-swap for conditional writes (the If-None-Match: * header: write only if the key does not already exist), versioning support (each PUT creates a new version ID; the metadata service maintains a version chain), and lifecycle policies (automatically delete or transition objects after N days, implemented as scheduled metadata scans). Consistent listing: LIST operations return all objects with a given prefix in lexicographic order. Achieving consistent listing at scale (a bucket with billions of objects) requires the metadata index to be fully consistent; eventual consistency in listing was a major S3 limitation for years before the 2020 consistency upgrade.
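The conditional-write and version-chain semantics can be made concrete with a toy in-memory store. All names here are illustrative (this is not S3's internal API), but the behavior mirrors the section above: If-None-Match rejects a PUT when the key exists, and every successful PUT appends to a version chain:

```python
import itertools
import time

class MetadataStore:
    """Toy sketch of a metadata service's conditional writes and versioning."""

    def __init__(self):
        self._versions = {}          # (bucket, key) -> list of version records
        self._ids = itertools.count(1)

    def put(self, bucket, key, locations, if_none_match=False):
        chain = self._versions.setdefault((bucket, key), [])
        if if_none_match and chain:  # compare-and-swap: fail if key exists
            raise KeyError("412 Precondition Failed: key already exists")
        record = {"version_id": f"v{next(self._ids)}",
                  "locations": locations,
                  "last_modified": time.time()}
        chain.append(record)         # each PUT appends a new version
        return record["version_id"]

    def get(self, bucket, key, version_id=None):
        chain = self._versions.get((bucket, key), [])
        if not chain:
            raise KeyError("404 Not Found")
        if version_id is None:
            return chain[-1]         # unversioned GET returns the latest
        return next(r for r in chain if r["version_id"] == version_id)

store = MetadataStore()
v1 = store.put("media", "avatar.jpg", ["node-a", "node-b", "node-c"])
v2 = store.put("media", "avatar.jpg", ["node-d", "node-e", "node-f"])
print(store.get("media", "avatar.jpg")["version_id"])  # latest version
```

In a real deployment each of these operations would be a consensus round in the replicated store, which is what makes the mapping update in step (4) of the upload flow atomic.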
Pre-Signed URLs and Access Control
Direct client uploads to S3: instead of routing uploads through your application server (adding latency and server bandwidth cost), generate a pre-signed URL that allows a client to upload directly to S3 with temporary credentials. Pre-signed URL generation: the AWS SDK signs a URL with your credentials, specifying bucket, key, HTTP method (PUT), and expiry (e.g., 15 minutes). The client uses the pre-signed URL to PUT the file directly to S3; your server never sees the bytes. Your server receives a callback (S3 event notification → SQS → your application) when the upload completes. Access control: bucket policies (who can read/write which keys), IAM roles (EC2 instances get read access without storing credentials), and ACLs (per-object access control lists, deprecated in favor of bucket policies). CORS configuration: allow browser-based uploads from specific origins. Pre-signed download URLs: generate a temporary link to a private object that expires after a configurable interval (e.g., 1 hour). The link authenticates the request using query parameters (AWSAccessKeyId, Signature, Expires in the legacy Signature V2 scheme; X-Amz-Credential, X-Amz-Signature, X-Amz-Expires under the current SigV4 scheme), so no cookies or headers are required. Useful for serving private media files directly from S3 without a proxy.
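To show the mechanics, here is a sketch of the legacy Signature V2 query-string scheme (the one using the AWSAccessKeyId/Signature/Expires parameters mentioned above), chosen because it fits in a few lines; in practice you would call the SDK, which now signs with SigV4. The credentials below are obviously fake:

```python
import base64
import hashlib
import hmac
import time
import urllib.parse

def presign_get(bucket: str, key: str, access_key: str, secret_key: str,
                ttl: int = 3600) -> str:
    """Sketch of a legacy SigV2 pre-signed GET URL: sign the canonical string
    'GET\\n\\n\\n{expires}\\n/{bucket}/{key}' with HMAC-SHA1 of the secret key."""
    expires = int(time.time()) + ttl
    string_to_sign = f"GET\n\n\n{expires}\n/{bucket}/{key}"
    sig = base64.b64encode(
        hmac.new(secret_key.encode(), string_to_sign.encode(),
                 hashlib.sha1).digest()).decode()
    query = urllib.parse.urlencode(
        {"AWSAccessKeyId": access_key, "Expires": expires, "Signature": sig})
    return f"https://{bucket}.s3.amazonaws.com/{key}?{query}"

url = presign_get("private-media", "reports/q3.pdf", "AKIAEXAMPLE", "fake-secret")
print(url.split("?")[0])
```

The key property is that the signature commits to the method, path, and expiry, so S3 can verify the link server-side with no session state; anyone holding the URL can fetch the object until it expires.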