Question 1

How does erasure coding compare to 3-way replication for object storage durability?

Accepted Answer

3-way replication stores 3 full copies of each object on different nodes. It costs 3x raw storage and tolerates the loss of any 2 of the 3 copies while keeping the data available. Erasure coding (e.g., Reed-Solomon 6+3) splits an object into 6 data shards and computes 3 parity shards, storing 9 shards in total. Any 6 of the 9 shards can reconstruct the object, so the scheme tolerates 3 node failures. Storage overhead is 9/6 = 1.5x versus 3x for replication. The trade-off is reconstruction cost: reading a degraded object requires fetching at least 6 surviving shards and running the reconstruction computation, which adds latency compared with reading a single replica. For warm and cold storage, where durability matters more than read latency, erasure coding is usually preferred. For hot data requiring low read latency, replication is often used.
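The arithmetic above can be checked with a small sketch. The `Scheme` class and its field names are hypothetical, introduced only for illustration; the numbers are the ones from the answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """Hypothetical helper describing a redundancy scheme."""
    name: str
    data_units: int   # units needed to read the object (1 for a full replica)
    total_units: int  # units actually stored (copies or shards)

    @property
    def overhead(self) -> float:
        # Raw bytes stored per byte of user data.
        return self.total_units / self.data_units

    @property
    def tolerated_failures(self) -> int:
        # Units that can be lost while the object stays readable.
        return self.total_units - self.data_units


replication = Scheme("3-way replication", data_units=1, total_units=3)
rs_6_3 = Scheme("Reed-Solomon 6+3", data_units=6, total_units=9)

for scheme in (replication, rs_6_3):
    print(f"{scheme.name}: {scheme.overhead:.1f}x overhead, "
          f"tolerates {scheme.tolerated_failures} failures, "
          f"needs {scheme.data_units} unit(s) to read")
```

Note that `data_units` is also the minimum number of fetches for a read: 1 for a replica, 6 for the erasure-coded object, which is where the extra read latency comes from.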

Question 2

How does consistent hashing determine where an object is stored?

Accepted Answer

Consistent hashing maps both storage nodes and object keys onto a circular hash ring. When storing an object, the system hashes the object key to a position on the ring and assigns the object to the first node clockwise from that position. Adding or removing a node redistributes only a fraction of the objects (roughly 1/N of the total in a cluster of N nodes) rather than all of them, which is critical for cluster rebalancing. Virtual nodes (vnodes), where each physical node owns multiple positions on the ring, are used to even out data distribution and smooth rebalancing; when nodes have different capacities, a larger node can be given more vnodes. Object placement is determined purely by the hash function and the current ring state, with no central placement coordinator required at write time.
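A minimal sketch of such a ring, assuming SHA-256 truncated to 64 bits as the hash function and 100 vnodes per physical node. The `HashRing` class, `ring_hash`, and the `node#i` vnode naming are hypothetical choices for illustration, not part of any particular system.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    # Assumption: 64-bit ring positions taken from a SHA-256 digest.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, vnodes_per_node: int = 100):
        self.vnodes_per_node = vnodes_per_node
        self._positions = []  # sorted vnode positions on the ring
        self._owner = {}      # vnode position -> physical node name

    def add_node(self, node: str) -> None:
        # Each physical node owns several positions (vnodes) on the ring.
        for i in range(self.vnodes_per_node):
            pos = ring_hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove_node(self, node: str) -> None:
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        if not self._positions:
            raise LookupError("ring is empty")
        # First vnode clockwise from the key's position, wrapping at the end.
        idx = bisect.bisect_right(self._positions, ring_hash(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing()
for name in ("node-a", "node-b", "node-c", "node-d"):
    ring.add_node(name)

keys = [f"bucket/object-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add_node("node-e")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding a fifth node")
```

Every key that moves lands on the new node; keys owned by the other nodes stay where they were, which is the property that keeps rebalancing cheap.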

Question 3

How does multipart upload work and how is atomicity guaranteed?

Accepted Answer

A multipart upload has three phases. First, the client calls CreateMultipartUpload and receives an upload_id. Second, the client uploads individual parts (each at least 5 MB, except the last) as independent PUT requests and receives an ETag (a checksum of the part) for each one. Parts can be uploaded in parallel and in any order. Third, the client calls CompleteMultipartUpload with the ordered list of (part_number, ETag) pairs. The storage system concatenates the parts in order and makes the resulting object atomically visible. Until CompleteMultipartUpload succeeds, the object is not visible to readers. If the upload is abandoned, AbortMultipartUpload cleans up the stored parts. Parts are stored as temporary objects and are promoted to a permanent object only on completion.
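The three phases can be sketched with an in-memory stand-in for the storage system. Everything here is an assumption for illustration: the `InMemoryStore` class, its method names (snake_case versions of the API calls above), the use of SHA-256 hex digests as ETags, and a plain dict assignment standing in for the atomic publish step.

```python
import hashlib
import uuid


class InMemoryStore:
    """Toy model: staged parts live apart from the visible objects."""

    def __init__(self, min_part_size: int = 5 * 1024 * 1024):
        self.min_part_size = min_part_size
        self.objects = {}  # key -> bytes, what readers can see
        self.uploads = {}  # upload_id -> {"key": ..., "parts": {...}}

    def create_multipart_upload(self, key: str) -> str:
        upload_id = uuid.uuid4().hex
        self.uploads[upload_id] = {"key": key, "parts": {}}
        return upload_id

    def upload_part(self, upload_id: str, part_number: int, data: bytes) -> str:
        etag = hashlib.sha256(data).hexdigest()  # assumption: ETag = SHA-256
        self.uploads[upload_id]["parts"][part_number] = (etag, data)
        return etag

    def complete_multipart_upload(self, upload_id: str, parts) -> None:
        upload = self.uploads[upload_id]
        staged = upload["parts"]
        numbers = [n for n, _ in parts]
        if numbers != sorted(set(numbers)):
            raise ValueError("parts must be listed in ascending order")
        chunks = []
        for index, (number, etag) in enumerate(parts):
            if number not in staged or staged[number][0] != etag:
                raise ValueError(f"part {number} is missing or its ETag differs")
            data = staged[number][1]
            if index < len(parts) - 1 and len(data) < self.min_part_size:
                raise ValueError(f"part {number} is below the minimum size")
            chunks.append(data)
        # Validation is finished; one assignment makes the object visible.
        self.objects[upload["key"]] = b"".join(chunks)
        del self.uploads[upload_id]

    def abort_multipart_upload(self, upload_id: str) -> None:
        # Drop staged parts; nothing was ever visible to readers.
        self.uploads.pop(upload_id, None)


store = InMemoryStore(min_part_size=4)  # tiny limit so the demo stays small
uid = store.create_multipart_upload("videos/demo.bin")
etag2 = store.upload_part(uid, 2, b"tail")   # parts may arrive out of order
etag1 = store.upload_part(uid, 1, b"head-")
visible_before = "videos/demo.bin" in store.objects
store.complete_multipart_upload(uid, [(1, etag1), (2, etag2)])
print(visible_before, store.objects["videos/demo.bin"])
```

All validation happens before the publish step, so a failed completion leaves the staged parts in place and the key untouched.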

Question 4

How do lifecycle policies transition objects between storage classes?

Accepted Answer

Lifecycle policies are rules evaluated daily by a background job. A typical rule transitions objects to a cheaper storage class N days after creation or after last access. For example: STANDARD → STANDARD_IA after 30 days (infrequent access, lower storage cost but a per-retrieval fee), STANDARD_IA → GLACIER after 90 days (archival, hours to restore), GLACIER → permanent delete after 365 days. The background job scans the StorageObject metadata table for objects matching lifecycle conditions, migrates the physical data to the new storage backend (e.g., cold HDDs or object-level archival), and then updates the storage_class field in the object metadata. Object content is not modified during a transition; only the storage medium and the access cost model change.
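The rule-matching step of that background job might look like the sketch below. The `Rule` dataclass, the `next_action` function, and the choice to measure age from creation time only are assumptions for illustration; the thresholds are the ones from the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    from_class: str    # storage class the rule applies to
    min_age_days: int  # age threshold in days since creation
    action: str        # target storage class, or "DELETE"


RULES = (
    Rule("STANDARD", 30, "STANDARD_IA"),
    Rule("STANDARD_IA", 90, "GLACIER"),
    Rule("GLACIER", 365, "DELETE"),
)


def next_action(storage_class: str, created_at: datetime, now: datetime,
                rules: Sequence[Rule] = RULES) -> Optional[str]:
    """Return the transition due for one object, or None if nothing is due."""
    age_days = (now - created_at).days
    for rule in rules:
        if rule.from_class == storage_class and age_days >= rule.min_age_days:
            return rule.action
    return None


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(next_action("STANDARD", now - timedelta(days=45), now))
print(next_action("STANDARD", now - timedelta(days=10), now))
print(next_action("GLACIER", now - timedelta(days=400), now))
```

Each pass moves an object at most one step, so a very old STANDARD object in this sketch would reach GLACIER over two daily runs rather than one.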

Question 5

How does erasure coding achieve better storage efficiency than 3-way replication?

Accepted Answer

With Reed-Solomon(6,3), 6 data shards + 3 parity shards tolerate any 3 node failures; storage overhead is 9/6 = 1.5x vs 3x for 3-way replication; the tradeoff is higher CPU cost for encoding/decoding and higher read latency when recovering from node failures.
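To make the encode and recover steps concrete, here is a toy sketch that swaps in a much simpler code than Reed-Solomon: k data shards plus a single XOR parity shard, which tolerates only one lost shard instead of three. The function names are hypothetical. It is not Reed-Solomon, but the shape of the work is the same: compute parity on write, and on a degraded read fetch the survivors and recompute the missing shard.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal shards (zero-padded) plus one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = reduce(xor_bytes, shards)
    return shards + [parity]


def reconstruct(shards: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the object when at most one shard (data or parity) is missing."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity tolerates only one lost shard")
    repaired = list(shards)
    if missing:
        survivors = [s for s in shards if s is not None]
        # XOR of all surviving shards equals the missing one.
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    # Drop the parity shard and the zero padding.
    return b"".join(repaired[:-1])[:original_len]


payload = b"object storage payload"
stored = encode(payload, k=6)
stored[2] = None  # simulate a failed node holding data shard 2
recovered = reconstruct(stored, len(payload))
print(recovered == payload)
```

The degraded read touches every surviving shard and runs a computation over all of them, which is the extra CPU and latency cost the answer refers to.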

Question 6

How does consistent hashing place objects across storage nodes?

Accepted Answer

Each storage node is assigned multiple positions on a hash ring; the object key is hashed to a ring position; the object is stored on the nearest node(s) clockwise from that position; adding or removing a node redistributes only about 1/N of the objects in an N-node cluster.
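When more than one copy or shard must be placed, "nearest node(s) clockwise" is commonly read as walking the ring and skipping vnodes that belong to a physical node already chosen. The sketch below assumes that reading; `build_ring`, `preference_list`, the SHA-256 based hash, and 64 vnodes per node are hypothetical choices for illustration.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    # Assumption: 64-bit ring positions taken from a SHA-256 digest.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def build_ring(nodes, vnodes_per_node=64):
    """Return (sorted vnode positions, position -> physical node)."""
    owner = {ring_hash(f"{node}#{i}"): node
             for node in nodes for i in range(vnodes_per_node)}
    return sorted(owner), owner


def preference_list(key, positions, owner, replicas=3):
    """Walk clockwise from the key and collect distinct physical nodes."""
    start = bisect.bisect_right(positions, ring_hash(key))
    chosen = []
    for step in range(len(positions)):
        node = owner[positions[(start + step) % len(positions)]]
        if node not in chosen:  # skip further vnodes of an already-chosen node
            chosen.append(node)
            if len(chosen) == replicas:
                break
    return chosen


positions, owner = build_ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
targets = preference_list("photos/cat.jpg", positions, owner, replicas=3)
print(targets)
```

If the cluster has fewer physical nodes than requested replicas, the function returns every node once rather than placing two copies on the same machine.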

Question 7

How is multipart upload atomicity guaranteed?

Accepted Answer

All parts are uploaded to a staging area; the CompleteMultipartUpload call atomically assembles the final object; if the call fails or times out, the incomplete upload can be aborted, deleting all staged parts.

Question 8

How do lifecycle policies automate storage tier transitions?

Accepted Answer

Lifecycle rules specify age thresholds and target storage classes (e.g., move to STANDARD_IA after 30 days, GLACIER after 90 days); a background job evaluates objects against rules and calls the appropriate storage transition API.

Object Storage System Low-Level Design: Bucket Management, Data Placement, and Erasure Coding

Object Model

Data Placement: Consistent Hashing

Erasure Coding: Reed-Solomon

Multipart Upload

Checksums and Data Integrity

SQL DDL: Metadata Catalog

Python: Core Operations

Design Considerations Summary