Object Storage System Low-Level Design: Bucket Management, Data Placement, and Erasure Coding

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does erasure coding compare to 3-way replication for object storage durability?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "3-way replication stores 3 full copies of each object, providing 3x storage overhead but allowing any 2 of 3 nodes to fail while maintaining data availability. Erasure coding (e.g., Reed-Solomon 6+3) splits an object into 6 data shards and 3 parity shards, storing 9 shards total. Any 6 of the 9 shards can reconstruct the object, tolerating 3 node failures. Storage overhead is 9/6 = 1.5x versus 3x for replication. The trade-off is reconstruction cost: reading a degraded object requires fetching at least 6 shards and running the reconstruction computation, which adds latency versus reading a single replica. For warm and cold storage where durability matters more than read latency, erasure coding is preferred. For hot data requiring low read latency, replication is often used."
      }
    },
    {
      "@type": "Question",
      "name": "How does consistent hashing determine where an object is stored?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Consistent hashing maps both storage nodes and object keys onto a circular hash ring. When storing an object, the system hashes the object key and finds the position on the ring; the object is assigned to the first node clockwise from that position. Adding or removing a node only redistributes a fraction of objects (1/N of the total) rather than all objects, which is critical for cluster rebalancing. Virtual nodes (vnodes), where each physical node owns multiple positions on the ring, are used to ensure even data distribution and smooth rebalancing when nodes have different capacities. Object placement is determined purely by the hash function and the current ring state, with no central placement coordinator required at write time."
      }
    },
    {
      "@type": "Question",
      "name": "How does multipart upload work and how is atomicity guaranteed?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A multipart upload has three phases. First, the client calls CreateMultipartUpload and receives an upload_id. Second, the client uploads individual parts (each at least 5 MB, except the last) as independent PUT requests, receiving an ETag (checksum) per part. Parts can be uploaded in parallel and in any order. Third, the client calls CompleteMultipartUpload with the ordered list of (part_number, ETag) pairs. The storage system concatenates the parts in order and makes the resulting object atomically visible. Before CompleteMultipartUpload is called, the object is not visible to readers. If the upload is abandoned, AbortMultipartUpload cleans up the stored parts. Parts are stored as temporary objects and only promoted to a permanent object on complete."
      }
    },
    {
      "@type": "Question",
      "name": "How do lifecycle policies transition objects between storage classes?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Lifecycle policies are rules evaluated daily by a background job. A typical rule transitions objects to a cheaper storage class after N days since creation or since last access. For example: STANDARD to STANDARD_IA after 30 days (infrequent access, lower storage cost but per-retrieval fee), STANDARD_IA to GLACIER after 90 days (archival, hours to restore), GLACIER to permanent delete after 365 days. The background job scans the StorageObject metadata table for objects matching lifecycle conditions, updates the storage_class field, migrates the physical data to the new storage backend (e.g., cold HDDs or object-level archival), and updates the object metadata. Object content is not modified during transition; only the storage medium and access cost model changes."
      }
    }
  ]
}

An object storage system provides a flat, scalable store for arbitrary binary data (objects) addressed by a bucket name and key string. It is the architectural pattern behind AWS S3, Google Cloud Storage, and Azure Blob Storage — designed for massive scale, high durability, and low operational cost.

Object Model

The core data model is flat: bucket + key + version_id uniquely identifies an object. There is no real directory hierarchy — the slash in a key like photos/2024/image.jpg is just part of the key string, though the UI presents it as a folder structure via key prefix listing.

Each object has:

  • Data: arbitrary binary content, from bytes to terabytes (via multipart upload).
  • System metadata: content-type, content-length, checksum (MD5/SHA256), storage class, versioning info, created_at.
  • User-defined metadata: arbitrary key-value pairs stored as JSONB, returned with object GET.
  • Version ID: each PUT to a versioning-enabled bucket creates a new immutable version. A DELETE creates a delete marker rather than physical deletion.
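
The "folder" illusion over a flat key namespace comes from prefix listing. As a minimal sketch (the `list_keys` helper and its return shape are illustrative, loosely modeled on S3-style ListObjects, not an API defined in this design):

```python
def list_keys(keys: list[str], prefix: str = "", delimiter: str = "/"):
    """Group flat keys into (objects, common_prefixes), S3-ListObjects style."""
    objects, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the first delimiter is presented as a "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)

keys = ["photos/2024/image.jpg", "photos/2024/other.jpg", "photos/readme.txt"]
objs, prefixes = list_keys(keys, prefix="photos/")
# objs == ["photos/readme.txt"], prefixes == ["photos/2024/"]
```

No directory metadata exists anywhere; the grouping is computed at list time from the key strings alone.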

Data Placement: Consistent Hashing

Storage nodes are arranged on a consistent hash ring. An object's key is hashed (e.g., SHA-256) to a position on the ring; the object is placed on the first node clockwise from that position.

Virtual nodes (vnodes): each physical node is assigned multiple positions on the ring (e.g., 150 vnodes per node). Vnodes ensure even data distribution and make rebalancing smooth — when a node is added or removed, only the objects near its vnode positions are redistributed. Without vnodes, uneven key distribution causes hotspots.

Replication: for replicated storage, the object is written to the primary node (first clockwise) plus the next R-1 nodes on the ring (R = replication factor). Nodes are chosen from different racks to tolerate rack failure.
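
The placement scheme above can be sketched as a small ring class. This is a simplified illustration (node names and the 150-vnode default are arbitrary, and production rings would also enforce rack-awareness when choosing replicas, which is omitted here):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes: list[str], vnodes: int = 150):
        # Each physical node owns `vnodes` positions on the ring.
        self._ring = []  # sorted list of (position, physical_node)
        for node in nodes:
            for i in range(vnodes):
                pos = int(hashlib.sha256(f"{node}#{i}".encode()).hexdigest(), 16)
                self._ring.append((pos, node))
        self._ring.sort()

    def nodes_for(self, key: str, replicas: int = 3) -> list[str]:
        """First `replicas` distinct physical nodes clockwise from the key."""
        pos = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        idx = bisect.bisect(self._ring, (pos,))
        chosen: list[str] = []
        for step in range(len(self._ring)):
            node = self._ring[(idx + step) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == replicas:
                break
        return chosen

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
replicas = ring.nodes_for("photos/2024/image.jpg")  # 3 distinct nodes
```

Because placement is a pure function of the key and the ring state, any gateway holding the current ring can route a request without consulting a coordinator.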

Erasure Coding: Reed-Solomon

Reed-Solomon erasure coding splits an object into k data shards and m parity shards. Any k of the k+m shards can reconstruct the object. This tolerates m node failures with a storage overhead of (k+m)/k, versus R× for R-way replication (e.g., 3× for three full copies).

Example (6+3 erasure coding):

  • Object split into 6 equal data shards + 3 parity shards = 9 shards stored on 9 distinct nodes.
  • Storage overhead: 9/6 = 1.5x (vs 3x for 3-way replication).
  • Tolerates any 3 node failures.
  • Read path: if all 6 data shards are available, read directly (no reconstruction). If 1–3 shards are missing, fetch any 6 of the surviving shards and reconstruct (a CPU-bound operation).

Erasure coding is ideal for warm and cold storage tiers where durability matters and read latency is less critical. Hot storage may use replication for faster degraded-mode reads.
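
Full Reed-Solomon requires Galois-field arithmetic and is best left to a dedicated library, but the shard/reconstruct mechanics can be shown with the simplest erasure code: a single XOR parity shard (k data + 1 parity, tolerating one lost shard). This is a teaching sketch, not the 6+3 scheme above:

```python
def encode_xor(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal shards plus one XOR parity shard (zero-padded)."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(k * shard_len, b"\x00")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = bytes(shard_len)  # all-zero start
    for s in shards:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    return shards + [parity]

def reconstruct_xor(shards: list, missing: int) -> bytes:
    """Rebuild the shard at index `missing` by XOR-ing all survivors.

    Works because XOR of all k+1 shards (data + parity) is zero.
    """
    survivors = [s for i, s in enumerate(shards) if i != missing]
    out = survivors[0]
    for s in survivors[1:]:
        out = bytes(a ^ b for a, b in zip(out, s))
    return out

shards = encode_xor(b"hello object storage", k=6)  # 6 data + 1 parity shard
rebuilt = reconstruct_xor(shards, missing=2)
assert rebuilt == shards[2]
```

Reed-Solomon generalizes this to m parity shards, where any k of k+m suffice; the XOR trick is the m = 1 special case (the original object length must be stored in metadata so the zero padding can be stripped on read).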

Multipart Upload

Objects larger than roughly 100 MB should use multipart upload to enable parallelism, resumption after failure, and streaming without knowing the total size upfront. Each part except the last must be at least 5 MB.

  1. CreateMultipartUpload(bucket, key): returns upload_id. Object metadata is reserved but object is not yet visible.
  2. UploadPart(upload_id, part_number, data): upload each part independently. Parts are stored as temporary blocks. Client retains (part_number, ETag) for each part.
  3. CompleteMultipartUpload(upload_id, [(part_number, ETag)]): server validates ETag per part, concatenates parts in order, atomically commits the object as visible. All-or-nothing: if any part ETag mismatches, the entire upload fails.

Parts can be uploaded in parallel (e.g., 8 concurrent streams), dramatically reducing total upload time for large objects.
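
The parallel upload path can be sketched with a thread pool. Here `upload_part` is a local stand-in that just computes the ETag; in a real client it would be the network call, and the part size and worker count are illustrative defaults:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def upload_part(upload_id: str, part_number: int, data: bytes) -> str:
    # Stand-in for the real UploadPart request; returns the part's ETag (MD5).
    return hashlib.md5(data).hexdigest()

def upload_all_parts(upload_id: str, data: bytes,
                     part_size: int = 8 * 1024 * 1024,
                     workers: int = 8) -> list[tuple[int, str]]:
    """Split data into fixed-size parts and upload up to `workers` at once."""
    n_parts = -(-len(data) // part_size)  # ceiling division
    chunks = [(i + 1, data[i * part_size:(i + 1) * part_size])
              for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {num: pool.submit(upload_part, upload_id, num, chunk)
                   for num, chunk in chunks}
    # Ordered (part_number, ETag) pairs, ready for CompleteMultipartUpload.
    return [(num, fut.result()) for num, fut in sorted(futures.items())]

etags = upload_all_parts("upload-123", b"x" * 20_000_000)  # 3 parts of <= 8 MB
```

Because parts are independent, a failed part is simply retried or re-uploaded without restarting the whole transfer.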

Checksums and Data Integrity

  • Per-part MD5: computed by client and sent in Content-MD5 header; server verifies on receipt.
  • Full-object SHA-256: computed over all part data; stored in object metadata as the authoritative checksum.
  • End-to-end integrity: client can request the SHA-256 of the composite object after CompleteMultipartUpload and verify against its own locally computed checksum.
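
The two checksum levels compose cleanly because SHA-256 can be fed the parts incrementally in upload order. A minimal sketch (the part contents are arbitrary examples):

```python
import hashlib

parts = [b"first-part-bytes", b"second-part-bytes"]

# Per-part MD5 (the ETag), matching what the Content-MD5 header carries.
part_etags = [hashlib.md5(p).hexdigest() for p in parts]

# Authoritative full-object SHA-256, streamed over the parts in order.
h = hashlib.sha256()
for p in parts:
    h.update(p)
object_checksum = h.hexdigest()

# Incremental hashing over ordered parts equals hashing the assembled object,
# so the server never needs the whole object in memory to verify it.
assert object_checksum == hashlib.sha256(b"".join(parts)).hexdigest()
```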

SQL DDL: Metadata Catalog

CREATE TABLE StorageBucket (
    id                  BIGSERIAL PRIMARY KEY,
    name                VARCHAR(255)  NOT NULL UNIQUE,
    versioning_enabled  BOOLEAN       NOT NULL DEFAULT FALSE,
    lifecycle_policy    JSONB         NOT NULL DEFAULT '[]',
    created_at          TIMESTAMPTZ   NOT NULL DEFAULT now()
);

CREATE TABLE StorageObject (
    id              BIGSERIAL PRIMARY KEY,
    bucket_id       BIGINT        NOT NULL REFERENCES StorageBucket(id),
    key             TEXT          NOT NULL,
    version_id      VARCHAR(64)   NOT NULL,
    size_bytes      BIGINT        NOT NULL,
    content_type    VARCHAR(255),
    checksum        VARCHAR(128)  NOT NULL,  -- SHA-256 hex
    storage_class   VARCHAR(32)   NOT NULL DEFAULT 'STANDARD',
    is_delete_marker BOOLEAN      NOT NULL DEFAULT FALSE,
    user_metadata   JSONB         NOT NULL DEFAULT '{}',
    created_at      TIMESTAMPTZ   NOT NULL DEFAULT now(),
    UNIQUE (bucket_id, key, version_id)
);

CREATE INDEX idx_obj_bucket_key ON StorageObject (bucket_id, key, created_at DESC);

-- Multipart upload parts
CREATE TABLE StoragePart (
    upload_id     VARCHAR(64)   NOT NULL,
    part_number   INTEGER       NOT NULL CHECK (part_number BETWEEN 1 AND 10000),
    size_bytes    BIGINT        NOT NULL,
    checksum      VARCHAR(64)   NOT NULL,  -- MD5 hex
    stored_at     TIMESTAMPTZ   NOT NULL DEFAULT now(),
    PRIMARY KEY (upload_id, part_number)
);

Python: Core Operations

import hashlib
import uuid
from typing import Optional

# Simplified in-memory object store for illustration
_buckets: dict[str, dict] = {}
_objects: dict[tuple, dict] = {}  # (bucket, key, version_id) -> object
_parts: dict[str, dict] = {}      # upload_id -> {part_number -> data}

def put_object(bucket: str, key: str, data: bytes, metadata: Optional[dict] = None) -> str:
    """Store an object and return its version_id."""
    version_id = str(uuid.uuid4()).replace('-', '')
    checksum = hashlib.sha256(data).hexdigest()
    _objects[(bucket, key, version_id)] = {
        'data': data,
        'size_bytes': len(data),
        'checksum': checksum,
        'content_type': (metadata or {}).get('content_type', 'application/octet-stream'),
        'user_metadata': metadata or {},
        'storage_class': 'STANDARD',
    }
    return version_id

def get_object(bucket: str, key: str, version_id: Optional[str] = None) -> Optional[dict]:
    """Retrieve an object. Returns latest version if version_id is None."""
    if version_id:
        return _objects.get((bucket, key, version_id))
    # Find latest version by scanning (production: index by created_at DESC)
    matches = [(k, v) for k, v in _objects.items() if k[0] == bucket and k[1] == key]
    return matches[-1][1] if matches else None

def create_multipart_upload(bucket: str, key: str) -> str:
    """Initiate a multipart upload and return the upload_id."""
    upload_id = str(uuid.uuid4())
    _parts[upload_id] = {'bucket': bucket, 'key': key, 'parts': {}}
    return upload_id

def upload_part(upload_id: str, part_number: int, data: bytes) -> str:
    """Upload one part; return its ETag (MD5)."""
    etag = hashlib.md5(data).hexdigest()
    _parts[upload_id]['parts'][part_number] = {'data': data, 'etag': etag}
    return etag

def complete_multipart_upload(upload_id: str, part_etags: list[tuple[int, str]]) -> str:
    """Validate ETags, concatenate parts in order, commit object atomically."""
    upload = _parts[upload_id]
    combined = b''
    for part_number, expected_etag in sorted(part_etags):
        part = upload['parts'].get(part_number)
        if not part or part['etag'] != expected_etag:
            raise ValueError(f"Part {part_number} ETag mismatch")
        combined += part['data']
    version_id = put_object(upload['bucket'], upload['key'], combined)
    del _parts[upload_id]
    return version_id

Design Considerations Summary

  • Erasure coding: 1.5x overhead vs 3x replication; prefer for warm/cold storage; adds CPU cost for degraded reads.
  • Consistent hashing: virtual nodes are essential for even distribution; ring state must be consistent across the cluster.
  • Multipart upload: enables parallelism, resume on failure, and streaming; always use for objects over 100 MB.
  • Versioning: enables recovery from accidental deletes/overwrites; adds metadata catalog size proportional to object churn rate.
  • Lifecycle policies: critical for cost management at scale; automate tier transitions and expiration.

