Object Storage System Low-Level Design: Bucket Management, Data Placement, and Erasure Coding

An object storage system provides a flat, scalable store for arbitrary binary data (objects) addressed by a bucket name and key string. It is the architectural pattern behind AWS S3, Google Cloud Storage, and Azure Blob Storage — designed for massive scale, high durability, and low operational cost.

Object Model

The core data model is flat: bucket + key + version_id uniquely identifies an object. There is no real directory hierarchy — the slash in a key like photos/2024/image.jpg is just part of the key string, though the UI presents it as a folder structure via key prefix listing.
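The folder illusion can be sketched as a filter over the flat key space. This is a minimal illustration, assuming a hypothetical `list_prefix` helper and a plain list of keys; real services page through a sorted index rather than scanning every key.

```python
def list_prefix(keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) found directly under `prefix`."""
    objects, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Another delimiter follows: roll the key up into a "folder".
            common.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common)


keys = [
    "photos/2024/image.jpg",
    "photos/2024/cat.png",
    "photos/2023/old.jpg",
    "photos/readme.txt",
    "notes.txt",
]
print(list_prefix(keys, "photos/"))
# (['photos/readme.txt'], ['photos/2023/', 'photos/2024/'])
```

The "folders" exist only in the listing result; nothing is stored for them.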

Each object has:

  • Data: arbitrary binary content, from bytes to terabytes (via multipart upload).
  • System metadata: content-type, content-length, checksum (MD5/SHA256), storage class, versioning info, created_at.
  • User-defined metadata: arbitrary key-value pairs stored as JSONB, returned with object GET.
  • Version ID: each PUT to a versioning-enabled bucket creates a new immutable version. A DELETE creates a delete marker rather than physical deletion.
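Version and delete-marker behavior can be sketched with an in-memory version stack. The names here (`versions`, `put`, `delete`, `get`) and the `v1`, `v2` style IDs are hypothetical and chosen only for illustration.

```python
import itertools

_counter = itertools.count(1)
# (bucket, key) -> list of versions, oldest first
versions: dict[tuple[str, str], list[dict]] = {}


def _append(bucket, key, data, is_delete_marker):
    version_id = f"v{next(_counter)}"
    versions.setdefault((bucket, key), []).append(
        {"version_id": version_id, "data": data, "is_delete_marker": is_delete_marker}
    )
    return version_id


def put(bucket, key, data: bytes) -> str:
    """Each PUT appends a new immutable version."""
    return _append(bucket, key, data, False)


def delete(bucket, key) -> str:
    """A versioned DELETE appends a delete marker; no data is removed."""
    return _append(bucket, key, None, True)


def get(bucket, key, version_id=None):
    stack = versions.get((bucket, key), [])
    if version_id is None:
        # Latest version wins; a delete marker on top hides the object.
        if not stack or stack[-1]["is_delete_marker"]:
            return None
        return stack[-1]["data"]
    for version in stack:
        if version["version_id"] == version_id and not version["is_delete_marker"]:
            return version["data"]
    return None


first = put("media", "a.txt", b"one")
put("media", "a.txt", b"two")
delete("media", "a.txt")
print(get("media", "a.txt"))          # None: hidden by the delete marker
print(get("media", "a.txt", first))   # b'one': older versions stay readable
```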

Data Placement: Consistent Hashing

Storage nodes are arranged on a consistent hash ring. An object's key is hashed (e.g., SHA-256) to a position on the ring; the object is placed on the first node clockwise from that position.

Virtual nodes (vnodes): each physical node is assigned multiple positions on the ring (e.g., 150 vnodes per node). Vnodes even out data distribution and make rebalancing smooth: when a node is added or removed, only the objects in the ring segments adjacent to its vnode positions are redistributed. Without vnodes, each node owns a single arc of arbitrary length, so some nodes receive far more keys than others and become hotspots.
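A ring with vnodes can be sketched in a few lines. This is an illustration under stated assumptions: the `HashRing` class, the `node#i` vnode naming, and the use of the first 8 bytes of SHA-256 as the ring position are all choices made for this example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a ring position using the first 8 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=150):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, physical_node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node claims `vnodes` positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        """Return the owner of the first vnode clockwise from the key's hash."""
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"bucket/obj-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add_node("node-d")
moved = [k for k in keys if ring.lookup(k) != before[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-d")
```

Adding a node only reassigns keys to the new node; keys never shuffle between the existing nodes.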

Replication: for replicated storage, the object is written to the primary node (first clockwise) plus the next R-1 distinct physical nodes on the ring (R = replication factor). The walk skips vnodes whose physical node or rack is already represented, so the replicas land in different racks and the object survives a rack failure.
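Rack-aware replica selection can be sketched as a clockwise walk that skips vnodes whose node or rack has already been picked. The helpers `build_ring` and `pick_replicas`, the node names, and the rack map are hypothetical; the sketch returns fewer than R nodes when fewer than R racks exist, where a real system would need a fallback policy.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def build_ring(nodes, vnodes=50):
    """Sorted list of (position, node) with `vnodes` entries per node."""
    return sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))


def pick_replicas(ring, rack_of, key, r):
    """Walk clockwise from the key's position, one replica per rack."""
    start = bisect.bisect_left(ring, (_hash(key),))
    chosen, used_racks = [], set()
    for step in range(len(ring)):
        _, node = ring[(start + step) % len(ring)]
        if node in chosen or rack_of[node] in used_racks:
            continue  # same physical node or rack already holds a replica
        chosen.append(node)
        used_racks.add(rack_of[node])
        if len(chosen) == r:
            break
    return chosen


rack_of = {
    "n1": "rack-1", "n2": "rack-1",
    "n3": "rack-2", "n4": "rack-2",
    "n5": "rack-3", "n6": "rack-3",
}
ring = build_ring(rack_of)
print(pick_replicas(ring, rack_of, "photos/2024/image.jpg", r=3))
```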

Erasure Coding: Reed-Solomon

Reed-Solomon erasure coding splits an object into k data shards and computes m parity shards from them. Any k of the k+m shards can reconstruct the object, so the scheme tolerates the loss of any m shards (m node failures when each shard sits on a distinct node). The storage overhead is (k+m)/k, versus R for R-way replication.

Example (6+3 erasure coding):

  • Object split into 6 equal data shards + 3 parity shards = 9 shards stored on 9 distinct nodes.
  • Storage overhead: 9/6 = 1.5x (vs 3x for 3-way replication).
  • Tolerates any 3 node failures.
  • Read path: if all 6 data shards are available, read them directly (no reconstruction). If up to 3 shards are missing, fetch any 6 of the surviving shards and reconstruct the missing data (a CPU-bound operation).
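The "any k of k+m" property can be demonstrated with a toy Reed-Solomon code built from polynomial interpolation. This sketch makes simplifying assumptions: it works over the prime field GF(257) so it can use ordinary modular arithmetic, stores shard symbols as Python ints (a parity symbol can be 256), and uses hypothetical `encode` and `decode` helpers. Production systems typically use GF(2^8) with optimized libraries. It requires Python 3.8+ for the modular inverse via `pow`.

```python
P = 257  # prime modulus; every byte value 0..255 is a field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, k: int, m: int) -> list[list[int]]:
    """Stripe data across k data shards and derive m parity shards."""
    padded = data + bytes(-len(data) % k)  # zero-pad to a multiple of k
    shards = [[] for _ in range(k + m)]
    for offset in range(0, len(padded), k):
        stripe = padded[offset:offset + k]
        points = list(enumerate(stripe))  # data byte i is the value at x = i
        for x in range(k):
            shards[x].append(stripe[x])
        for x in range(k, k + m):
            shards[x].append(_interpolate(points, x))  # parity: value at x >= k
    return shards


def decode(available: dict[int, list[int]], k: int, length: int) -> bytes:
    """Rebuild the original bytes from any k surviving shards."""
    if len(available) < k:
        raise ValueError("need at least k shards to reconstruct")
    chosen = sorted(available.items())[:k]
    out = bytearray()
    for s in range(len(chosen[0][1])):
        points = [(x, shard[s]) for x, shard in chosen]
        out.extend(_interpolate(points, x) for x in range(k))
    return bytes(out[:length])


data = b"hello object storage"
shards = encode(data, k=6, m=3)
survivors = {i: s for i, s in enumerate(shards) if i not in (0, 4, 7)}  # lose 3 shards
print(decode(survivors, k=6, length=len(data)) == data)  # True
```

The decoder needs the original length to strip the padding, which is one reason the object size is kept in the metadata catalog.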

Erasure coding is ideal for warm and cold storage tiers where durability matters and read latency is less critical. Hot storage may use replication for faster degraded-mode reads.

Multipart Upload

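Large objects should use multipart upload to enable parallelism, resume on failure, and streaming without knowing the total size upfront. As a reference point, S3 recommends multipart upload for objects of roughly 100 MB and above, and requires every part except the last to be at least 5 MB.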

  1. CreateMultipartUpload(bucket, key): returns upload_id. Object metadata is reserved but object is not yet visible.
  2. UploadPart(upload_id, part_number, data): upload each part independently. Parts are stored as temporary blocks. Client retains (part_number, ETag) for each part.
  3. CompleteMultipartUpload(upload_id, [(part_number, ETag)]): server validates ETag per part, concatenates parts in order, atomically commits the object as visible. All-or-nothing: if any part ETag mismatches, the entire upload fails.

Parts can be uploaded in parallel (e.g., 8 concurrent streams), dramatically reducing total upload time for large objects.
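A client-side sketch of parallel part upload, assuming a thread pool of 8 workers. `upload_part_stub` is a hypothetical stand-in for the network call and simply returns the part's MD5 as its ETag; `split_parts` and `upload_parallel` are likewise illustrative names.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 5 * 1024 * 1024  # illustrative part size


def split_parts(data: bytes, part_size: int = PART_SIZE):
    """Return [(part_number, chunk)] with part numbers starting at 1."""
    return [
        (offset // part_size + 1, data[offset:offset + part_size])
        for offset in range(0, len(data), part_size)
    ]


def upload_part_stub(upload_id: str, part_number: int, chunk: bytes) -> str:
    """Hypothetical network call; here it only computes the ETag."""
    return hashlib.md5(chunk).hexdigest()


def upload_parallel(upload_id: str, data: bytes, part_size: int = PART_SIZE, workers: int = 8):
    """Upload all parts concurrently and return the (part_number, ETag) manifest."""
    parts = split_parts(data, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            number: pool.submit(upload_part_stub, upload_id, number, chunk)
            for number, chunk in parts
        }
        # result() re-raises any worker exception, so a failed part fails the upload.
        return sorted((number, future.result()) for number, future in futures.items())


manifest = upload_parallel("demo-upload", bytes(2500), part_size=1000)
print([number for number, _ in manifest])  # [1, 2, 3]
```

The returned manifest is what the client would pass to CompleteMultipartUpload; a retry loop around each part is the natural next addition.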

Checksums and Data Integrity

  • Per-part MD5: computed by client and sent in Content-MD5 header; server verifies on receipt.
  • Full-object SHA-256: computed over all part data; stored in object metadata as the authoritative checksum.
  • End-to-end integrity: client can request the SHA-256 of the composite object after CompleteMultipartUpload and verify against its own locally computed checksum.
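The two checksum layers can be sketched with the standard library. The helper names are hypothetical; the Content-MD5 value is the base64 encoding of the raw 16-byte MD5 digest, not the hex string.

```python
import base64
import hashlib


def content_md5(part: bytes) -> str:
    """Value for the Content-MD5 header: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(part).digest()).decode("ascii")


def verify_part(part: bytes, header_value: str) -> bool:
    """Server-side check performed when a part is received."""
    return content_md5(part) == header_value


def object_sha256(parts) -> str:
    """Stream parts in order through one SHA-256 without concatenating them."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(part)
    return digest.hexdigest()


parts = [b"first part ", b"second part"]
headers = [content_md5(p) for p in parts]
print(all(verify_part(p, h) for p, h in zip(parts, headers)))           # True
print(object_sha256(parts) == hashlib.sha256(b"".join(parts)).hexdigest())  # True
```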

SQL DDL: Metadata Catalog

CREATE TABLE StorageBucket (
    id                  BIGSERIAL PRIMARY KEY,
    name                VARCHAR(255)  NOT NULL UNIQUE,
    versioning_enabled  BOOLEAN       NOT NULL DEFAULT FALSE,
    lifecycle_policy    JSONB         NOT NULL DEFAULT '[]',
    created_at          TIMESTAMPTZ   NOT NULL DEFAULT now()
);

CREATE TABLE StorageObject (
    id              BIGSERIAL PRIMARY KEY,
    bucket_id       BIGINT        NOT NULL REFERENCES StorageBucket(id),
    key             TEXT          NOT NULL,
    version_id      VARCHAR(64)   NOT NULL,
    size_bytes      BIGINT        NOT NULL,
    content_type    VARCHAR(255),
    checksum        VARCHAR(128)  NOT NULL,  -- SHA-256 hex
    storage_class   VARCHAR(32)   NOT NULL DEFAULT 'STANDARD',
    is_delete_marker BOOLEAN      NOT NULL DEFAULT FALSE,
    user_metadata   JSONB         NOT NULL DEFAULT '{}',
    created_at      TIMESTAMPTZ   NOT NULL DEFAULT now(),
    UNIQUE (bucket_id, key, version_id)  -- id is already the primary key; a table can have only one
);

CREATE INDEX idx_obj_bucket_key ON StorageObject (bucket_id, key, created_at DESC);

-- Multipart upload parts
CREATE TABLE StoragePart (
    upload_id     VARCHAR(64)   NOT NULL,
    part_number   INTEGER       NOT NULL CHECK (part_number BETWEEN 1 AND 10000),
    size_bytes    BIGINT        NOT NULL,
    checksum      VARCHAR(64)   NOT NULL,  -- MD5 hex
    stored_at     TIMESTAMPTZ   NOT NULL DEFAULT now(),
    PRIMARY KEY (upload_id, part_number)
);
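The latest-version lookup that the (bucket_id, key, created_at DESC) index serves can be sketched as follows. This is an illustration only: it swaps in SQLite through Python's `sqlite3` module in place of PostgreSQL, trims the table to the columns the query needs, stores timestamps as ISO-8601 text, and uses a hypothetical `latest_version` helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StorageObject (
    bucket_id        INTEGER NOT NULL,
    key              TEXT    NOT NULL,
    version_id       TEXT    NOT NULL,
    is_delete_marker INTEGER NOT NULL DEFAULT 0,
    created_at       TEXT    NOT NULL,
    UNIQUE (bucket_id, key, version_id)
);
CREATE INDEX idx_obj_bucket_key ON StorageObject (bucket_id, key, created_at DESC);
""")
conn.executemany(
    "INSERT INTO StorageObject VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a.txt", "v1", 0, "2024-01-01T00:00:00Z"),
        (1, "a.txt", "v2", 0, "2024-02-01T00:00:00Z"),
        (1, "b.txt", "v1", 0, "2024-01-01T00:00:00Z"),
        (1, "b.txt", "v2", 1, "2024-03-01T00:00:00Z"),  # delete marker on top
    ],
)


def latest_version(bucket_id: int, key: str):
    """Return the newest version_id, or None if absent or hidden by a delete marker."""
    row = conn.execute(
        "SELECT version_id, is_delete_marker FROM StorageObject "
        "WHERE bucket_id = ? AND key = ? ORDER BY created_at DESC LIMIT 1",
        (bucket_id, key),
    ).fetchone()
    if row is None or row[1]:
        return None
    return row[0]


print(latest_version(1, "a.txt"))  # v2
print(latest_version(1, "b.txt"))  # None
```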

Python: Core Operations

import hashlib
import uuid
from typing import Optional

# Simplified in-memory object store for illustration
_buckets: dict[str, dict] = {}
_objects: dict[tuple, dict] = {}  # (bucket, key, version_id) -> object
_parts: dict[str, dict] = {}      # upload_id -> {'bucket', 'key', 'parts': {part_number -> {'data', 'etag'}}}

def put_object(bucket: str, key: str, data: bytes, metadata: Optional[dict] = None) -> str:
    """Store an object and return its version_id."""
    version_id = str(uuid.uuid4()).replace('-', '')
    checksum = hashlib.sha256(data).hexdigest()
    _objects[(bucket, key, version_id)] = {
        'data': data,
        'size_bytes': len(data),
        'checksum': checksum,
        'content_type': (metadata or {}).get('content_type', 'application/octet-stream'),
        'user_metadata': metadata or {},
        'storage_class': 'STANDARD',
    }
    return version_id

def get_object(bucket: str, key: str, version_id: Optional[str] = None) -> Optional[dict]:
    """Retrieve an object. Returns latest version if version_id is None."""
    if version_id:
        return _objects.get((bucket, key, version_id))
    # Find latest version by scanning; relies on dict insertion order (production: index by created_at DESC)
    matches = [(k, v) for k, v in _objects.items() if k[0] == bucket and k[1] == key]
    return matches[-1][1] if matches else None

def create_multipart_upload(bucket: str, key: str) -> str:
    """Initiate a multipart upload and return the upload_id."""
    upload_id = str(uuid.uuid4())
    _parts[upload_id] = {'bucket': bucket, 'key': key, 'parts': {}}
    return upload_id

def upload_part(upload_id: str, part_number: int, data: bytes) -> str:
    """Upload one part; return its ETag (MD5)."""
    etag = hashlib.md5(data).hexdigest()
    _parts[upload_id]['parts'][part_number] = {'data': data, 'etag': etag}
    return etag

def complete_multipart_upload(upload_id: str, part_etags: list[tuple[int, str]]) -> str:
    """Validate ETags, concatenate parts in order, commit object atomically."""
    upload = _parts[upload_id]
    combined = b''
    for part_number, expected_etag in sorted(part_etags):
        part = upload['parts'].get(part_number)
        if not part or part['etag'] != expected_etag:
            raise ValueError(f"Part {part_number} ETag mismatch")
        combined += part['data']
    version_id = put_object(upload['bucket'], upload['key'], combined)
    del _parts[upload_id]
    return version_id

Design Considerations Summary

  • Erasure coding: 1.5x overhead vs 3x replication; prefer for warm/cold storage; adds CPU cost for degraded reads.
  • Consistent hashing: virtual nodes are essential for even distribution; ring state must be consistent across the cluster.
  • Multipart upload: enables parallelism, resume on failure, and streaming; always use for objects over 100 MB.
  • Versioning: enables recovery from accidental deletes/overwrites; adds metadata catalog size proportional to object churn rate.
  • Lifecycle policies: critical for cost management at scale; automate tier transitions and expiration.
