An object storage system provides a flat, scalable store for arbitrary binary data (objects) addressed by a bucket name and key string. It is the architectural pattern behind AWS S3, Google Cloud Storage, and Azure Blob Storage — designed for massive scale, high durability, and low operational cost.
Object Model
The core data model is flat: bucket + key + version_id uniquely identifies an object. There is no real directory hierarchy — the slash in a key like photos/2024/image.jpg is just part of the key string, though the UI presents it as a folder structure via key prefix listing.
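The folder illusion can be sketched as a prefix-and-delimiter listing over a flat set of keys. This is a minimal illustration, not any provider's API: `list_objects` and its return shape are hypothetical names chosen for the example.

```python
def list_objects(keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) found directly under `prefix`.

    Hypothetical helper: keys are plain strings in a flat namespace.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows the delimiter: roll it up as a "folder".
            common_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


keys = [
    "photos/2024/image.jpg",
    "photos/2024/cat.png",
    "photos/2023/old.jpg",
    "photos/readme.txt",
    "notes.txt",
]
objects, folders = list_objects(keys, prefix="photos/")
print(objects)  # keys directly under photos/
print(folders)  # rolled-up "subfolders"
```

No directory entries exist anywhere in this sketch; the "folders" are computed at listing time from the key strings alone.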
Each object has:
- Data: arbitrary binary content, from bytes to terabytes (via multipart upload).
- System metadata: content-type, content-length, checksum (MD5/SHA256), storage class, versioning info, created_at.
- User-defined metadata: arbitrary key-value pairs stored as JSONB, returned with object GET.
- Version ID: each PUT to a versioning-enabled bucket creates a new immutable version. A DELETE creates a delete marker rather than physical deletion.
Data Placement: Consistent Hashing
Storage nodes are arranged on a consistent hash ring. An object's key is hashed (e.g., SHA-256) to a position on the ring; the object is placed on the first node clockwise from that position.
Virtual nodes (vnodes): each physical node is assigned multiple positions on the ring (e.g., 150 vnodes per node). Vnodes ensure even data distribution and make rebalancing smooth — when a node is added or removed, only the objects near its vnode positions are redistributed. Without vnodes, uneven key distribution causes hotspots.
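A minimal sketch of a ring with virtual nodes follows. The class name `HashRing`, the `node#i` vnode labeling, and the use of the first 8 bytes of SHA-256 as the ring position are assumptions for illustration, not a prescribed design.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes_per_node=150):
        self.vnodes_per_node = vnodes_per_node
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node owns many positions (vnodes) on the ring.
        for i in range(self.vnodes_per_node):
            bisect.insort(self._ring, (ring_position(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key):
        # First vnode at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (ring_position(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"photos/{i}.jpg" for i in range(2000)]
before = {k: ring.lookup(k) for k in keys}

ring.add_node("node-e")
moved = [k for k in keys if ring.lookup(k) != before[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

Every key that changes owner moves to the newly added node; keys never shuffle between the existing nodes, which is the property that keeps rebalancing cheap.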
Replication: for replicated storage, the object is written to the primary node (first clockwise) plus the next R-1 nodes on the ring (R = replication factor). Nodes are chosen from different racks to tolerate rack failure.
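Rack-aware replica selection can be sketched as a clockwise walk that skips any vnode whose rack is already represented. The helper names and the node-to-rack mapping are hypothetical; the sketch also assumes there are at least R racks, and returns fewer replicas otherwise.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def build_ring(node_racks, vnodes_per_node=150):
    """Sorted list of (position, node) with several vnodes per node."""
    return sorted(
        (ring_position(f"{node}#{i}"), node)
        for node in node_racks
        for i in range(vnodes_per_node)
    )


def replicas_for(key, ring, node_racks, r=3):
    """Primary plus r-1 further nodes, each on a different rack."""
    start = bisect.bisect_left(ring, (ring_position(key), ""))
    chosen, used_racks = [], set()
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        rack = node_racks[node]
        if rack in used_racks:
            continue  # same rack (or same node) already holds a replica
        chosen.append(node)
        used_racks.add(rack)
        if len(chosen) == r:
            break
    return chosen


node_racks = {
    "n1": "rack-a", "n2": "rack-a",
    "n3": "rack-b", "n4": "rack-b",
    "n5": "rack-c", "n6": "rack-c",
}
ring = build_ring(node_racks)
replicas = replicas_for("photos/2024/image.jpg", ring, node_racks, r=3)
print(replicas)
```

The first entry is always the primary (first node clockwise); the rest are the next nodes on the ring that add a new rack.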
Erasure Coding: Reed-Solomon
Reed-Solomon erasure coding splits an object into k data shards and m parity shards. Any k of the k+m shards can reconstruct the object. This tolerates m node failures with a storage overhead of (k+m)/k, versus m+1 full copies for replication with the same fault tolerance.
Example (6+3 erasure coding):
- Object split into 6 equal data shards + 3 parity shards = 9 shards stored on 9 distinct nodes.
- Storage overhead: 9/6 = 1.5x (vs 3x for 3-way replication).
- Tolerates any 3 node failures.
- Read path: if all 6 data shards are available, read them directly (no reconstruction). If up to 3 shards are missing, fetch any 6 surviving shards and reconstruct the missing data (a CPU-bound operation).
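The two read paths can be illustrated with the simplest possible erasure code: k data shards plus a single XOR parity shard. This is deliberately not Reed-Solomon (it repairs only one lost shard, where a 6+3 scheme repairs three), and the function names are made up for the sketch, but the fast path and the degraded path have the same shape.

```python
def split_with_parity(data: bytes, k: int):
    """Split data into k equal shards (zero-padded) plus one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = bytearray(shard_len)
    for shard in shards:
        for i, byte in enumerate(shard):
            parity[i] ^= byte
    return shards + [bytes(parity)]


def read_object(shards, k: int, size: int) -> bytes:
    """Read from k+1 shard slots; a missing shard is represented by None."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost shard")
    data_shards = list(shards[:k])
    if missing and missing[0] < k:
        # Degraded read: XOR of all surviving shards rebuilds the lost one.
        survivors = [s for s in shards if s is not None]
        rebuilt = bytearray(len(survivors[0]))
        for s in survivors:
            for i, byte in enumerate(s):
                rebuilt[i] ^= byte
        data_shards[missing[0]] = bytes(rebuilt)
    # Fast path (or after repair): concatenate data shards, drop padding.
    return b"".join(data_shards)[:size]


payload = b"object storage!"
stored = split_with_parity(payload, k=4)

degraded = list(stored)
degraded[2] = None  # simulate a failed node holding data shard 2
print(read_object(degraded, k=4, size=len(payload)))
```

When only the parity shard is lost, no reconstruction is needed at all, which mirrors the "all data shards available" fast path above.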
Erasure coding is ideal for warm and cold storage tiers where durability matters and read latency is less critical. Hot storage may use replication for faster degraded-mode reads.
Multipart Upload
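Large objects should use multipart upload, which enables parallelism, resume on failure, and streaming without knowing the total size upfront. In S3, every part except the last must be at least 5 MB, and AWS recommends multipart upload once an object reaches about 100 MB. The flow has three calls: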
- CreateMultipartUpload(bucket, key): returns upload_id. Object metadata is reserved but object is not yet visible.
- UploadPart(upload_id, part_number, data): upload each part independently. Parts are stored as temporary blocks. Client retains (part_number, ETag) for each part.
- CompleteMultipartUpload(upload_id, [(part_number, ETag)]): server validates ETag per part, concatenates parts in order, atomically commits the object as visible. All-or-nothing: if any part ETag mismatches, the entire upload fails.
Parts can be uploaded in parallel (e.g., 8 concurrent streams), dramatically reducing total upload time for large objects.
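A client-side sketch of the parallel upload, using a thread pool against a stand-in server. `FakeStore` and `parallel_upload` are hypothetical names, and the in-memory store only mimics the UploadPart and CompleteMultipartUpload calls described above; a real client would issue HTTP requests instead.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 5 * 1024 * 1024  # assumed part size for the sketch


class FakeStore:
    """In-memory stand-in for the server side of a multipart upload."""

    def __init__(self):
        self.parts = {}    # (upload_id, part_number) -> bytes
        self.objects = {}  # key -> bytes

    def upload_part(self, upload_id, part_number, data):
        self.parts[(upload_id, part_number)] = data
        return hashlib.md5(data).hexdigest()  # ETag

    def complete(self, upload_id, key, part_etags):
        body = []
        for part_number, etag in sorted(part_etags):
            data = self.parts[(upload_id, part_number)]
            if hashlib.md5(data).hexdigest() != etag:
                raise ValueError(f"Part {part_number} ETag mismatch")
            body.append(data)
        self.objects[key] = b"".join(body)  # commit only after all parts pass


def parallel_upload(store, upload_id, key, payload,
                    part_size=PART_SIZE, workers=8):
    """Split payload into parts, upload them concurrently, then complete."""
    chunks = [payload[i:i + part_size] for i in range(0, len(payload), part_size)]

    def send(numbered):
        part_number, chunk = numbered
        return part_number, store.upload_part(upload_id, part_number, chunk)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        part_etags = list(pool.map(send, enumerate(chunks, start=1)))
    store.complete(upload_id, key, part_etags)
    return part_etags


store = FakeStore()
payload = bytes(range(256)) * 40  # 10,240 bytes
etags = parallel_upload(store, "upload-1", "big.bin", payload, part_size=1024)
print(len(etags), "parts uploaded")
```

Because each part is independent, a failed part can be retried on its own without resending the parts that already succeeded.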
Checksums and Data Integrity
- Per-part MD5: computed by client and sent in Content-MD5 header; server verifies on receipt.
- Full-object SHA-256: computed over all part data; stored in object metadata as the authoritative checksum.
- End-to-end integrity: client can request the SHA-256 of the composite object after CompleteMultipartUpload and verify against its own locally computed checksum.
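The checks above can be sketched with `hashlib`. The helper names are hypothetical; the sketch assumes the Content-MD5 value is the base64 encoding of the raw MD5 digest (the convention from RFC 1864) and that the full-object SHA-256 is computed by streaming the parts in order.

```python
import base64
import hashlib


def content_md5(part: bytes) -> str:
    """Value a client would send in the Content-MD5 header for one part."""
    return base64.b64encode(hashlib.md5(part).digest()).decode("ascii")


def verify_part(part: bytes, header_value: str) -> bool:
    """Server-side check on receipt of a part."""
    return content_md5(part) == header_value


def composite_sha256(parts) -> str:
    """SHA-256 over all part data, fed incrementally in part order."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(part)
    return digest.hexdigest()


parts = [b"first part ", b"second part ", b"third part"]
headers = [content_md5(p) for p in parts]

# Server verifies each part, then stores the full-object checksum.
assert all(verify_part(p, h) for p, h in zip(parts, headers))
server_checksum = composite_sha256(parts)

# Client verifies end to end against its own local computation.
local_checksum = hashlib.sha256(b"".join(parts)).hexdigest()
print(server_checksum == local_checksum)
```

Streaming the hash part by part gives the same digest as hashing the concatenated object, so the server never needs the whole object in memory to produce the authoritative checksum.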
SQL DDL: Metadata Catalog
CREATE TABLE StorageBucket (
    id                 BIGSERIAL PRIMARY KEY,
    name               VARCHAR(255) NOT NULL UNIQUE,
    versioning_enabled BOOLEAN NOT NULL DEFAULT FALSE,
    lifecycle_policy   JSONB NOT NULL DEFAULT '[]',
    created_at         TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE StorageObject (
    id               BIGSERIAL PRIMARY KEY,
    bucket_id        BIGINT NOT NULL REFERENCES StorageBucket(id),
    key              TEXT NOT NULL,
    version_id       VARCHAR(64) NOT NULL,
    size_bytes       BIGINT NOT NULL,
    content_type     VARCHAR(255),
    checksum         VARCHAR(128) NOT NULL, -- SHA-256 hex
    storage_class    VARCHAR(32) NOT NULL DEFAULT 'STANDARD',
    is_delete_marker BOOLEAN NOT NULL DEFAULT FALSE,
    user_metadata    JSONB NOT NULL DEFAULT '{}',
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    -- A table can have only one PRIMARY KEY, so the natural key is enforced as UNIQUE
    UNIQUE (bucket_id, key, version_id)
);
-- Supports "latest version of a key" lookups
CREATE INDEX idx_obj_bucket_key ON StorageObject (bucket_id, key, created_at DESC);
-- Multipart upload parts
CREATE TABLE StoragePart (
    upload_id   VARCHAR(64) NOT NULL,
    part_number INTEGER NOT NULL CHECK (part_number BETWEEN 1 AND 10000),
    size_bytes  BIGINT NOT NULL,
    checksum    VARCHAR(64) NOT NULL, -- MD5 hex
    stored_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (upload_id, part_number)
);
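How the catalog resolves an unversioned GET can be sketched with Python's built-in `sqlite3`. This uses a cut-down SQLite version of the StorageObject table (integer timestamps, no JSONB), so it is an approximation of the PostgreSQL schema above rather than the schema itself; `latest_version` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE StorageObject (
        bucket_id        INTEGER NOT NULL,
        "key"            TEXT NOT NULL,
        version_id       TEXT NOT NULL,
        is_delete_marker INTEGER NOT NULL DEFAULT 0,
        created_at       INTEGER NOT NULL,
        UNIQUE (bucket_id, "key", version_id)
    )
""")
conn.execute("""
    CREATE INDEX idx_obj_bucket_key
    ON StorageObject (bucket_id, "key", created_at DESC)
""")


def latest_version(conn, bucket_id, key):
    """Return the newest version_id, or None if absent or delete-marked."""
    row = conn.execute(
        'SELECT version_id, is_delete_marker FROM StorageObject '
        'WHERE bucket_id = ? AND "key" = ? '
        'ORDER BY created_at DESC LIMIT 1',
        (bucket_id, key),
    ).fetchone()
    if row is None or row[1]:
        return None
    return row[0]


rows = [
    (1, "a.txt", "v1", 0, 100),
    (1, "a.txt", "v2", 0, 200),
]
conn.executemany("INSERT INTO StorageObject VALUES (?, ?, ?, ?, ?)", rows)
current = latest_version(conn, 1, "a.txt")
print(current)

# A DELETE on a versioned bucket inserts a delete marker row.
conn.execute("INSERT INTO StorageObject VALUES (1, 'a.txt', 'v3', 1, 300)")
after_delete = latest_version(conn, 1, "a.txt")
print(after_delete)
```

The older rows are untouched by the delete, so a GET with an explicit version_id can still find them.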
Python: Core Operations
import hashlib
import uuid
from typing import Optional
# Simplified in-memory object store for illustration
_buckets: dict[str, dict] = {}
_objects: dict[tuple, dict] = {} # (bucket, key, version_id) -> object
_parts: dict[str, dict] = {} # upload_id -> {part_number -> data}
def put_object(bucket: str, key: str, data: bytes, metadata: Optional[dict] = None) -> str:
    """Store an object and return its version_id."""
    metadata = metadata or {}
    version_id = uuid.uuid4().hex
    checksum = hashlib.sha256(data).hexdigest()
    _objects[(bucket, key, version_id)] = {
        'data': data,
        'size_bytes': len(data),
        'checksum': checksum,
        'content_type': metadata.get('content_type', 'application/octet-stream'),
        'user_metadata': metadata,
        'storage_class': 'STANDARD',
    }
    return version_id
def get_object(bucket: str, key: str, version_id: Optional[str] = None) -> Optional[dict]:
    """Retrieve an object. Returns latest version if version_id is None."""
    if version_id:
        return _objects.get((bucket, key, version_id))
    # Dicts preserve insertion order, so the last match is the most recent PUT
    # (production: index by created_at DESC)
    matches = [v for k, v in _objects.items() if k[0] == bucket and k[1] == key]
    return matches[-1] if matches else None
def create_multipart_upload(bucket: str, key: str) -> str:
    """Initiate a multipart upload and return the upload_id."""
    upload_id = str(uuid.uuid4())
    _parts[upload_id] = {'bucket': bucket, 'key': key, 'parts': {}}
    return upload_id
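def upload_part(upload_id: str, part_number: int, data: bytes) -> str:
    """Upload one part; return its ETag (MD5)."""
    if not 1 <= part_number <= 10000:
        raise ValueError("part_number must be between 1 and 10000")
    etag = hashlib.md5(data).hexdigest()
    # Re-uploading the same part_number overwrites the earlier attempt
    _parts[upload_id]['parts'][part_number] = {'data': data, 'etag': etag}
    return etag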
def complete_multipart_upload(upload_id: str, part_etags: list[tuple[int, str]]) -> str:
    """Validate ETags, concatenate parts in order, commit object atomically."""
    upload = _parts[upload_id]
    chunks = []
    for part_number, expected_etag in sorted(part_etags):
        part = upload['parts'].get(part_number)
        if not part or part['etag'] != expected_etag:
            # Nothing has been committed yet, so a mismatch fails the whole upload
            raise ValueError(f"Part {part_number} ETag mismatch")
        chunks.append(part['data'])
    # Join once at the end instead of repeatedly copying a growing bytes object
    version_id = put_object(upload['bucket'], upload['key'], b''.join(chunks))
    del _parts[upload_id]
    return version_id
Design Considerations Summary
- Erasure coding: 1.5x overhead vs 3x replication; prefer for warm/cold storage; adds CPU cost for degraded reads.
- Consistent hashing: virtual nodes are essential for even distribution; ring state must be consistent across the cluster.
- Multipart upload: enables parallelism, resume on failure, and streaming; always use for objects over 100 MB.
- Versioning: enables recovery from accidental deletes/overwrites; adds metadata catalog size proportional to object churn rate.
- Lifecycle policies: critical for cost management at scale; automate tier transitions and expiration.
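As a rough illustration of lifecycle evaluation, the sketch below checks one object against a list of rules. The rule fields (`prefix`, `transition_after_days`, `expire_after_days`, `storage_class`) and the `COLD` class name are assumptions invented for this example; the document does not define the shape of lifecycle_policy.

```python
from datetime import datetime, timedelta, timezone


def evaluate_lifecycle(obj, rules, now):
    """Return ('expire', None), ('transition', storage_class), or None."""
    age_days = (now - obj["created_at"]).days
    target = None  # (threshold_days, storage_class) of the best transition
    for rule in rules:
        if not obj["key"].startswith(rule.get("prefix", "")):
            continue
        expire = rule.get("expire_after_days")
        if expire is not None and age_days >= expire:
            return ("expire", None)  # expiration wins over any transition
        move = rule.get("transition_after_days")
        if move is not None and age_days >= move:
            if target is None or move > target[0]:
                target = (move, rule["storage_class"])
    if target and target[1] != obj["storage_class"]:
        return ("transition", target[1])
    return None


rules = [
    {"prefix": "logs/", "transition_after_days": 30, "storage_class": "COLD"},
    {"prefix": "logs/", "expire_after_days": 365},
]
now = datetime(2025, 1, 1, tzinfo=timezone.utc)


def make_obj(key, age_days):
    return {
        "key": key,
        "storage_class": "STANDARD",
        "created_at": now - timedelta(days=age_days),
    }


print(evaluate_lifecycle(make_obj("logs/app.log", 40), rules, now))
print(evaluate_lifecycle(make_obj("logs/app.log", 400), rules, now))
```

A background job could run a check like this over the metadata catalog and queue the resulting transitions and deletions, keeping the data path itself untouched.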