An object storage system stores arbitrary binary data (files, images, videos, backups) addressed by a key that is unique within its bucket (bucket names themselves are globally unique), with near-infinite scalability and high durability. Unlike block storage (attached disks) or file systems (hierarchical directories), object storage has a flat namespace and is accessed via HTTP APIs. Amazon S3 is the reference implementation; understanding its architecture is essential for system design interviews about file storage, media platforms, or data lakes.
Object Model and API
An object store organizes data into buckets (top-level namespaces) containing objects (key-value pairs where the value is a binary blob). Object key: a string path within the bucket (user-uploads/profile/123/avatar.jpg). Object metadata: content-type, content-length, ETag (the MD5 hash of the body for single-part uploads; an opaque, differently formatted value for multipart or encrypted uploads), custom user-defined headers, and system metadata (last-modified, version-id). Core API operations: PUT (upload an object), GET (download by key), DELETE (remove), HEAD (fetch metadata without the body), LIST (enumerate keys with a prefix). Multipart upload: for large objects (required above 5GB, the single-PUT limit; recommended above roughly 100MB), S3 allows uploading in parts (5MB-5GB each; the final part may be smaller) and assembling them server-side. Benefits: parallel uploads (upload parts simultaneously from multiple threads), resumable uploads (a failed part can be retried without restarting), and streaming very large files without loading them entirely into memory. S3 supports up to 10,000 parts per object and objects up to 5TB.
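The part-splitting and ETag behavior above can be sketched in a few lines. This is an illustrative stand-alone snippet, not S3 client code: the 5MB part size is S3's minimum for non-final parts, and the multipart ETag format (MD5 of the concatenated per-part MD5 digests, suffixed with the part count) is an observed S3 convention rather than a formal specification.

```python
import hashlib

PART_SIZE = 5 * 1024 * 1024  # 5 MB: S3's minimum size for non-final parts

def split_parts(data: bytes, part_size: int = PART_SIZE) -> list[bytes]:
    """Split a blob into multipart-upload parts; the last part may be smaller."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

def multipart_etag(parts: list[bytes]) -> str:
    """S3-style multipart ETag: MD5 over the concatenated binary MD5 digests
    of each part, plus '-<part count>' (observed convention, not a spec)."""
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return hashlib.md5(digests).hexdigest() + f"-{len(parts)}"

blob = b"x" * (12 * 1024 * 1024)   # a 12 MB object -> 3 parts (5 + 5 + 2 MB)
parts = split_parts(blob)
print(len(parts), multipart_etag(parts))
```

This is why an ETag ending in "-3" signals a three-part multipart upload, and why it cannot be compared against a plain MD5 of the whole object.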
Storage Layer Architecture
S3’s storage architecture (inferred from Amazon’s public talks and comparable systems; the internal design is not fully published): objects are stored across a distributed storage cluster. A metadata service maps (bucket, key) → a list of storage nodes where the object’s chunks are stored. Upload flow: (1) Client sends PUT to an S3 frontend server. (2) Frontend generates an object ID and looks up or assigns storage nodes from the storage placement service (based on available capacity and fault domain: different racks and AZs). (3) Data is written to multiple storage nodes in parallel (e.g., 3-way replication, or erasure-coded fragments as described below). (4) When all writes acknowledge, the metadata service atomically records the mapping (bucket+key → object_id + storage nodes). (5) Return 200 to the client with the ETag. Chunk-based storage: large objects are split into fixed-size chunks (e.g., 64MB or 128MB). Each chunk is stored as a separate unit on storage nodes, and the metadata service tracks chunks per object. This enables efficient partial reads (an S3 Range GET for bytes 0-1023 of a 10GB object reads only the relevant chunk).
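The partial-read optimization comes down to mapping an HTTP byte range onto chunk indices. A minimal sketch, assuming fixed 64MB chunks (the function name and tuple layout are illustrative, not from any real S3 component):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumed fixed 64 MB chunks

def chunks_for_range(first_byte: int, last_byte: int,
                     chunk_size: int = CHUNK_SIZE) -> list[tuple[int, int, int]]:
    """Map an inclusive HTTP byte range [first_byte, last_byte] onto the
    chunks that must be read: (chunk_index, start_within_chunk, end_within_chunk)."""
    reads = []
    for idx in range(first_byte // chunk_size, last_byte // chunk_size + 1):
        start = max(first_byte, idx * chunk_size) - idx * chunk_size
        end = min(last_byte, (idx + 1) * chunk_size - 1) - idx * chunk_size
        reads.append((idx, start, end))
    return reads

# Range GET for bytes 0-1023 of a 10 GB object touches only chunk 0:
print(chunks_for_range(0, 1023))
```

A range that straddles a chunk boundary expands to two reads, one per chunk, which the frontend can issue to the two storage nodes in parallel.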
Durability: Erasure Coding
S3 achieves 99.999999999% (11 nines) durability. Simple 3x replication costs 3x storage overhead; at S3’s scale (exabytes), that is too expensive. Erasure coding: encode each object (or chunk) into N + K fragments using a Reed-Solomon or similar code, storing N data fragments plus K parity fragments across N + K storage nodes. The object can be reconstructed from any N fragments, tolerating up to K failures. S3 uses erasure coding for objects in the standard storage class: a 1MB object might be encoded as 6 data fragments + 4 parity fragments across 10 storage nodes in different availability zones, so any 4 fragments can be lost and the object is still recoverable. Storage overhead: (N+K)/N = 10/6 ≈ 1.67x, much better than 3x replication. Trade-off: reconstruction requires reading N fragments from N nodes and computing the decode (read amplification). For hot, frequently accessed data, 3x replication has lower read latency; for cold data (Glacier), wider erasure-coded stripes (larger N relative to K) are preferred for cost efficiency.
Metadata Service and Consistency
The metadata service is the heart of object storage: it maps keys to storage locations. Requirements: highly available (every read and write depends on it), strongly consistent (a PUT must be immediately visible to subsequent GETs; S3 moved to strong consistency in December 2020). Implementation: a distributed key-value store with strong consistency semantics (similar to etcd or HBase); S3 reportedly uses a custom metadata store backed by a Paxos-style replicated system. Key operations: atomic compare-and-swap for conditional writes (the If-None-Match: * header: write only if the key does not already exist), versioning support (each PUT creates a new version ID; the metadata service maintains a version chain), and lifecycle policies (automatically delete or transition objects after N days, implemented as scheduled metadata scans). Consistent listing: LIST operations return all objects with a given prefix in lexicographic order. Achieving consistent listing at scale (a bucket with billions of objects) requires the metadata index to be fully consistent; eventual consistency in listing was a major S3 limitation for years before the 2020 consistency upgrade.
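The conditional-write and version-chain semantics can be made concrete with a toy in-memory store. All names here are illustrative (this is not S3's internal API), but the behavior mirrors the section above: If-None-Match rejects a PUT when the key exists, and every successful PUT appends to a version chain:

```python
import itertools
import time

class MetadataStore:
    """Toy sketch of a metadata service's conditional writes and versioning."""

    def __init__(self):
        self._versions = {}          # (bucket, key) -> list of version records
        self._ids = itertools.count(1)

    def put(self, bucket, key, locations, if_none_match=False):
        chain = self._versions.setdefault((bucket, key), [])
        if if_none_match and chain:  # compare-and-swap: fail if key exists
            raise KeyError("412 Precondition Failed: key already exists")
        record = {"version_id": f"v{next(self._ids)}",
                  "locations": locations,
                  "last_modified": time.time()}
        chain.append(record)         # each PUT appends a new version
        return record["version_id"]

    def get(self, bucket, key, version_id=None):
        chain = self._versions.get((bucket, key), [])
        if not chain:
            raise KeyError("404 Not Found")
        if version_id is None:
            return chain[-1]         # unversioned GET returns the latest
        return next(r for r in chain if r["version_id"] == version_id)

store = MetadataStore()
v1 = store.put("media", "avatar.jpg", ["node-a", "node-b", "node-c"])
v2 = store.put("media", "avatar.jpg", ["node-d", "node-e", "node-f"])
print(store.get("media", "avatar.jpg")["version_id"])  # latest version
```

In a real deployment each of these operations would be a consensus round in the replicated store, which is what makes the mapping update in step (4) of the upload flow atomic.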
Pre-Signed URLs and Access Control
Direct client uploads to S3: instead of routing uploads through your application server (adding latency and server bandwidth cost), generate a pre-signed URL that allows a client to upload directly to S3 with temporary credentials. Pre-signed URL generation: the AWS SDK signs a URL with your credentials, specifying bucket, key, HTTP method (PUT), and expiry (e.g., 15 minutes). The client uses the pre-signed URL to PUT the file directly to S3; your server never sees the bytes. Your server receives a callback (S3 event notification → SQS → your application) when the upload completes. Access control: bucket policies (who can read/write which keys), IAM roles (EC2 instances get read access without storing credentials), and ACLs (per-object access control lists, deprecated in favor of bucket policies). CORS configuration: allow browser-based uploads from specific origins. Pre-signed download URLs: generate a temporary link to a private object that expires after a configurable interval (e.g., 1 hour). The link authenticates the request using query parameters (AWSAccessKeyId, Signature, Expires in the legacy Signature V2 scheme; X-Amz-Credential, X-Amz-Signature, X-Amz-Expires under the current SigV4 scheme), so no cookies or headers are required. Useful for serving private media files directly from S3 without a proxy.
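To show the mechanics, here is a sketch of the legacy Signature V2 query-string scheme (the one using the AWSAccessKeyId/Signature/Expires parameters mentioned above), chosen because it fits in a few lines; in practice you would call the SDK, which now signs with SigV4. The credentials below are obviously fake:

```python
import base64
import hashlib
import hmac
import time
import urllib.parse

def presign_get(bucket: str, key: str, access_key: str, secret_key: str,
                ttl: int = 3600) -> str:
    """Sketch of a legacy SigV2 pre-signed GET URL: sign the canonical string
    'GET\\n\\n\\n{expires}\\n/{bucket}/{key}' with HMAC-SHA1 of the secret key."""
    expires = int(time.time()) + ttl
    string_to_sign = f"GET\n\n\n{expires}\n/{bucket}/{key}"
    sig = base64.b64encode(
        hmac.new(secret_key.encode(), string_to_sign.encode(),
                 hashlib.sha1).digest()).decode()
    query = urllib.parse.urlencode(
        {"AWSAccessKeyId": access_key, "Expires": expires, "Signature": sig})
    return f"https://{bucket}.s3.amazonaws.com/{key}?{query}"

url = presign_get("private-media", "reports/q3.pdf", "AKIAEXAMPLE", "fake-secret")
print(url.split("?")[0])
```

The key property is that the signature commits to the method, path, and expiry, so S3 can verify the link server-side with no session state; anyone holding the URL can fetch the object until it expires.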