Object storage is the foundation of modern cloud infrastructure. Amazon S3, Google Cloud Storage, and Azure Blob Storage store trillions of objects ranging from user uploads to database backups to ML training datasets. Understanding object storage internals — how data is distributed, protected against loss, and served efficiently — is essential for system design interviews and cloud architecture decisions.
Object Storage vs Block Storage vs File Storage
Block storage (EBS, Azure Disk): raw disk volumes attached to compute instances. Fixed-size blocks (512 bytes to 4 KB). Supports random reads/writes, filesystems, and databases. Low latency (sub-millisecond). Attached to one instance at a time.

File storage (EFS, NFS): a hierarchical filesystem shared across multiple instances. Supports directories, permissions, and file locking. Higher latency than block storage but supports concurrent access.

Object storage (S3): a flat namespace of objects identified by keys (bucket + key path). Each object contains data (bytes), metadata (key-value pairs), and a unique identifier. No hierarchy — the “/” in keys is just a naming convention. No in-place updates: objects are immutable, and overwriting a key creates a new version. High latency to first byte (50-100 ms) but high throughput for large sequential reads. Virtually unlimited capacity and object count.

Use object storage for user uploads (images, videos, documents), static website assets, data lake storage (Parquet, Avro files), backups, and log archives. Use block storage for databases, operating systems, and applications requiring low-latency random I/O.
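The "flat namespace with delimiter convention" point can be made concrete with a small sketch. This is not a real S3 API — the function and bucket contents are invented for illustration — but it mirrors how ListObjects turns plain string keys into apparent "directories" at query time:

```python
# Sketch: an object store's namespace is flat -- keys are plain strings,
# and "directories" are just a prefix/delimiter convention applied at list time.
# All names here are illustrative, not a real S3 API.

bucket = {
    "photos/2024/cat.jpg": b"...",
    "photos/2024/dog.jpg": b"...",
    "photos/2025/bird.jpg": b"...",
    "readme.txt": b"...",
}

def list_objects(bucket, prefix="", delimiter="/"):
    """Emulate ListObjects: return (keys, common_prefixes) under a prefix."""
    keys, common_prefixes = [], set()
    for key in sorted(bucket):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the first delimiter looks like a "subdirectory".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            keys.append(key)
    return keys, sorted(common_prefixes)

keys, prefixes = list_objects(bucket, prefix="photos/")
print(keys)       # []
print(prefixes)   # ['photos/2024/', 'photos/2025/']
```

Nothing in the store itself knows about folders; "photos/2024/" exists only because two keys happen to share that prefix.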
S3 Architecture Internals
S3 separates metadata from data:

(1) Metadata service — stores object metadata (bucket, key, size, checksum, version, ACL, custom metadata) in a distributed key-value store. Handles ListObjects (enumerate keys in a bucket), HeadObject (retrieve metadata without data), and routes PutObject/GetObject to the correct data nodes.

(2) Data service — stores the actual object bytes. Objects are split into chunks; each chunk is erasure-coded and distributed across multiple storage nodes and availability zones.

A PutObject request: the metadata service validates the request (authentication, authorization, bucket existence), assigns the object to data nodes, and returns a presigned data upload URL. The client uploads chunks to the data nodes. After all chunks are stored and verified, the metadata service commits the object, making it visible to GetObject.

A GetObject request: the metadata service looks up the object's location and returns the data node addresses, and the client reads directly from the data nodes.

This separation lets the metadata service be optimized for small, fast lookups while the data service is optimized for large, sequential I/O.
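The metadata/data split above can be sketched as a toy in-memory model. The class and method names are invented for illustration (real S3 internals are far more involved), but the flow matches: split into chunks, spread across data nodes, commit metadata only after every chunk is stored:

```python
# Toy sketch of the metadata/data separation: a metadata service that maps
# (bucket, key) to chunk locations, plus data nodes holding the bytes.
# Names are illustrative, not a real S3 implementation.
import hashlib

class DataNode:
    def __init__(self):
        self.chunks = {}                     # chunk_id -> bytes
    def put_chunk(self, chunk_id, data):
        self.chunks[chunk_id] = data
    def get_chunk(self, chunk_id):
        return self.chunks[chunk_id]

class MetadataService:
    CHUNK_SIZE = 4                           # tiny, for demonstration only

    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.objects = {}                    # (bucket, key) -> committed metadata

    def put_object(self, bucket, key, data):
        # 1. Split into chunks, 2. spread them round-robin across data nodes,
        # 3. commit metadata only after every chunk is stored.
        locations = []
        for i in range(0, len(data), self.CHUNK_SIZE):
            chunk = data[i:i + self.CHUNK_SIZE]
            node_idx = (i // self.CHUNK_SIZE) % len(self.data_nodes)
            chunk_id = hashlib.sha256(chunk).hexdigest()
            self.data_nodes[node_idx].put_chunk(chunk_id, chunk)
            locations.append((node_idx, chunk_id))
        self.objects[(bucket, key)] = {"size": len(data), "locations": locations}

    def get_object(self, bucket, key):
        # Look up chunk locations, then read each chunk from its data node.
        meta = self.objects[(bucket, key)]
        return b"".join(self.data_nodes[n].get_chunk(c) for n, c in meta["locations"])

svc = MetadataService([DataNode() for _ in range(3)])
svc.put_object("media", "cat.jpg", b"hello object storage")
assert svc.get_object("media", "cat.jpg") == b"hello object storage"
```

Note how the metadata lookup is a small key-value read while the data path is bulk byte transfer — the two workloads the real services are tuned for separately.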
Data Durability with Erasure Coding
S3 promises 99.999999999% (11 nines) durability — the probability of losing a given object in a given year is 0.000000001% (10^-11). This is achieved with erasure coding, not simple replication.

Simple replication (3 copies): store 3 copies of the data across 3 availability zones. 3x storage overhead; tolerates 2 simultaneous failures.

Erasure coding (e.g., Reed-Solomon 6+3): split the data into 6 data chunks and compute 3 parity chunks. Store all 9 chunks across different nodes/AZs. Any 6 of the 9 chunks can reconstruct the original data. Storage overhead: 1.5x (9 chunks for 6 units of data) vs 3x for replication. Tolerates 3 simultaneous failures at half the storage cost. S3 uses a variant of erasure coding optimized for its hardware.

The repair process: when a storage node fails, the system reads surviving chunks from other nodes, recomputes the missing chunks, and stores them on new nodes. This repair happens automatically and continuously. At 11 nines of durability, if you store 10 million objects you can expect to lose a single object roughly once every 10,000 years.
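The "recompute missing chunks from survivors" idea can be demonstrated with the simplest erasure code: k data chunks plus one XOR parity chunk, which survives the loss of any one chunk. Production systems use Reed-Solomon (e.g., 6+3) to survive several simultaneous losses, but the repair principle is the same:

```python
# Minimal erasure-coding sketch: k data chunks plus one XOR parity chunk,
# the simplest (k+1) code -- it survives the loss of any ONE chunk.
# Reed-Solomon generalizes this to multiple parity chunks.
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    """Split data into k equal chunks and append one XOR parity chunk."""
    assert len(data) % k == 0
    size = len(data) // k
    chunks = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, chunks)       # parity = c0 ^ c1 ^ ... ^ c(k-1)
    return chunks + [parity]

def reconstruct(chunks, lost_index):
    """Rebuild the chunk at lost_index by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(chunks) if i != lost_index]
    return reduce(xor_bytes, survivors)

data = b"object storage!!"                   # 16 bytes -> 4 data chunks of 4 bytes
stored = encode(data, k=4)                   # 5 chunks total: 1.25x storage overhead
assert reconstruct(stored, 1) == stored[1]   # repair recomputes a lost data chunk
assert reconstruct(stored, 4) == stored[4]   # ...or the lost parity chunk itself
assert b"".join(stored[:4]) == data
```

The XOR works because every data chunk appears exactly twice in the XOR of (survivors + parity), cancelling out, except the missing one. A 6+3 Reed-Solomon code replaces XOR with arithmetic over a finite field so that any 3 losses are recoverable.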
Multipart Upload
Large files (>100 MB) should use multipart upload instead of a single PutObject. The process:

(1) Initiate — the client calls CreateMultipartUpload and receives an upload_id.

(2) Upload parts — the client splits the file into parts (5 MB to 5 GB each) and uploads each part independently with the upload_id and part_number. Parts can be uploaded in parallel for higher throughput, and even from different machines.

(3) Complete — the client calls CompleteMultipartUpload with the list of part ETags. S3 assembles the parts into the final object.

Benefits: (1) Resumability — if a part upload fails, retry only that part, not the entire file; critical for large files over unreliable networks. (2) Parallelism — upload parts simultaneously. For a 10 GB file split into 100 MB parts, 100 parallel uploads saturate the network much faster than a single stream. (3) No size limit — a single PutObject is limited to 5 GB, while multipart upload supports objects up to 5 TB.

If a multipart upload is not completed within a configurable time (via a lifecycle rule), S3 automatically aborts it and deletes the uploaded parts, reclaiming storage.
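The three-step flow above can be sketched against an in-memory stand-in for S3. The FakeS3 class and its method names mirror the real API shape (CreateMultipartUpload / UploadPart / CompleteMultipartUpload, e.g. via boto3) but are invented here; a tiny part size is used so the demo runs instantly:

```python
# Sketch of the multipart-upload flow: initiate, upload parts in parallel,
# complete with the ETag list. FakeS3 is an illustrative in-memory stand-in.
import hashlib
import uuid
from concurrent.futures import ThreadPoolExecutor

class FakeS3:
    def __init__(self):
        self.uploads = {}    # upload_id -> {part_number: (etag, bytes)}
        self.objects = {}    # key -> assembled bytes

    def create_multipart_upload(self, key):
        upload_id = uuid.uuid4().hex
        self.uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id, part_number, body):
        etag = hashlib.md5(body).hexdigest()
        self.uploads[upload_id][part_number] = (etag, body)
        return etag

    def complete_multipart_upload(self, key, upload_id, parts):
        # Verify the client's ETag list, then assemble parts in number order.
        staged = self.uploads.pop(upload_id)
        assert all(staged[n][0] == etag for n, etag in parts)
        self.objects[key] = b"".join(staged[n][1] for n, _ in sorted(parts))

PART_SIZE = 8                                # tiny, for demonstration only
s3, data = FakeS3(), b"a large file uploaded in independent parts"

upload_id = s3.create_multipart_upload("backup.bin")                    # (1) initiate
parts = [(i // PART_SIZE + 1, data[i:i + PART_SIZE])
         for i in range(0, len(data), PART_SIZE)]
with ThreadPoolExecutor(max_workers=4) as pool:                         # (2) parallel parts
    etags = list(pool.map(lambda p: (p[0], s3.upload_part(upload_id, *p)), parts))
s3.complete_multipart_upload("backup.bin", upload_id, etags)            # (3) complete
assert s3.objects["backup.bin"] == data
```

Resumability falls out of this structure: each part is independent, so a failed part is retried with the same upload_id and part_number without touching the others.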
Presigned URLs and Access Control
Presigned URLs grant temporary access to private S3 objects without requiring the requester to have AWS credentials. The bucket owner generates a presigned URL that includes the object key, an expiration time (1 second to 7 days), and a cryptographic signature derived from the owner's credentials. Anyone with the URL can access the object until it expires.

Use cases: (1) Serving private content — generate a presigned URL for each image in a user's photo gallery, expiring in 1 hour; the user can view images without direct S3 access. (2) Direct upload from the browser — generate a presigned PUT URL so the browser uploads directly to S3, bypassing your application server and offloading bandwidth and CPU from your backend. A presigned URL constrains the object key, HTTP method, and expiration; a presigned POST policy can additionally constrain the maximum file size and content type.

Security: presigned URLs should have short expiration times, so that if a URL leaks, the window of unauthorized access is limited. For sensitive content, combine presigned URLs with CloudFront signed URLs (which can also restrict by IP address and support shorter TTLs).

Access control layers: IAM policies (who can call S3 APIs), bucket policies (bucket-level access rules), ACLs (object-level permissions, largely deprecated in favor of bucket policies), and presigned URLs (temporary access tokens).
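The core mechanism — sign (method, key, expiry) with a secret the client never sees, verify on access — can be shown with a simplified HMAC sketch. This illustrates the idea only; real S3 presigning uses AWS Signature Version 4, and all names below are invented:

```python
# Simplified presigned-URL sketch: the owner signs (method, key, expiry) with
# a secret; the server later verifies signature and expiry before serving.
# Illustrative only -- real S3 presigning uses AWS Signature v4.
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"owner-secret-key"     # stands in for the owner's credentials

def presign(key, expires_in):
    expires = int(time.time()) + expires_in
    msg = f"GET\n{key}\n{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"/{key}?" + urlencode({"Expires": expires, "Signature": sig})

def verify(key, expires, signature):
    if int(expires) < time.time():
        return False                         # URL has expired
    msg = f"GET\n{key}\n{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)   # constant-time compare

url = presign("photos/cat.jpg", expires_in=3600)
params = dict(p.split("=") for p in url.split("?")[1].split("&"))
assert verify("photos/cat.jpg", params["Expires"], params["Signature"])
# Tampering with the key invalidates the signature:
assert not verify("photos/dog.jpg", params["Expires"], params["Signature"])
```

Because the expiry is inside the signed message, a holder of the URL cannot extend it; changing either the key or the Expires parameter breaks verification.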
Object Storage in System Design
When to use S3 in your design:

(1) User-uploaded media — profile pictures, post images, videos. The application server generates a presigned upload URL, the client uploads directly to S3, a Lambda function triggered by the S3 event processes the upload (resize images, transcode video), and the result is served via the CloudFront CDN.

(2) Static website hosting — HTML, CSS, JS, and images served directly from S3 with CloudFront; no web server needed.

(3) Data lake — store raw data (JSON, CSV, Parquet) in S3. Query with Athena (SQL over S3), process with Spark, or load into a data warehouse.

(4) Backup and archival — database backups, log archives. Use S3 lifecycle policies to transition to cheaper storage classes: S3 Standard (frequent access) -> S3 Infrequent Access (30-day minimum, lower cost) -> S3 Glacier (archival, retrieval in minutes to hours) -> S3 Glacier Deep Archive (lowest cost, retrieval in about 12 hours).

Cost optimization: S3 Standard is $0.023/GB/month; Glacier Deep Archive is $0.00099/GB/month, about 23x cheaper. Lifecycle policies automate the transition based on object age.
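The cost-optimization arithmetic is worth checking with a back-of-the-envelope model. The prices below are the figures quoted in the text (the Infrequent Access price is an assumed illustrative figure; always check current regional pricing):

```python
# Back-of-the-envelope lifecycle cost model for the storage classes above.
# Standard and Deep Archive prices are those quoted in the text; the
# Infrequent Access figure is illustrative. Check current pricing.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,           # assumed illustrative price
    "GLACIER_DEEP_ARCHIVE": 0.00099,
}

def monthly_cost(gb, storage_class):
    return gb * PRICE_PER_GB_MONTH[storage_class]

# Example: 100 TB of year-old log archives.
gb = 100 * 1024
standard = monthly_cost(gb, "STANDARD")
deep = monthly_cost(gb, "GLACIER_DEEP_ARCHIVE")
print(f"Standard:     ${standard:,.2f}/month")   # $2,355.20/month
print(f"Deep Archive: ${deep:,.2f}/month")       # $101.38/month
print(f"Savings:      {standard / deep:.0f}x")   # 23x
```

At 100 TB the lifecycle transition saves over $2,200/month, which is why "transition after N days" rules are near-universal for log and backup buckets.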