What Is a Media Storage Service?
A media storage service is a backend system responsible for ingesting, persisting, organizing, and serving binary assets such as images, audio files, and documents. It abstracts raw object storage (S3, GCS, Azure Blob) behind a unified API, enforcing access control, deduplication, metadata indexing, and CDN integration. At scale it handles millions of uploads per day, petabytes of stored data, and low-latency reads for end users worldwide.
Data Model / Schema
-- Dialect note: written for PostgreSQL. Inline ENUM(...) is MySQL syntax,
-- so enumerated columns use VARCHAR plus a CHECK constraint instead.

-- Core asset record
CREATE TABLE media_assets (
    asset_id        UUID PRIMARY KEY,
    owner_id        BIGINT NOT NULL,
    bucket          VARCHAR(128) NOT NULL,
    object_key      VARCHAR(1024) NOT NULL,
    mime_type       VARCHAR(128) NOT NULL,
    size_bytes      BIGINT NOT NULL,
    checksum_sha256 CHAR(64) NOT NULL,
    status          VARCHAR(16) NOT NULL DEFAULT 'pending'
                    CHECK (status IN ('pending', 'ready', 'deleted')),
    created_at      TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMP NOT NULL DEFAULT NOW()
);

-- Tag / metadata index
CREATE TABLE asset_metadata (
    asset_id UUID REFERENCES media_assets(asset_id),
    key      VARCHAR(64) NOT NULL,
    value    TEXT NOT NULL,
    PRIMARY KEY (asset_id, key)
);

-- Access-control list
CREATE TABLE asset_acl (
    asset_id   UUID REFERENCES media_assets(asset_id),
    principal  VARCHAR(256) NOT NULL, -- user_id, group_id, or '*'
    permission VARCHAR(16) NOT NULL
               CHECK (permission IN ('read', 'write', 'delete')),
    PRIMARY KEY (asset_id, principal, permission)
);
Blob data lives in object storage keyed by bucket/object_key. The relational layer stores only metadata and pointers, keeping the database small and fast.
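To make the asset_acl table concrete, here is a minimal sketch of how a permission check over its rows might look. The `AclRow` type, the `is_allowed` function, and the `user:42` / `group:eng` principal format are illustrative assumptions, not part of the schema; only the three columns and the '*' wildcard come from the table definition.

```python
from typing import Iterable, NamedTuple, Set


class AclRow(NamedTuple):
    """One row of asset_acl; field names mirror the table's columns."""
    asset_id: str
    principal: str   # user_id, group_id, or '*'
    permission: str  # 'read', 'write', or 'delete'


def is_allowed(rows: Iterable[AclRow], asset_id: str,
               caller_principals: Set[str], permission: str) -> bool:
    """Return True if '*' or any of the caller's principals holds the permission."""
    for row in rows:
        if row.asset_id != asset_id or row.permission != permission:
            continue
        if row.principal == "*" or row.principal in caller_principals:
            return True
    return False


# Hypothetical rows: everyone may read asset a1, only user:42 may write it.
rows = [
    AclRow("a1", "*", "read"),
    AclRow("a1", "user:42", "write"),
]
```

In a real service this would more likely be a single indexed query against asset_acl than an in-memory scan; the loop just spells out the matching rule.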
Core Workflow: Upload Pipeline
- Pre-signed URL generation. The client calls POST /media/upload-url. The service validates quota and returns a time-limited pre-signed PUT URL pointing directly at object storage, which offloads upload bandwidth from the API tier.
- Direct upload. The client streams the binary to object storage. A storage-side event (S3 event notifications via EventBridge, or GCS Pub/Sub notifications) fires on completion.
- Post-upload processing. An async worker consumes the event: it verifies the checksum, extracts EXIF/MIME metadata, runs a virus scan, and flips status to ready.
- CDN registration. The asset URL is registered with the CDN edge layer. Subsequent reads are served from cache without hitting origin storage.
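The first step of the pipeline can be illustrated with a simplified, standard-library stand-in for a pre-signed URL: an HMAC over the method, path, and expiry. This is not the signing scheme S3 or GCS actually use (a real service would call the provider SDK, for example boto3's `generate_presigned_url`); the secret, host name, and function names below are assumptions made for the sketch, and object keys are assumed to be URL-safe.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-signing-key"                  # hypothetical; load from a secret store in practice
STORAGE_HOST = "https://storage.example.com"  # hypothetical storage endpoint


def _sign(method: str, path: str, expires: int) -> str:
    """HMAC-SHA256 over method, path, and expiry, so none can be altered."""
    message = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def make_upload_url(bucket: str, object_key: str,
                    ttl_seconds: int = 900, now: Optional[float] = None) -> str:
    """Return a time-limited URL that authorizes a single PUT to bucket/object_key."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    path = f"/{bucket}/{object_key}"
    query = urlencode({"expires": expires, "signature": _sign("PUT", path, expires)})
    return f"{STORAGE_HOST}{path}?{query}"


def verify_upload_url(method: str, url: str, now: Optional[float] = None) -> bool:
    """Storage-side check: reject expired, tampered, or wrong-method requests."""
    now = time.time() if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    expected = _sign(method, parsed.path, expires)
    return hmac.compare_digest(signature, expected)
```

The point of the design is that the API tier only computes a signature; the bytes themselves never pass through it.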
Failure Handling
- Partial uploads: Pre-signed URLs expire (e.g., after 15 min). A nightly cleanup job deletes pending records older than the TTL and purges the orphaned objects.
- Processing failures: Workers use at-least-once delivery with idempotency keys (the checksum). Poison-pill messages are routed to a dead-letter queue for manual inspection.
- Storage outage: Multi-region replication (cross-region replication in S3 or dual-write to GCS) ensures durability. Read traffic fails over to the secondary region via Route 53 / Cloud DNS health checks.
- Corruption: The SHA-256 checksum is verified server-side after upload and again during scheduled integrity scans.
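The processing-failure and corruption bullets can be sketched together as one worker loop. This is a minimal in-memory model, assuming a retry budget of three deliveries and a `delivery_count` field on the event; both are illustrative choices, as are the class and field names. The dictionaries stand in for the media_assets table and object storage.

```python
import hashlib
from dataclasses import dataclass, field

MAX_ATTEMPTS = 3  # assumed retry budget before a message is dead-lettered


@dataclass
class Asset:
    asset_id: str
    checksum_sha256: str
    status: str = "pending"


@dataclass
class Worker:
    assets: dict  # asset_id -> Asset (stand-in for media_assets)
    blobs: dict   # asset_id -> bytes (stand-in for object storage)
    processed: set = field(default_factory=set)       # idempotency keys already handled
    dead_letters: list = field(default_factory=list)  # stand-in for the dead-letter queue

    def handle(self, event: dict) -> str:
        asset = self.assets[event["asset_id"]]
        key = asset.checksum_sha256  # the checksum doubles as the idempotency key
        if key in self.processed:
            return "duplicate"  # redelivered message: safe to acknowledge and skip
        try:
            self._process(asset)
        except ValueError:
            if event.get("delivery_count", 1) >= MAX_ATTEMPTS:
                self.dead_letters.append(event)  # poison pill: park for inspection
                return "dead-lettered"
            return "retry"
        self.processed.add(key)
        return "processed"

    def _process(self, asset: Asset) -> None:
        # Recompute the digest from the stored bytes and compare with the record.
        actual = hashlib.sha256(self.blobs[asset.asset_id]).hexdigest()
        if actual != asset.checksum_sha256:
            raise ValueError("checksum mismatch")
        asset.status = "ready"
```

Because the status flip happens only after verification succeeds, a redelivered or failed message never leaves an asset marked ready with unverified content.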
Scalability Considerations
- Throughput: With pre-signed URLs, upload bytes go straight to object storage and the API tier only issues the URLs, so upload throughput scales with object storage capacity rather than with application servers.
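- Deduplication: A checksum lookup before insert prevents duplicate objects. Identical files share one object_key and increment a reference count (a counter not shown in the schema above), so the object is purged only when its last reference is deleted.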
- Metadata reads: Hot asset metadata is cached in Redis with a short TTL. Cache invalidation fires on status transitions.
- Storage tiering: Assets not accessed in 90 days are automatically transitioned to infrequent-access or Glacier-class tiers via lifecycle policies, cutting cost by ~60%.
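The deduplication bullet above can be sketched as a checksum-keyed index with reference counts. The `DedupIndex` class and its method names are hypothetical, and the in-memory dictionary stands in for what would be a uniquely indexed checksum column updated inside a transaction to avoid races between concurrent uploads.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Tuple


def sha256_hex(data: bytes) -> str:
    """Hex digest in the same form as the checksum_sha256 column."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class StoredObject:
    object_key: str
    ref_count: int


class DedupIndex:
    """Maps a content checksum to the single stored object that holds it."""

    def __init__(self) -> None:
        self._by_checksum: Dict[str, StoredObject] = {}

    def register(self, checksum: str, proposed_key: str) -> Tuple[str, bool]:
        """Return (object_key to use, whether a new object must be stored)."""
        existing = self._by_checksum.get(checksum)
        if existing is not None:
            existing.ref_count += 1  # identical content: reuse the stored object
            return existing.object_key, False
        self._by_checksum[checksum] = StoredObject(proposed_key, 1)
        return proposed_key, True

    def release(self, checksum: str) -> bool:
        """Drop one reference; return True when the object can be purged."""
        obj = self._by_checksum[checksum]
        obj.ref_count -= 1
        if obj.ref_count == 0:
            del self._by_checksum[checksum]
            return True
        return False

    def ref_count(self, checksum: str) -> int:
        obj = self._by_checksum.get(checksum)
        return obj.ref_count if obj else 0
```

A second upload of the same bytes gets back the first upload's object_key, and the blob is only eligible for deletion once every asset that points at it has been released.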
Summary
A well-designed media storage service decouples upload bandwidth from API capacity using pre-signed URLs, keeps the relational layer lean with pointer-based metadata, and pushes reads to CDN edges. Idempotent async processing and checksum verification protect data integrity, multi-region replication provides durability, and lifecycle tiering controls cost at scale.