What is a Media Upload Service?
A media upload service handles the ingest, validation, processing, and storage of user-generated files: profile pictures, product photos, video clips, documents. It must accept large files without tying up web servers, process uploads asynchronously (resize, transcode, virus scan), and serve the processed media efficiently from a CDN. Products such as Instagram, YouTube, and Dropbox have to solve the same problem at much larger scale.
Requirements
- Upload images (up to 20MB) and videos (up to 2GB)
- Validate file type (MIME check, not just extension) and scan for malware
- Process images: resize to multiple resolutions (thumbnail, medium, large)
- Process videos: transcode to H.264/AAC, generate thumbnail at 1s mark
- Serve processed media via CDN with cache headers
- 50K image uploads/day, 5K video uploads/day
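As a rough sizing exercise, the daily volumes above can be turned into an average request rate and a worst-case ingest figure. The sketch below assumes every file hits its size cap, which overstates real traffic, and uses decimal units (1 TB = 1,000,000 MB); the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def estimate_load(images_per_day, videos_per_day, image_cap_mb, video_cap_mb):
    """Return the average upload rate and worst-case daily ingest in terabytes."""
    uploads_per_sec = (images_per_day + videos_per_day) / SECONDS_PER_DAY
    image_tb = images_per_day * image_cap_mb / 1_000_000  # MB -> TB
    video_tb = videos_per_day * video_cap_mb / 1_000_000
    return {
        "uploads_per_sec": uploads_per_sec,
        "image_tb_per_day": image_tb,
        "video_tb_per_day": video_tb,
    }

estimate = estimate_load(50_000, 5_000, image_cap_mb=20, video_cap_mb=2_000)
print(estimate)
```

Under these assumptions the service sees less than one upload per second on average, with at most about 1 TB/day of images and 10 TB/day of video. Video dominates storage and bandwidth even though it is a tenth of the upload count.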
Upload Flow: Presigned URLs
Avoid streaming large files through your application server: each upload ties up a worker process, memory, and bandwidth for the whole transfer. Use S3 presigned URLs so the client uploads directly to object storage:
# 1. Client requests an upload URL from your API
POST /api/media/upload-url
{ "filename": "photo.jpg", "content_type": "image/jpeg", "size_bytes": 4200000 }
# 2. API server generates presigned URL (no file goes through your server)
# s3 is a boto3 S3 client; RAW_BUCKET, db, and Media are defined elsewhere
from uuid import uuid4

def get_upload_url(filename, content_type, size_bytes):
    media_id = str(uuid4())
    object_key = f'uploads/raw/{media_id}/{filename}'
    presigned_url = s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': RAW_BUCKET, 'Key': object_key,
                'ContentType': content_type},
        ExpiresIn=900  # 15 minutes
    )
    # Record pending upload
    db.insert(Media(id=media_id, status='PENDING', raw_key=object_key, ...))
    return {'media_id': media_id, 'upload_url': presigned_url}
# 3. Client PUTs file directly to S3
PUT {presigned_url}
Content-Type: image/jpeg
[file bytes]
# 4. Client notifies API that upload is complete
POST /api/media/{media_id}/complete
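Before issuing a presigned URL in step 2, the API can reject requests that already violate the requirements. The following is a minimal sketch: the size caps come from the requirements, while the content-type allowlists, the exception class, and the function name are assumptions made for illustration. The declared size is client-supplied, so it should be rechecked against the stored object after upload.

```python
MB = 1024 * 1024
GB = 1024 * MB

# Size caps from the requirements; the allowlists are illustrative assumptions.
MAX_SIZE_BY_KIND = {"image": 20 * MB, "video": 2 * GB}
ALLOWED_TYPES = {
    "image": {"image/jpeg", "image/png", "image/webp"},
    "video": {"video/mp4", "video/quicktime"},
}

class UploadRejected(ValueError):
    """Raised when an upload request fails pre-flight checks."""

def validate_upload_request(content_type, size_bytes):
    """Return the media kind ('image' or 'video') or raise UploadRejected."""
    kind = content_type.split("/", 1)[0]
    if kind not in ALLOWED_TYPES or content_type not in ALLOWED_TYPES[kind]:
        raise UploadRejected(f"unsupported content type: {content_type}")
    if size_bytes <= 0 or size_bytes > MAX_SIZE_BY_KIND[kind]:
        raise UploadRejected(f"size {size_bytes} outside allowed range for {kind}")
    return kind

print(validate_upload_request("image/jpeg", 4_200_000))
```

Rejecting early keeps obviously invalid requests from ever creating a PENDING record or a signed URL.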
Async Processing Pipeline
After the client confirms the upload (step 4), the API puts a job on a queue for asynchronous processing. Each upload is tracked by a Media record:
Media(media_id UUID, user_id UUID, original_filename VARCHAR,
      content_type VARCHAR, size_bytes BIGINT,
      status ENUM(PENDING, PROCESSING, READY, FAILED),
      raw_key VARCHAR,        -- S3 key of original file
      processed_keys JSONB,   -- {'thumbnail': 'media/thumbnail/...', 'medium': '...'}
      error_message TEXT,
      created_at, processed_at)
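The status column implies a small state machine. Below is a minimal sketch of the allowed transitions and of the completion handler from step 4, using a dict and a list as stand-ins for the database and the job queue. Those stand-ins and the function names are assumptions. The sketch marks the record PROCESSING when the job is enqueued; marking it when the worker picks up the job is an equally valid choice.

```python
ALLOWED_TRANSITIONS = {
    "PENDING": {"PROCESSING", "FAILED"},
    "PROCESSING": {"READY", "FAILED"},
    "READY": set(),
    "FAILED": set(),
}

def transition(record, new_status):
    """Move a media record to new_status, rejecting illegal jumps."""
    current = record["status"]
    if new_status not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new_status}")
    record["status"] = new_status

def complete_upload(media_id, records, queue):
    """Handle POST /api/media/{media_id}/complete.

    Enqueues at most one job per upload: a repeated call for a record
    that has already left PENDING is a no-op.
    """
    record = records[media_id]
    if record["status"] != "PENDING":
        return record["status"]
    transition(record, "PROCESSING")
    queue.append({"media_id": media_id})
    return record["status"]

records = {"m1": {"status": "PENDING"}}
queue = []
complete_upload("m1", records, queue)
complete_upload("m1", records, queue)  # duplicate notification is ignored
print(records, queue)
```

Making the handler idempotent matters because clients retry: a second completion call should not schedule a second processing run.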
Processing worker (triggered by SQS/RabbitMQ message):
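import magic  # python-magic

def process_media(media_id):
    media = db.get(media_id)
    # boto3 returns a dict; the bytes are in the streaming 'Body'.
    # Reading everything into memory is fine for images; 2GB videos
    # should be streamed to local disk instead.
    obj = s3.get_object(Bucket=RAW_BUCKET, Key=media.raw_key)
    raw_file = obj['Body'].read()

    # 1. Validate MIME type (read magic bytes, not file extension)
    actual_type = magic.from_buffer(raw_file[:2048], mime=True)
    if actual_type != media.content_type:
        mark_failed(media_id, 'MIME mismatch')
        return

    # 2. Malware scan (clamav stands in for a ClamAV client wrapper)
    result = clamav.scan(raw_file)
    if result.found:
        mark_failed(media_id, 'Malware detected')
        return

    # 3. Dispatch by media type
    if media.content_type.startswith('image/'):
        process_image(media, raw_file)
    elif media.content_type.startswith('video/'):
        process_video(media, raw_file)
    else:
        mark_failed(media_id, 'Unsupported media type')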
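import io
from PIL import Image

def process_image(media, data):
    img = Image.open(io.BytesIO(data))
    if img.mode not in ('RGB', 'L'):
        img = img.convert('RGB')  # JPEG has no alpha channel or palette
    processed = {}
    for name, size in [('thumbnail', (150, 150)), ('medium', (800, 800)), ('large', (2000, 2000))]:
        resized = img.copy()
        resized.thumbnail(size, Image.LANCZOS)  # shrinks in place, keeps aspect ratio
        buf = io.BytesIO()
        resized.save(buf, format='JPEG')  # encode as JPEG bytes
        key = f'media/{name}/{media.id}.jpg'
        s3.put_object(Bucket=PROCESSED_BUCKET, Key=key, Body=buf.getvalue(),
                      ContentType='image/jpeg',
                      CacheControl='public, max-age=31536000, immutable')
        processed[name] = key
    db.update(media.id, status='READY', processed_keys=processed)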
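process_video is not shown above. One possible shape for it, assuming the ffmpeg CLI is installed on the worker, is to build two commands: one that transcodes to H.264/AAC and one that grabs a frame at the 1-second mark. The helper names and file paths here are hypothetical, and the sketch only builds the argument lists; running them (for example with subprocess.run(cmd, check=True)) is left to the worker.

```python
def transcode_command(src, dst):
    """ffmpeg arguments to transcode src into an H.264/AAC MP4 at dst."""
    return [
        "ffmpeg", "-y",             # overwrite the output if it exists
        "-i", src,
        "-c:v", "libx264",          # H.264 video
        "-c:a", "aac",              # AAC audio
        "-movflags", "+faststart",  # move the index up front for streaming
        dst,
    ]

def thumbnail_command(src, dst, at_seconds=1):
    """ffmpeg arguments to write a single frame taken at at_seconds."""
    return [
        "ffmpeg", "-y",
        "-ss", str(at_seconds),     # seek before decoding the input
        "-i", src,
        "-frames:v", "1",           # emit exactly one video frame
        dst,
    ]

print(" ".join(transcode_command("raw.mov", "out.mp4")))
print(" ".join(thumbnail_command("raw.mov", "thumb.jpg")))
```

Passing arguments as a list rather than a shell string avoids quoting problems with user-supplied filenames. A clip shorter than one second may yield no frame at that offset, so a real worker would need a fallback.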
CDN Serving
Processed media in S3 is served via CloudFront. The S3 bucket stays private; CloudFront reads from it using Origin Access Control (OAC), which supersedes the legacy Origin Access Identity (OAI). Public URLs:
https://media.example.com/thumbnail/{media_id}.jpg
→ CloudFront → S3: media/thumbnail/{media_id}.jpg
Cache-Control: public, max-age=31536000, immutable
# Content never changes (media_id is UUID); safe to cache forever
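Once a media item is READY, the API returns a CDN URL for each size. processed_keys holds the S3 keys written by the worker, and the API maps each key to its CDN URL when building the response; storing full URLs instead would tie every row to the current CDN hostname. The raw uploads bucket is never exposed publicly.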
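If processed_keys holds S3 keys, as process_image writes them, the API can derive the public URLs at read time. The following is a minimal sketch under that assumption: the hostname and key prefix come from the example mapping above, while the function names and response shape are hypothetical.

```python
CDN_BASE_URL = "https://media.example.com"  # host from the example mapping
KEY_PREFIX = "media/"                       # prefix used by the processed keys

def cdn_url_for_key(key):
    """Map a processed S3 key to its public CDN URL."""
    if not key.startswith(KEY_PREFIX):
        raise ValueError(f"not a processed media key: {key}")
    return f"{CDN_BASE_URL}/{key[len(KEY_PREFIX):]}"

def media_response(media_id, processed_keys):
    """Build the API payload for a READY media item."""
    return {
        "media_id": media_id,
        "urls": {name: cdn_url_for_key(key) for name, key in processed_keys.items()},
    }

print(media_response("abc", {"thumbnail": "media/thumbnail/abc.jpg"}))
```

Refusing keys outside the processed prefix gives a cheap guard against ever handing out a URL for a raw, unvalidated object.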
Key Design Decisions
- Presigned S3 URLs: the client uploads directly to S3, so file bytes never pass through application servers
- Magic byte MIME validation: a file extension is trivially spoofed; the leading bytes reflect the actual format
- Async processing queue: the upload request returns immediately; resizing and transcoding happen in the background
- Separate raw and processed buckets: the raw bucket holds unvalidated content and is never served; the processed bucket is readable only through the CDN
- Immutable CDN cache headers: each media_id is a UUID and its content never changes, so long-lived caching is safe
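To make the magic-byte point concrete, here is a minimal sketch of what such a check does, covering only three image formats. In practice a library such as python-magic (libmagic) does this against a far larger signature database. The table and function name are illustrative assumptions. A matching signature does not prove a file is safe, which is why the pipeline also runs a malware scan.

```python
# Leading-byte signatures for a few common image formats (not exhaustive).
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]

def sniff_mime(head):
    """Guess a MIME type from the first bytes of a file, or return None."""
    for signature, mime in SIGNATURES:
        if head.startswith(signature):
            return mime
    return None

def matches_declared_type(head, declared_type):
    """True only if the sniffed type equals what the client declared."""
    return sniff_mime(head) == declared_type

# A Windows executable renamed to photo.jpg still starts with b"MZ".
print(matches_declared_type(b"MZ\x90\x00\x03", "image/jpeg"))
print(matches_declared_type(b"\xff\xd8\xff\xe0\x00\x10JFIF", "image/jpeg"))
```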