What Is a Media Upload Pipeline?
A media upload pipeline accepts raw files from clients — images, videos, audio, and documents — and transforms them into safe, optimized, CDN-published assets. The design must handle large file sizes reliably, enforce security scanning before any file becomes publicly accessible, orchestrate potentially long-running transcoding jobs, and publish final artifacts without manual intervention. Every stage must be resumable and auditable.
Requirements
Functional
- Accept files up to 10 GB using resumable chunked upload so interrupted uploads can continue from the last committed chunk.
- Run asynchronous virus and malware scanning on every uploaded file before it is promoted to any user-facing tier.
- Dispatch transcoding jobs (video resolution variants, image thumbnails, audio normalization) and track their status.
- Publish processed assets to a CDN origin and update the asset record with public URLs.
- Expose upload progress and job status to clients via polling or webhook callbacks.
Non-Functional
- Upload throughput: sustain 500 MB/s aggregate ingress per region.
- Virus scan completion within 60 seconds for files up to 500 MB.
- Transcoding jobs retried automatically on worker failure with at-least-once delivery.
Data Model
- Upload: upload_id, owner_id, filename, mime_type, total_size_bytes, chunk_size_bytes, total_chunks, status (INITIATED, IN_PROGRESS, ASSEMBLED, SCAN_QUEUED, SCAN_CLEAN, SCAN_INFECTED, PROCESSING, PUBLISHED, FAILED), created_at, updated_at.
- Chunk: chunk_id, upload_id, chunk_index, size_bytes, checksum_sha256, storage_key, received_at.
- ScanResult: scan_id, upload_id, scanner_vendor, verdict (CLEAN, INFECTED, ERROR), threat_name (nullable), scanned_at.
- TranscodeJob: job_id, upload_id, profile (VIDEO_720P, VIDEO_1080P, THUMBNAIL, AUDIO_NORMALIZED), status (PENDING, RUNNING, DONE, FAILED), worker_id, started_at, finished_at, output_storage_key.
- Asset: asset_id, upload_id, cdn_url, storage_key, content_type, size_bytes, published_at.
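The entities above can be sketched as relational DDL. This is an illustrative schema only: the TEXT/INTEGER column types, the UNIQUE constraints, and the use of SQLite are assumptions for the sketch, not a production design.

```python
import sqlite3

# Hypothetical DDL for three of the core tables. Column names follow the
# data model above; types and constraints are illustrative assumptions.
DDL = """
CREATE TABLE uploads (
    upload_id        TEXT PRIMARY KEY,
    owner_id         TEXT NOT NULL,
    filename         TEXT NOT NULL,
    mime_type        TEXT NOT NULL,
    total_size_bytes INTEGER NOT NULL,
    chunk_size_bytes INTEGER NOT NULL,
    total_chunks     INTEGER NOT NULL,
    status           TEXT NOT NULL DEFAULT 'INITIATED',
    created_at       TEXT NOT NULL,
    updated_at       TEXT NOT NULL
);
CREATE TABLE chunks (
    chunk_id        TEXT PRIMARY KEY,
    upload_id       TEXT NOT NULL REFERENCES uploads(upload_id),
    chunk_index     INTEGER NOT NULL,
    size_bytes      INTEGER NOT NULL,
    checksum_sha256 TEXT NOT NULL,
    storage_key     TEXT NOT NULL,
    received_at     TEXT NOT NULL,
    UNIQUE (upload_id, chunk_index)       -- one row per chunk position
);
CREATE TABLE transcode_jobs (
    job_id             TEXT PRIMARY KEY,
    upload_id          TEXT NOT NULL REFERENCES uploads(upload_id),
    profile            TEXT NOT NULL,
    status             TEXT NOT NULL DEFAULT 'PENDING',
    worker_id          TEXT,
    started_at         TEXT,
    finished_at        TEXT,
    output_storage_key TEXT,
    UNIQUE (upload_id, profile)           -- one job per profile per upload
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = {r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
```

The `UNIQUE (upload_id, chunk_index)` constraint makes chunk re-uploads after a retry an upsert rather than a duplicate, which matters for the resumable protocol below.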
Core Algorithms
Resumable Chunked Upload
The client calls POST /uploads to receive an upload_id and a list of presigned PUT URLs, one per chunk. Each presigned URL targets a unique storage key under a quarantine prefix. The client uploads chunks in parallel (up to four concurrent) and records which chunk_index values have been acknowledged. On network failure the client calls GET /uploads/{id}/chunks to retrieve the list of received chunk indexes and resumes from the first missing one. Once all chunks are confirmed, the client calls POST /uploads/{id}/assemble. The assembly worker concatenates chunks in index order using object storage multipart copy operations, writing the result to a staging prefix, then marks the upload ASSEMBLED.
Virus Scan Orchestration
On ASSEMBLED, a Kafka event triggers the scan dispatcher, which enqueues a scan task and updates status to SCAN_QUEUED. Scan workers pull tasks, stream the assembled file from staging storage to the scanner SDK (ClamAV or a commercial API), and write a ScanResult row. On CLEAN the dispatcher publishes a SCAN_PASSED event and updates status. On INFECTED the file is moved to an isolated quarantine bucket, the owner is notified, and the upload is permanently set to SCAN_INFECTED with no further processing. On scanner ERROR the task is retried up to three times with exponential backoff before entering a manual review queue.
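The retry-with-backoff behavior of the scan worker can be sketched like this. `run_scanner` stands in for the ClamAV or vendor SDK call and is a hypothetical callable that returns a verdict string or raises on scanner error; the three-attempt limit and exponential backoff follow the text, and sleeping is stubbed out so the sketch runs instantly.

```python
def scan_with_retries(run_scanner, max_attempts=3, base_delay_s=2.0,
                      sleep=lambda s: None):
    """Return (verdict, attempts). A verdict of 'ERROR' after exhausting
    retries routes the task to the manual review queue."""
    for attempt in range(1, max_attempts + 1):
        try:
            verdict = run_scanner()
        except RuntimeError:
            if attempt < max_attempts:
                sleep(base_delay_s * 2 ** (attempt - 1))  # 2s, 4s, ...
            continue
        return verdict, attempt
    return "ERROR", max_attempts

# Example: the scanner times out twice, then returns CLEAN.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("scanner timeout")
    return "CLEAN"

verdict, attempts = scan_with_retries(flaky)
```

In the real pipeline the caller would map `CLEAN` to a SCAN_PASSED event, `INFECTED` to the quarantine-and-notify path, and `ERROR` to the manual review queue.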
Transcoding Job Dispatch
After a clean scan result, the pipeline reads the mime_type to determine which transcoding profiles apply, creates one TranscodeJob row per profile, and publishes each to a profile-specific Kafka partition so heavy jobs (4K video) do not starve fast jobs (thumbnail). Workers use optimistic locking (UPDATE … WHERE status=PENDING AND worker_id IS NULL LIMIT 1) to claim a job. On completion the output_storage_key is recorded and the CDN origin is notified via a cache purge API. When all jobs for an upload finish, the upload status moves to PUBLISHED and Asset rows are created.
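The optimistic job claim can be sketched with SQLite standing in for the real database. Because stock SQLite lacks `LIMIT` on `UPDATE`, the single-statement claim from the text is split into a SELECT plus a guarded UPDATE; the guard in the WHERE clause preserves race safety, since zero rows updated means another worker won and this one simply retries.

```python
import sqlite3

def claim_job(conn, worker_id):
    """Claim one PENDING job; return its job_id, or None if none claimable."""
    row = conn.execute(
        "SELECT job_id FROM transcode_jobs "
        "WHERE status='PENDING' AND worker_id IS NULL LIMIT 1").fetchone()
    if row is None:
        return None
    cur = conn.execute(
        "UPDATE transcode_jobs SET status='RUNNING', worker_id=? "
        "WHERE job_id=? AND status='PENDING' AND worker_id IS NULL",
        (worker_id, row[0]))
    # rowcount == 0 means another worker claimed it between our SELECT
    # and UPDATE; the caller retries the loop.
    return row[0] if cur.rowcount == 1 else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transcode_jobs "
             "(job_id TEXT PRIMARY KEY, status TEXT, worker_id TEXT)")
conn.execute("INSERT INTO transcode_jobs VALUES ('job-720p', 'PENDING', NULL)")
first = claim_job(conn, "worker-A")   # claims the job
second = claim_job(conn, "worker-B")  # nothing left to claim
```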
Scalability and Reliability
- Presigned URLs: Clients write chunks directly to object storage, bypassing application servers entirely. This eliminates the upload bandwidth bottleneck at the API tier.
- Job idempotency: Each TranscodeJob has a deterministic job_id derived from upload_id and profile. Duplicate dispatch events are ignored by checking for existing rows before insert.
- Dead-letter queue: Scan and transcode workers publish failed messages to a DLQ after max retries. An operations dashboard exposes DLQ depth and allows manual requeue.
- Chunk cleanup: A nightly job deletes individual chunk objects from the quarantine prefix once the assembled file exists, reducing storage costs for large uploads.
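The idempotency bullet above can be sketched concretely: deriving the job_id as a UUIDv5 of (upload_id, profile) means a replayed dispatch event produces the same key, and an insert-or-ignore makes the duplicate a no-op. SQLite syntax here; the real store may use `ON CONFLICT DO NOTHING` or an existence check.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transcode_jobs "
             "(job_id TEXT PRIMARY KEY, upload_id TEXT, profile TEXT)")

def dispatch(upload_id: str, profile: str) -> bool:
    """Insert the job if new; return False when the event is a replay."""
    # Deterministic id: the same (upload_id, profile) always hashes to
    # the same UUID, so duplicates collide on the primary key.
    job_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{upload_id}/{profile}"))
    cur = conn.execute(
        "INSERT OR IGNORE INTO transcode_jobs VALUES (?, ?, ?)",
        (job_id, upload_id, profile))
    return cur.rowcount == 1

first = dispatch("upload-42", "THUMBNAIL")   # new row inserted
replay = dispatch("upload-42", "THUMBNAIL")  # duplicate event ignored
```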
API Design
- POST /uploads — initiate upload; returns upload_id and presigned chunk URLs.
- GET /uploads/{id}/chunks — return the list of confirmed chunk indexes for resume.
- POST /uploads/{id}/assemble — trigger assembly after all chunks are uploaded.
- GET /uploads/{id} — poll upload and job status with progress detail.
- POST /uploads/{id}/webhook — register a callback URL for status transition events.
- GET /assets/{id} — fetch the published asset with CDN URLs per profile.
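A plausible shape for the GET /uploads/{id} polling response is sketched below. The field names follow the data model above, but the exact payload layout is an assumption, not a documented contract.

```python
import json

# Hypothetical polling response for an upload mid-transcode.
response = {
    "upload_id": "upload-42",
    "status": "PROCESSING",
    "progress": {"chunks_received": 13, "total_chunks": 13},
    "jobs": [
        {"profile": "VIDEO_720P", "status": "DONE"},
        {"profile": "THUMBNAIL", "status": "RUNNING"},
    ],
}

def overall_progress(resp: dict) -> float:
    """Fraction of transcode jobs finished, for a client progress bar."""
    jobs = resp["jobs"]
    done = sum(1 for j in jobs if j["status"] == "DONE")
    return done / len(jobs)

payload = json.dumps(response)  # what the API tier would serialize
```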
Key Design Decisions
Using presigned URLs for chunk ingestion offloads network bandwidth from application servers to object storage, which scales independently. Storing each chunk as a separate object rather than appending to a single object allows parallel upload verification without locking. Separating the scan result from the upload status row means the scan audit trail survives even if the upload is later deleted for compliance purposes.