Chunked File Upload System — Low-Level Design
A chunked file upload system breaks large files into smaller pieces, uploads each piece independently, and reassembles them. It supports resumable uploads, parallel chunk transfers, and progress tracking. This design question comes up at Google, Dropbox, and any platform that handles large file uploads.
Core Data Model
UploadSession
id TEXT PK -- UUID, returned to client on session create
user_id BIGINT NOT NULL
filename TEXT NOT NULL
total_size BIGINT NOT NULL -- bytes
chunk_size INT NOT NULL -- bytes per chunk (e.g., 5MB)
total_chunks INT NOT NULL
mime_type TEXT
s3_upload_id TEXT -- S3 multipart upload ID
s3_key TEXT -- destination key in S3
status TEXT DEFAULT 'in_progress' -- in_progress, complete, failed
created_at TIMESTAMPTZ
completed_at TIMESTAMPTZ
UploadChunk
session_id TEXT FK NOT NULL
chunk_number INT NOT NULL -- 0-indexed
size_bytes INT NOT NULL
checksum_md5 TEXT NOT NULL -- client-computed MD5 of chunk bytes
s3_etag TEXT -- returned by S3 on part upload
uploaded_at TIMESTAMPTZ
PRIMARY KEY (session_id, chunk_number)
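The two tables above can be sketched as DDL. A minimal SQLite version for local experimentation (SQLite approximates the types — no TIMESTAMPTZ or BIGINT distinctions — and a production deployment would target Postgres as the ON CONFLICT upsert later in this design assumes):

```python
import sqlite3

# Hypothetical DDL mirroring the data model above, adapted to SQLite types.
DDL = """
CREATE TABLE upload_session (
    id            TEXT PRIMARY KEY,           -- UUID
    user_id       INTEGER NOT NULL,
    filename      TEXT NOT NULL,
    total_size    INTEGER NOT NULL,           -- bytes
    chunk_size    INTEGER NOT NULL,           -- bytes per chunk
    total_chunks  INTEGER NOT NULL,
    mime_type     TEXT,
    s3_upload_id  TEXT,                       -- S3 multipart upload ID
    s3_key        TEXT,                       -- destination key in S3
    status        TEXT DEFAULT 'in_progress', -- in_progress, complete, failed
    created_at    TEXT,
    completed_at  TEXT
);
CREATE TABLE upload_chunk (
    session_id    TEXT NOT NULL REFERENCES upload_session(id),
    chunk_number  INTEGER NOT NULL,           -- 0-indexed
    size_bytes    INTEGER NOT NULL,
    checksum_md5  TEXT NOT NULL,              -- client-computed MD5 of chunk bytes
    s3_etag       TEXT,                       -- returned by S3 on part upload
    uploaded_at   TEXT,
    PRIMARY KEY (session_id, chunk_number)
);
"""

conn = sqlite3.connect(':memory:')
conn.executescript(DDL)
```

The composite primary key on (session_id, chunk_number) is what makes chunk re-uploads idempotent in the upsert shown in Phase 2.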
Phase 1: Initiate Upload Session
POST /uploads/sessions
{
  "filename": "video.mp4",
  "total_size": 524288000,  // 500MB
  "chunk_size": 5242880,    // 5MB (S3 minimum for multipart)
  "mime_type": "video/mp4"
}
from math import ceil

def create_upload_session(user_id, filename, total_size, chunk_size, mime_type):
    session_id = generate_uuid()
    s3_key = f'uploads/{user_id}/{session_id}/{filename}'
    total_chunks = ceil(total_size / chunk_size)
    # Initiate S3 multipart upload
    response = s3.create_multipart_upload(
        Bucket='my-bucket',
        Key=s3_key,
        ContentType=mime_type,
        Metadata={'user_id': str(user_id), 'session_id': session_id}
    )
    db.insert(UploadSession, {
        'id': session_id, 'user_id': user_id, 'filename': filename,
        'total_size': total_size, 'chunk_size': chunk_size,
        'total_chunks': total_chunks, 's3_upload_id': response['UploadId'],
        's3_key': s3_key
    })
    return {'session_id': session_id, 'total_chunks': total_chunks}
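One detail the initiate step leaves to the client is picking chunk_size. A sketch that respects both S3 constraints — 5MB minimum part size and a 10,000-part maximum — where choose_chunk_size is a hypothetical helper, not part of the API above:

```python
from math import ceil

S3_MIN_PART = 5 * 1024 * 1024   # 5MB: S3 minimum for all parts except the last
S3_MAX_PARTS = 10_000           # S3 multipart hard limit on part count

def choose_chunk_size(total_size: int) -> int:
    # Start at the S3 minimum; grow only when 10,000 parts of 5MB
    # would not be enough to cover the file (i.e., files over ~48GB).
    return max(S3_MIN_PART, ceil(total_size / S3_MAX_PARTS))
```

For the 500MB example request above this returns 5MB (100 parts); for a 100GB file it returns a larger chunk so the part count stays under 10,000.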
Phase 2: Upload Individual Chunks
PUT /uploads/sessions/{session_id}/chunks/{chunk_number}
Content-Type: application/octet-stream
X-Checksum-MD5: {base64_md5_of_chunk}
[chunk bytes in body]
import base64
from hashlib import md5

def upload_chunk(session_id, chunk_number, chunk_bytes, client_checksum):
    session = db.get(UploadSession, session_id)
    if session.status != 'in_progress':
        raise InvalidState()
    # Verify checksum before touching S3
    server_checksum = base64.b64encode(md5(chunk_bytes).digest()).decode()
    if server_checksum != client_checksum:
        raise ChecksumMismatch(f'Chunk {chunk_number} checksum mismatch')
    # S3 multipart part numbers are 1-indexed
    part_number = chunk_number + 1
    response = s3.upload_part(
        Bucket='my-bucket',
        Key=session.s3_key,
        UploadId=session.s3_upload_id,
        PartNumber=part_number,
        Body=chunk_bytes,
    )
    db.execute("""
        INSERT INTO UploadChunk
          (session_id, chunk_number, size_bytes, checksum_md5, s3_etag, uploaded_at)
        VALUES (%(sid)s, %(cn)s, %(size)s, %(cs)s, %(etag)s, NOW())
        ON CONFLICT (session_id, chunk_number) DO UPDATE
          SET s3_etag = EXCLUDED.s3_etag, uploaded_at = NOW()
    """, {'sid': session_id, 'cn': chunk_number, 'size': len(chunk_bytes),
          'cs': client_checksum, 'etag': response['ETag']})
    return {'chunk_number': chunk_number, 'uploaded': True}
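On the client side, the X-Checksum-MD5 header is the base64-encoded MD5 digest of the raw chunk bytes — the same encoding the server recomputes above. A minimal sketch (chunk_checksum is an illustrative helper name):

```python
import base64
from hashlib import md5

def chunk_checksum(chunk_bytes: bytes) -> str:
    # base64 of the 16-byte MD5 digest: always 24 characters ending in '=='
    return base64.b64encode(md5(chunk_bytes).digest()).decode()

header_value = chunk_checksum(b'example chunk payload')
```

This matches the Content-MD5 convention HTTP already uses, so the server-side comparison is a straightforward string equality.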
Phase 3: Complete Upload
POST /uploads/sessions/{session_id}/complete
def complete_upload(session_id):
    session = db.get(UploadSession, session_id)
    chunks = db.query("""
        SELECT chunk_number, s3_etag FROM UploadChunk
        WHERE session_id=%(sid)s ORDER BY chunk_number
    """, {'sid': session_id})
    if len(chunks) != session.total_chunks:
        missing = set(range(session.total_chunks)) - {c.chunk_number for c in chunks}
        raise IncompleteUpload(f'Missing chunks: {missing}')
    # Complete S3 multipart upload
    parts = [{'PartNumber': c.chunk_number + 1, 'ETag': c.s3_etag} for c in chunks]
    s3.complete_multipart_upload(
        Bucket='my-bucket',
        Key=session.s3_key,
        UploadId=session.s3_upload_id,
        MultipartUpload={'Parts': parts}
    )
    db.execute("""
        UPDATE UploadSession SET status='complete', completed_at=NOW()
        WHERE id=%(id)s
    """, {'id': session_id})
    # Trigger post-processing (virus scan, transcoding, thumbnail generation)
    queue_post_processing(session_id, session.s3_key, session.mime_type)
    return {'status': 'complete', 's3_key': session.s3_key}
Resuming an Interrupted Upload
GET /uploads/sessions/{session_id}/status
def get_upload_status(session_id):
    chunks = db.query("""
        SELECT chunk_number FROM UploadChunk WHERE session_id=%(sid)s
    """, {'sid': session_id})
    uploaded = {c.chunk_number for c in chunks}
    session = db.get(UploadSession, session_id)
    missing = [i for i in range(session.total_chunks) if i not in uploaded]
    return {
        'session_id': session_id,
        'uploaded_chunks': list(uploaded),
        'missing_chunks': missing,
        'progress_pct': len(uploaded) / session.total_chunks * 100
    }

# Client resumes by uploading only the missing_chunks
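The resume flow and parallel chunk uploads combine naturally on the client: fetch the status, then re-upload only the missing chunks with a small worker pool. A sketch with the transport injected — read_chunk and upload_fn are illustrative stand-ins for a local file read and the HTTP PUT above:

```python
from concurrent.futures import ThreadPoolExecutor

def resume_upload(status, read_chunk, upload_fn, parallelism=3):
    """Upload every chunk in status['missing_chunks'], a few at a time.

    read_chunk(n)      -> bytes of chunk n (e.g., a seek+read on the local file)
    upload_fn(n, data) -> performs PUT /uploads/sessions/{id}/chunks/{n}
    """
    missing = status['missing_chunks']
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(upload_fn, n, read_chunk(n)) for n in missing]
        for f in futures:
            f.result()  # raises if that chunk's upload failed
    return sorted(missing)
```

Because already-received chunks are a server-side no-op (the ON CONFLICT upsert), a failed run can simply re-fetch status and call this again.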
Key Interview Points
- Minimum S3 chunk size is 5MB: S3 multipart upload requires each part (except the last) to be at least 5MB. Use 5MB or 10MB as the chunk size.
- MD5 checksum per chunk: Verifies integrity end-to-end. A corrupted chunk is detected at upload time and can be re-uploaded before the session completes, rather than discovering corruption after assembly.
- Parallel chunk uploads: The client can upload multiple chunks concurrently (e.g., 3 in parallel). The server handles them independently. This dramatically reduces total upload time for large files on fast connections.
- Abort incomplete sessions: Run a background job to abort S3 multipart uploads for sessions not completed within 24 hours. Incomplete multipart uploads incur S3 storage costs even though no final object exists.
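The last point can be sketched as a daily job. The db and s3 clients are injected so the logic is testable; the query follows the schema above, while the 'expired' status is an assumption beyond the three values listed in UploadSession:

```python
def abort_stale_sessions(db, s3, max_age_hours=24):
    """Abort S3 multipart uploads for sessions stuck in_progress too long."""
    stale = db.query(
        """SELECT id, s3_key, s3_upload_id FROM UploadSession
           WHERE status = 'in_progress'
             AND created_at < NOW() - make_interval(hours => %(h)s)""",
        {'h': max_age_hours})
    for session in stale:
        # Frees the partial parts that S3 is still billing for.
        s3.abort_multipart_upload(
            Bucket='my-bucket',
            Key=session.s3_key,
            UploadId=session.s3_upload_id)
        db.execute("UPDATE UploadSession SET status='expired' WHERE id=%(id)s",
                   {'id': session.id})
    return len(stale)
```

Pair this with an S3 lifecycle rule that aborts incomplete multipart uploads after N days as a safety net in case the job itself fails.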
FAQ

Why is a minimum 5MB chunk size required for S3 multipart uploads?
AWS S3 enforces a minimum part size of 5MB for every part except the last part of a multipart upload; the last part can be any size, even 1 byte. Multipart is optimized for large files, and very small parts would create excessive part metadata. Note the two limits are independent: 1MB chunks for a 100MB file would stay well under the 10,000-part cap but still violate the 5MB minimum. Choosing the chunk size as max(5MB, file_size / 10000) handles all file sizes within S3's limits.

How do chunk checksums prevent silent data corruption during upload?
Network transmission and disk I/O can silently corrupt bytes (bit flips, truncated writes). Without verification, a corrupt chunk is assembled into the final file and the corruption is discovered only when the user tries to use it. The client computes the MD5 of the chunk bytes before upload and sends it in the X-Checksum-MD5 header; the server recomputes the MD5 of the received bytes and compares. On a mismatch, the server rejects the chunk with a 400 and the client re-uploads it, so corruption is caught before assembly. S3 also returns an ETag (the MD5 of the part), which can be used to verify the part arrived in S3 intact.

How does a client resume a partially completed chunked upload?
The database tracks which chunks have arrived (UploadChunk rows with uploaded_at set). The client calls GET /uploads/sessions/{id}/status, receives the list of missing_chunks, and re-uploads only those chunk numbers under the same session_id. Server-side, ON CONFLICT (session_id, chunk_number) DO UPDATE makes re-uploading an already-received chunk a safe no-op, so resume is fully idempotent.

How do you complete an S3 multipart upload?
Collect the ETags S3 returned for each part (stored in UploadChunk.s3_etag) and call complete_multipart_upload with the list of {PartNumber, ETag} objects sorted by PartNumber. S3 assembles the parts in order into the final object and returns an error if any listed part is missing. After completion the multipart upload ID is no longer valid, so no more parts can be added. If the completion call fails, the parts still exist in S3 and the call can be retried with the same part list.

How do you handle abandoned upload sessions?
Run a daily background job: find UploadSession records where status is in_progress and created_at is older than 24 hours, call s3.abort_multipart_upload(Bucket, Key, UploadId) for each, then mark the session expired. This cleans up the S3 partial parts, which incur storage costs even though no final object exists. S3 lifecycle rules can also automatically abort incomplete multipart uploads after N days — set one as a safety net. Without cleanup, abandoned uploads accumulate indefinitely: 200MB of orphaned parts from a half-uploaded 1GB file costs roughly $0.023 * 0.2 ≈ $0.005 per month in perpetuity.