System Design Interview: Design a Cloud File Storage System (Dropbox/Google Drive)

What Is a Cloud File Storage System?

A cloud file storage system lets users upload, sync, and share files across devices. Dropbox, Google Drive, and OneDrive collectively serve hundreds of millions of users with billions of files. The core challenges are: efficient upload/download (chunking large files), sync across devices (detecting changes, merging conflicts), deduplication (storing each unique file once), and massive scale (petabytes of data).

    System Requirements

    Functional

    • Upload and download files of any size
    • Sync files across multiple devices automatically
    • Share files/folders with other users
    • View revision history and restore previous versions
    • Offline access: read cached files when disconnected

    Non-Functional

    • Durability: 99.999999999% (11 nines) — zero data loss
    • Availability: 99.99% — files accessible globally
    • Upload performance: large file uploads resumable after network interruption
    • Scale: petabytes of data, billions of files

    Core Data Model

    users: id, email, storage_quota, used_bytes
    files: id, owner_id, name, parent_folder_id, size, content_hash, created_at
    file_versions: id, file_id, version_number, chunk_hashes[], created_at
    chunks: content_hash (PK), size, s3_key
    folder_shares: folder_id, shared_with_user_id, permission (read/write)
    device_sync_state: device_id, file_id, synced_version
    

    File Chunking — The Most Important Design Decision

    Split large files into chunks (4–8 MB each). Store each chunk separately in object storage (S3, GCS). Why?

    • Resumable uploads: if upload fails at chunk 5 of 100, restart from chunk 5, not the beginning.
    • Deduplication: if chunk content hash already exists in the chunk store, don’t upload it again. A 100 GB video where only 5 MB changed: upload only 1–2 changed chunks instead of the entire file.
    • Delta sync: when a file changes, identify which chunks changed (by comparing content hashes) and upload only those chunks.
    • Parallel upload: upload multiple chunks concurrently, maximizing bandwidth utilization.
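The chunking idea above can be sketched in a few lines of Python. This is an illustration, not any real client's code; `chunk_file` and the toy 4-byte chunk size are invented for the demo (real chunks are 4–8 MB, as stated above):

```python
import hashlib
import io

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, as in the flow below

def chunk_file(stream, chunk_size=CHUNK_SIZE):
    """Split a byte stream into fixed-size chunks; return a list of
    (sha256_hex, chunk_bytes) pairs in file order."""
    manifest = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        manifest.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return manifest

# Tiny chunk size for illustration: 10 bytes split into 4 + 4 + 2 bytes.
manifest = chunk_file(io.BytesIO(b"aaaaaaaaaa"), chunk_size=4)
print(len(manifest))                     # 3
print(manifest[0][0] == manifest[1][0])  # True: identical chunks, one hash
```

Note that the first two chunks are byte-identical and therefore hash to the same value; that collision-on-purpose is exactly what deduplication exploits.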

    Chunk Upload Flow

    1. Client splits file into 4 MB chunks
    2. For each chunk, compute SHA-256 hash
    3. Client sends chunk hashes to the metadata API: “which of these chunks do you already have?”
    4. API returns list of missing chunks (the ones not yet uploaded — deduplication in action)
    5. Client gets pre-signed S3 URLs for each missing chunk
    6. Client uploads missing chunks directly to S3 (bypasses app servers — not a bottleneck)
    7. Client notifies metadata API: “all chunks uploaded, here are the chunk hashes in order”
    8. API records the new file version: file_versions row + file.content_hash update
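The eight steps above can be condensed into a toy end-to-end sketch. `ChunkStore` is a hypothetical in-memory stand-in for the metadata API plus S3 (no pre-signed URLs, no network), kept only to show the dedupe handshake:

```python
import hashlib

class ChunkStore:
    """Toy server side: a content-addressed chunk store plus version log."""
    def __init__(self):
        self.chunks = {}    # content_hash -> bytes (stand-in for S3)
        self.versions = {}  # file_id -> list of ordered chunk-hash manifests

    def missing(self, hashes):
        # Steps 3-4: "which of these chunks do you already have?"
        return [h for h in hashes if h not in self.chunks]

    def put_chunk(self, content):
        h = hashlib.sha256(content).hexdigest()
        self.chunks[h] = content

    def commit_version(self, file_id, ordered_hashes):
        # Steps 7-8: record the manifest only once every chunk is present.
        assert all(h in self.chunks for h in ordered_hashes), "upload incomplete"
        self.versions.setdefault(file_id, []).append(list(ordered_hashes))

def upload(store, file_id, chunks):
    """Client side: hash locally, upload only what the server lacks."""
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
    need = set(store.missing(hashes))
    for h, c in zip(hashes, chunks):
        if h in need:
            store.put_chunk(c)  # steps 5-6, minus the pre-signed URL hop
    store.commit_version(file_id, hashes)
    return len(need)  # chunks actually transferred

store = ChunkStore()
print(upload(store, "f1", [b"AAAA", b"BBBB"]))  # 2: both chunks are new
print(upload(store, "f2", [b"AAAA", b"CCCC"]))  # 1: "AAAA" deduplicated
```

The second upload transfers one chunk, not two: the shared chunk was already in the store, which is the dedupe handshake of steps 3–4 in action.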

    Metadata Service

    The metadata service is a standard RDBMS (PostgreSQL) tracking file hierarchy, versions, and chunk manifests. It is NOT used for file content — only for metadata. At Dropbox scale (~500M users), shard the metadata database by user_id. Use a separate read replica for listing folders and search.
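Sharding by user_id can be as simple as a stable hash of the user's id. A minimal sketch, with an invented shard count of 16 (real deployments pick the count for growth headroom, often with consistent hashing or a directory service to allow resharding):

```python
import hashlib

NUM_SHARDS = 16  # illustrative only

def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user_id to a shard index. Every file/folder/version row for
    a user lands on one shard, so folder listings stay single-shard."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

print(shard_for_user("alice@example.com"))
```

Hashing the user_id (rather than using it modulo directly) keeps the distribution even regardless of how ids are assigned.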

    Sync Protocol

    After a file change, other devices must sync. Each client device maintains a long-polling or WebSocket connection to a notification service. When a file changes:

    1. Uploader notifies the server (via the upload completion API call)
    2. Server publishes a “file_changed” event to Kafka
    3. Notification service consumes Kafka events, pushes to connected devices via WebSocket
    4. Device receives notification: “file X version Y changed”
    5. Device calls metadata API to get the new chunk list
    6. Device checks which chunks it already has (locally cached)
    7. Device downloads only missing chunks from S3 (CDN-accelerated)
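Steps 5–7 reduce to a set difference: compare the server's manifest for the new version against the chunks already cached on the device. A minimal sketch (the hash strings are placeholders):

```python
def plan_download(remote_manifest, local_chunks):
    """Given the ordered chunk-hash list for the new version and the set
    of hashes already cached locally, return only the hashes to fetch."""
    return [h for h in remote_manifest if h not in local_chunks]

remote = ["h1", "h2", "h3", "h4"]  # new version, per the metadata API
cached = {"h1", "h3"}              # chunks this device already holds
print(plan_download(remote, cached))  # ['h2', 'h4']
```

This is why delta sync is cheap: an edit that touches one chunk of a large file results in a one-chunk download on every other device.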

    Conflict Resolution

    Two devices edit the same file offline. Both upload conflicting versions. Strategy (Dropbox’s approach): both versions are kept. Create a “conflicted copy” named “file (John’s conflicted copy 2024-01-15).” Show both to the user. Simpler than auto-merging, which only works for text files. For Google Docs: use OT (Operational Transformation) to merge edits — but this requires a collaborative editor architecture, not a file system.
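The "conflicted copy" rename is mechanical. A sketch below; the exact wording is an approximation of Dropbox's convention, and `conflicted_copy_name` is an invented helper:

```python
from datetime import date

def conflicted_copy_name(filename, device_owner, day=None):
    """Build a conflicted-copy filename, preserving the extension."""
    day = day or date.today()
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension: rpartition put everything in `ext`
        stem, ext = filename, ""
    suffix = f" ({device_owner}'s conflicted copy {day.isoformat()})"
    return f"{stem}{suffix}.{ext}" if ext else f"{stem}{suffix}"

print(conflicted_copy_name("report.docx", "John", date(2024, 1, 15)))
# report (John's conflicted copy 2024-01-15).docx
```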

    Deduplication at Scale

    Content-addressed storage: chunk S3 key = SHA-256(chunk_content). If two users upload the same file (e.g., the same movie), the chunks are stored once but referenced from both users’ file_versions. The chunks table is a global content store — a chunk with a given hash is uploaded once and referenced indefinitely. This reduces storage costs dramatically — Dropbox estimated 25–30% storage savings from deduplication.
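Content addressing makes dedup a key collision on purpose: deriving the object key from the chunk's bytes means a second upload of the same content maps to an existing key. A toy sketch with a dict standing in for S3:

```python
import hashlib

def chunk_key(content: bytes) -> str:
    """Content-addressed key: identical chunks produce identical keys."""
    return "chunks/" + hashlib.sha256(content).hexdigest()

store = {}  # stand-in for the S3 bucket: key -> bytes

def put_if_absent(content: bytes) -> bool:
    """Store a chunk only if its key is new; True means bytes were written."""
    key = chunk_key(content)
    if key in store:
        return False  # a second user uploading the same chunk: no-op
    store[key] = content
    return True

print(put_if_absent(b"movie-chunk-0"))  # True: first upload stores it
print(put_if_absent(b"movie-chunk-0"))  # False: deduplicated
```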

    Object Storage (S3) Architecture

    • Store chunks in S3 with the content hash as the key
    • Use S3 lifecycle policies: move infrequently accessed chunks to S3 Glacier (cheaper) after 90 days
    • Cross-region replication for durability (S3 already replicates within a region across AZs)
    • CDN (CloudFront) caches popular chunks at edge locations — fast downloads globally
    • Pre-signed URLs: the metadata server issues time-limited (15-minute) S3 pre-signed URLs for upload and download — clients talk directly to S3, not through your servers
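The essence of a pre-signed URL is an HMAC over (method, key, expiry) that the storage service can verify without a database lookup. The sketch below is a deliberately simplified stand-in, not AWS's SigV4 scheme (in practice you would call boto3's `generate_presigned_url` and let S3 do the verification); the host and secret are invented:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-signing-key"  # stands in for AWS credentials

def presign(method, key, expires_in=900, now=None):
    """Issue a time-limited signed URL for one method + object key."""
    now = int(time.time()) if now is None else now
    expires = now + expires_in
    msg = f"{method}\n{key}\n{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"https://bucket.s3.example.com/{key}?Expires={expires}&Signature={sig}"

def verify(method, key, expires, sig, now=None):
    """What the storage side checks: signature match and not expired."""
    now = int(time.time()) if now is None else now
    expected = hmac.new(SECRET, f"{method}\n{key}\n{expires}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and now < expires

print(presign("PUT", "chunks/abc123", now=1_700_000_000))
```

Because the signature covers the method and key, a URL signed for uploading one chunk cannot be replayed to read or overwrite another object, and it dies on its own after the expiry window.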

    Version History and Retention

    Keep all file_versions rows. Retain chunk data as long as any version references it (deduplication keeps the storage cost per version low). Soft-delete files (mark deleted, don’t remove chunks immediately). Allow restore within 30–180 days. Garbage-collect unreferenced chunks once every file version referencing them has been permanently deleted.
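The garbage-collection rule above is a reachability check: a chunk is live if any surviving version's manifest references it. A minimal sketch (hash strings are placeholders):

```python
def live_chunks(version_manifests):
    """Union of chunk hashes referenced by any surviving file version."""
    live = set()
    for manifest in version_manifests:
        live.update(manifest)
    return live

def garbage_chunks(stored_chunks, version_manifests):
    """Chunks safe to delete: stored but referenced by no version.
    Run only after the soft-delete retention window has elapsed."""
    return set(stored_chunks) - live_chunks(version_manifests)

stored = {"h1", "h2", "h3"}
surviving_versions = [["h1"], ["h1", "h3"]]  # h2's versions were purged
print(garbage_chunks(stored, surviving_versions))  # {'h2'}
```

At scale this runs as a periodic batch job (or with per-chunk reference counts maintained incrementally); the set-difference formulation is the correctness spec either way.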

    Interview Tips

    • Chunking is the central idea — explain it first, derive deduplication and resumable uploads from it naturally.
    • Pre-signed S3 URLs are key for scalability: your app servers aren’t in the upload/download path, so they don’t bottleneck on bandwidth.
    • Conflict resolution: be upfront that auto-merge is hard (only text files, and even then it’s complex) — Dropbox’s “conflicted copy” is the pragmatic answer.
    • For the sync protocol, mention WebSockets + Kafka fan-out — this is the same pattern as notifications systems.
