Problem Overview
Design a file storage and sync service like Dropbox or Google Drive. Users can upload files up to 50GB, sync files across devices, share files with collaborators, and access files offline. The system must handle 500 million users and 1 billion files with high availability and eventually consistent sync.
Core Architecture Components
- Block service: splits files into chunks and stores them in object storage
- Metadata service: tracks file names, versions, block references, and sharing permissions
- Sync service: detects local changes and propagates them to other devices
- Notification service: pushes change events to clients via WebSocket or long-poll
File Chunking and Deduplication
Splitting files into fixed-size chunks (4MB is typical) enables three critical optimizations: (1) Resumable uploads: if a 2GB upload fails at 80%, only the remaining chunks must be re-uploaded, not the entire file. (2) Delta sync: when a file changes, only modified chunks are re-uploaded. A 100MB spreadsheet with one cell changed uploads one 4MB chunk, not 100MB. (3) Cross-user deduplication: if two users upload the same file (or the same PDF sits in 100 users' folders), only one copy is stored. Each chunk is identified by its SHA-256 hash; before uploading, the client checks whether the server already has a chunk with that hash. If so, the upload is skipped entirely (a zero-byte upload for identical chunks).
import hashlib
from pathlib import Path

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB chunks

def split_into_chunks(filepath: Path, chunk_size: int = CHUNK_SIZE):
    with filepath.open("rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

def upload_file(filepath: Path):
    # block_service and metadata_service are clients for the services above
    block_ids = []
    for chunk in split_into_chunks(filepath):
        block_id = hashlib.sha256(chunk).hexdigest()
        # Deduplication: skip the upload if the server already has this block
        if not block_service.exists(block_id):
            block_service.upload(block_id, chunk)
        block_ids.append(block_id)
    metadata_service.create_file(
        name=filepath.name,
        block_ids=block_ids,
        size=filepath.stat().st_size,
    )
Metadata Service
The metadata service stores the mapping from file identity to block list, plus version history, permissions, and sync state. Schema: files(id, owner_id, name, parent_folder_id, current_version_id), file_versions(id, file_id, block_ids, size, created_at, device_id), blocks(block_id, s3_path, size, ref_count). Using a relational database for metadata enables strong consistency for file operations (rename, move, share) and efficient permission queries. Block storage (S3) handles the binary data. This separation is the key architectural pattern: metadata is small and relational; binary data is large and blob-like.
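The schema above can be prototyped in a few lines. This sketch uses SQLite purely for illustration; a production deployment would use a replicated, partitioned relational database as described in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blocks (
    block_id  TEXT PRIMARY KEY,      -- SHA-256 hash of the chunk
    s3_path   TEXT NOT NULL,         -- location of the binary data in object storage
    size      INTEGER NOT NULL,
    ref_count INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE files (
    id                 INTEGER PRIMARY KEY,
    owner_id           INTEGER NOT NULL,
    name               TEXT NOT NULL,
    parent_folder_id   INTEGER,
    current_version_id INTEGER
);
CREATE TABLE file_versions (
    id         INTEGER PRIMARY KEY,
    file_id    INTEGER NOT NULL REFERENCES files(id),
    block_ids  TEXT NOT NULL,        -- ordered list of block hashes (JSON array)
    size       INTEGER NOT NULL,
    created_at TEXT NOT NULL,
    device_id  TEXT NOT NULL
);
""")
```

Note how only hashes and paths live here; the chunks themselves never touch the relational store.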
Sync Protocol
The client runs a background process (the sync agent) that: (1) Watches the local file system for changes using OS-level events (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows). (2) On change: computes the delta — which chunks changed. Uploads new/modified chunks. Updates metadata service with the new version. (3) Long-polls or maintains a WebSocket connection to the notification service. When another device uploads a change, the notification service pushes a change event. The sync agent fetches the new metadata, downloads only the changed chunks, and applies the delta to the local file.
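Step (2), computing the chunk delta, reduces to re-hashing the file and comparing against the previous version's block list. A minimal sketch (`compute_delta` is a hypothetical helper, not a real API; it assumes fixed-size chunking, so an insertion that shifts byte offsets would invalidate all later chunks):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB, matching the chunking scheme above

def compute_delta(old_block_ids, new_data, chunk_size=CHUNK_SIZE):
    """Return (new_block_ids, chunks_to_upload) for a changed file."""
    have = set(old_block_ids)  # blocks the server already stores
    new_block_ids = []
    to_upload = []
    for offset in range(0, len(new_data), chunk_size):
        chunk = new_data[offset:offset + chunk_size]
        block_id = hashlib.sha256(chunk).hexdigest()
        new_block_ids.append(block_id)
        if block_id not in have:
            to_upload.append((block_id, chunk))
    return new_block_ids, to_upload
```

Content-defined chunking (a rolling hash that picks chunk boundaries from the data itself) avoids the offset-shift problem, at the cost of extra complexity.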
Conflict Resolution
Conflicts occur when two devices modify the same file while both are offline. Dropbox resolves conflicts by keeping both versions: the server version stays at its path; the client version is saved as “filename (conflicted copy 2024-01-15).ext”. No data is lost; users resolve conflicts manually. Google Drive uses a last-write-wins strategy for binary files but handles conflicts in Google Docs via OT (Operational Transformation). For sync services, the Dropbox approach is safer — automatic merging of arbitrary binary files is impossible without format-specific logic.
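The copy-on-conflict rename can be expressed as a pure function. The naming pattern mirrors the "conflicted copy" convention quoted above; the exact format is illustrative:

```python
from datetime import date
from pathlib import PurePath

def conflicted_copy_name(filename: str, day: date) -> str:
    """Rename a conflicting local version so both copies survive."""
    p = PurePath(filename)
    return f"{p.stem} (conflicted copy {day.isoformat()}){p.suffix}"
```

The server version keeps the original name; the losing device's version is saved under the generated name and synced back out like any other new file.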
Scale and Storage Estimation
500 million users, average 10GB stored per user: 5 exabytes total storage. Deduplication reduces this significantly — Dropbox reported 30% deduplication across its user base. 5 exabytes * 0.7 = 3.5 exabytes effective storage. This is distributed across S3 or equivalent object storage in multiple availability zones. Metadata: 1 billion files * 1KB metadata each = 1TB — easily fits in a distributed relational database (PostgreSQL with read replicas and partitioning by owner_id). Daily uploads: 500M users * 1 file change/day average * 4MB/change = 2 petabytes/day write throughput.
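The estimates above check out with simple integer arithmetic:

```python
USERS = 500_000_000
AVG_STORED = 10 * 10**9               # 10 GB per user (decimal units)

raw = USERS * AVG_STORED              # total raw storage: 5 EB
effective = raw * 7 // 10             # after ~30% deduplication savings
metadata = 1_000_000_000 * 1_000      # 1 billion files x 1 KB each
daily_writes = USERS * 1 * 4 * 10**6  # 1 change/day x one 4 MB chunk
```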
Interview Tips
- Chunking + hashing for deduplication is the core insight — mention it early
- Separate metadata (SQL) from binary data (object storage) — this is the expected architecture
- Delta sync: only modified chunks are uploaded — critical for large file efficiency
- Conflict resolution policy matters: clarify whether automatic merge or copy-on-conflict is acceptable
- Offline support: local cache of recent files + sync queue for pending uploads
Frequently Asked Questions
How does Dropbox sync files efficiently across devices?
Dropbox uses a chunking strategy: files are split into 4MB chunks identified by their SHA-256 hash. When a file changes, only the modified chunks are uploaded — not the entire file. This delta sync means a 500MB video file with a small metadata edit uploads only 4MB. Chunks are content-addressed, so if two users upload identical files (or identical chunks), only one copy is stored in S3. A client-side sync agent watches the file system for changes using OS events (FSEvents on macOS, inotify on Linux), computes the chunk delta, and uploads changed blocks. Other devices receive a push notification (via WebSocket) containing the new metadata, then download only the changed chunks.
How do you handle file conflicts in a sync service?
File conflicts occur when the same file is modified on two devices while both are offline. There are two main resolution strategies: (1) Copy-on-conflict (Dropbox approach): keep both versions. The server version stays at its original path; the conflicting local version is saved as "filename (conflicted copy 2024-01-15).ext". No data is lost; users resolve the conflict manually by comparing files. (2) Last-write-wins: whichever version has a later modification timestamp is kept; the other is discarded. Simple but can silently lose data. For generic file sync, copy-on-conflict is safer because automatic merging of arbitrary binary files (images, PDFs, executables) is impossible. Format-specific merge (like Google Docs uses OT for text) requires content-aware logic.
How does file deduplication work in cloud storage?
Deduplication stores only one copy of identical data, even if multiple users upload it. Block-level deduplication: files are split into fixed-size chunks (4MB typical). Each chunk is hashed (SHA-256). Before uploading, the client asks the server: "do you have a block with this hash?" If yes, the upload is skipped entirely — only a reference to the existing block is stored in the metadata. This is called a "hashcheck" or "block existence check." Storage savings depend on the dataset: Dropbox reported ~30% storage reduction from deduplication across its user base. For backup systems with many similar files (database snapshots, VM images), deduplication can reduce storage by 80-90%. The metadata service tracks reference counts per block for garbage collection — blocks with zero references can be deleted.
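The reference counting mentioned at the end can be sketched with a hypothetical in-memory store. Real systems defer and batch deletion to avoid races between a decrement and a concurrent re-upload of the same block:

```python
class BlockStore:
    """Content-addressed block store with reference counting (illustrative)."""

    def __init__(self):
        self.blocks = {}     # block_id -> chunk bytes
        self.ref_count = {}  # block_id -> number of file versions referencing it

    def put(self, block_id, chunk=None):
        """Add a reference; store the chunk only if it is new (deduplication)."""
        if block_id not in self.blocks:
            self.blocks[block_id] = chunk
        self.ref_count[block_id] = self.ref_count.get(block_id, 0) + 1

    def release(self, block_id):
        """Drop a reference; garbage-collect the block at zero references."""
        self.ref_count[block_id] -= 1
        if self.ref_count[block_id] == 0:
            del self.blocks[block_id]
            del self.ref_count[block_id]
```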
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does Dropbox sync files efficiently across devices?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Dropbox uses a chunking strategy: files are split into 4MB chunks identified by their SHA-256 hash. When a file changes, only the modified chunks are uploaded — not the entire file. This delta sync means a 500MB video file with a small metadata edit uploads only 4MB. Chunks are content-addressed, so if two users upload identical files (or identical chunks), only one copy is stored in S3. A client-side sync agent watches the file system for changes using OS events (FSEvents on macOS, inotify on Linux), computes the chunk delta, and uploads changed blocks. Other devices receive a push notification (via WebSocket) containing the new metadata, then download only the changed chunks."
      }
    },
    {
      "@type": "Question",
      "name": "How do you handle file conflicts in a sync service?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "File conflicts occur when the same file is modified on two devices while both are offline. There are two main resolution strategies: (1) Copy-on-conflict (Dropbox approach): keep both versions. The server version stays at its original path; the conflicting local version is saved as 'filename (conflicted copy 2024-01-15).ext'. No data is lost; users resolve the conflict manually by comparing files. (2) Last-write-wins: whichever version has a later modification timestamp is kept; the other is discarded. Simple but can silently lose data. For generic file sync, copy-on-conflict is safer because automatic merging of arbitrary binary files (images, PDFs, executables) is impossible. Format-specific merge (like Google Docs uses OT for text) requires content-aware logic."
      }
    },
    {
      "@type": "Question",
      "name": "How does file deduplication work in cloud storage?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Deduplication stores only one copy of identical data, even if multiple users upload it. Block-level deduplication: files are split into fixed-size chunks (4MB typical). Each chunk is hashed (SHA-256). Before uploading, the client asks the server: 'do you have a block with this hash?' If yes, the upload is skipped entirely — only a reference to the existing block is stored in the metadata. This is called a 'hashcheck' or 'block existence check.' Storage savings depend on the dataset: Dropbox reported ~30% storage reduction from deduplication across its user base. For backup systems with many similar files (database snapshots, VM images), deduplication can reduce storage by 80-90%. The metadata service tracks reference counts per block for garbage collection — blocks with zero references can be deleted."
      }
    }
  ]
}