Dropbox syncs files across devices for more than 700 million registered users and handles billions of file operations daily. Designing a file storage and sync system tests your understanding of chunking, deduplication, conflict resolution, and efficient delta synchronization. This guide covers the architecture from file upload to cross-device sync, a classic system design interview question that goes deeper than most candidates expect.
Architecture Overview
Core services:
(1) Metadata service: stores the file/folder hierarchy, permissions, versions, and chunk manifests. It is the single source of truth for which files exist and their current state. Backed by PostgreSQL sharded by user_id.
(2) Block storage service: stores the actual file content as chunks in object storage (S3/GCS). Each chunk is content-addressed: its key is the SHA-256 hash of its content.
(3) Sync service: coordinates changes between devices. When a file changes on one device, the sync service notifies all other devices and orchestrates the transfer of changed chunks.
(4) Notification service: pushes real-time change notifications to connected clients via WebSocket or long polling.

Upload flow:
(1) The client detects a file change and computes chunk hashes.
(2) The client sends the manifest (the list of chunk hashes) to the metadata service.
(3) The metadata service identifies which chunks are new (not already in storage).
(4) The client uploads only the new chunks to block storage.
(5) The metadata service updates the file version.
(6) The notification service alerts other devices.
(7) Other devices download the new or changed chunks.
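The upload flow above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Dropbox's implementation: `BlockStore`, `MetadataService`, `upload`, and the `notify` callback are hypothetical names standing in for the services described in this section, and dictionaries stand in for PostgreSQL and object storage.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, the chunk size used in this guide


def split(data, size=CHUNK_SIZE):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockStore:
    """Hypothetical stand-in for the block storage service (S3/GCS in the text)."""

    def __init__(self):
        self.blocks = {}  # SHA-256 hex digest -> chunk bytes

    def put(self, chunk_hash, data):
        self.blocks[chunk_hash] = data


class MetadataService:
    """Hypothetical stand-in for the metadata service."""

    def __init__(self, store):
        self.store = store
        self.manifests = {}  # path -> list of manifests, one per version

    def missing(self, manifest):
        """Return the hashes in the manifest that block storage does not hold yet."""
        return {h for h in manifest if h not in self.store.blocks}

    def commit(self, path, manifest):
        """Record a new version of the file and return its version number."""
        versions = self.manifests.setdefault(path, [])
        versions.append(list(manifest))
        return len(versions)


def upload(path, data, meta, store, notify, chunk_size=CHUNK_SIZE):
    """Run the upload flow once; returns (new_version, chunks_uploaded)."""
    chunks = split(data, chunk_size)
    manifest = [hashlib.sha256(c).hexdigest() for c in chunks]
    needed = meta.missing(manifest)        # server says which chunks are new
    uploaded = 0
    for chunk_hash, chunk in zip(manifest, chunks):
        if chunk_hash in needed:
            store.put(chunk_hash, chunk)   # only new chunks cross the network
            needed.discard(chunk_hash)     # a chunk repeated in the file is sent once
            uploaded += 1
    version = meta.commit(path, manifest)  # metadata is updated after chunks are stored
    notify(path, version)                  # notification service alerts other devices
    return version, uploaded
```

Committing the manifest only after the chunks are stored means another device never sees a version whose chunks are not yet downloadable. That ordering is a design choice of this sketch.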
File Chunking
Files are split into chunks before upload. Dropbox's publicly described design uses fixed-size blocks of up to 4 MB. Benefits:
(1) Delta sync: when a user modifies a file, only the changed chunks need to be uploaded. For a 1 GB file where a small region is overwritten in place, only 1-2 chunks (4-8 MB) are uploaded instead of the full 1 GB.
(2) Resumable uploads: if the connection drops during a large file upload, the client resumes from the last successfully uploaded chunk. No need to restart.
(3) Parallel upload/download: multiple chunks are transferred concurrently for higher throughput.
(4) Deduplication: identical chunks (same content hash) are stored once regardless of how many files contain them.

Fixed-size versus variable-size chunking: fixed-size chunking handles in-place edits well but not insertions. When bytes are inserted at the beginning of a file, every later byte shifts, so every fixed-size boundary falls on different content and all chunk hashes change, creating false mismatches even though almost nothing in the file is new. Variable-size (content-defined) chunking, commonly built on Rabin fingerprinting, places boundaries based on content instead: a rolling hash is computed over a small sliding window, and a boundary is declared wherever the hash matches a chosen bit pattern. After an insertion, only the chunks near the edit change; most boundaries remain the same. In an interview, state which scheme you are choosing and why: fixed 4 MB chunks are simpler, while content-defined chunking gives better deduplication and delta sync efficiency for files that change by insertion.
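The difference between the two chunking schemes is easiest to see by prepending one byte to a file and counting how many chunks survive. The sketch below uses a simple polynomial rolling hash as a stand-in for Rabin fingerprinting (it is not a true Rabin fingerprint), tiny chunk sizes so the effect shows on a few kilobytes, and no minimum or maximum chunk size, which production chunkers normally add. All names and constants are illustrative assumptions.

```python
import hashlib
import random

WINDOW = 16          # bytes of context that decide whether to cut
MASK = (1 << 6) - 1  # cut when the low 6 bits of the hash are zero
BASE = 257
MOD = (1 << 31) - 1


def fixed_chunks(data, size=64):
    """Cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data):
    """Cut after byte i when the rolling hash of the last WINDOW bytes matches MASK."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                 # trailing partial chunk
    return chunks


def reused_fraction(old_chunks, new_chunks):
    """Fraction of new chunks whose SHA-256 already exists among the old chunks."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    hits = sum(hashlib.sha256(c).digest() in known for c in new_chunks)
    return hits / len(new_chunks) if new_chunks else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    original = bytes(rng.getrandbits(8) for _ in range(4096))
    edited = b"!" + original  # one byte inserted at the very beginning
    for name, chunker in (("fixed", fixed_chunks), ("content-defined", content_defined_chunks)):
        frac = reused_fraction(chunker(original), chunker(edited))
        print(f"{name}: {frac:.0%} of chunks reused after the insertion")
```

Because each cut depends only on the previous `WINDOW` bytes, every boundary past the insertion point lands on the same content as before, so every original chunk except possibly the first reappears unchanged in the edited file.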
Content-Addressed Deduplication
Each chunk is identified by its content hash: SHA-256(chunk_bytes). If two files (or two users) have identical chunks, the chunk is stored once. The metadata service stores the file as a list of chunk hashes, not the actual data.

Deduplication levels:
(1) Cross-user: if 1000 users sync the same popular PDF, the chunks are stored once. Dropbox reported 75%+ cross-user deduplication for certain file types.
(2) Cross-version: when a file is updated, unchanged chunks from the previous version are already stored. Only new chunks consume storage.
(3) Cross-file: a photo album with duplicated images across folders stores each unique image chunk once.

Storage savings: deduplication typically reduces total storage by 20-60% across the user base. For enterprise accounts with many users sharing documents, savings can exceed 80%.

Upload optimization: before uploading a chunk, the client sends the chunk hash to the server. If the hash already exists (the chunk is already stored), the server responds “already have it” and the client skips the upload. This is “instant upload” for known content: uploading a popular file takes seconds instead of minutes because no data is actually transferred.
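To make the savings concrete, here is a small sketch of a content-addressed store that tracks how many bytes users think they have stored, how many were actually transferred, and how many are physically kept. `DedupStore` and its methods are hypothetical names for illustration; a real block store would sit on object storage rather than a dictionary.

```python
import hashlib


class DedupStore:
    """Illustrative content-addressed store with simple byte accounting."""

    def __init__(self):
        self.blocks = {}        # SHA-256 hex digest -> chunk bytes, stored once
        self.logical_bytes = 0  # bytes users believe they have stored
        self.transferred = 0    # bytes that actually had to be uploaded

    def has(self, chunk_hash):
        """The 'already have it' check the client makes before uploading."""
        return chunk_hash in self.blocks

    def add_file(self, chunks):
        """Store a file given as a list of chunks; return its manifest of hashes."""
        manifest = []
        for chunk in chunks:
            key = hashlib.sha256(chunk).hexdigest()
            self.logical_bytes += len(chunk)
            if not self.has(key):       # unknown content: upload and store it
                self.blocks[key] = chunk
                self.transferred += len(chunk)
            manifest.append(key)        # known content: only the hash is recorded
        return manifest

    def read_file(self, manifest):
        """Reassemble a file from its manifest."""
        return b"".join(self.blocks[key] for key in manifest)

    def physical_bytes(self):
        return sum(len(chunk) for chunk in self.blocks.values())

    def savings(self):
        """Share of logical bytes that deduplication avoided storing."""
        if self.logical_bytes == 0:
            return 0.0
        return 1 - self.physical_bytes() / self.logical_bytes
```

If two users add the same two-chunk file, the second `add_file` call transfers nothing and `savings()` reports one half. The manifest, not the chunk data, is what distinguishes one user's file from another's.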
Sync Protocol and Conflict Resolution
Sync protocol: each client maintains a sync cursor, a server-assigned sequence number representing the last state the client has seen. When a client reconnects, it sends its cursor to the metadata service. The server returns all changes since that cursor (new, modified, and deleted files). The client applies the changes and advances its cursor.

Real-time sync: connected clients receive change notifications via WebSocket within seconds of a change. The notification contains the file path and new version; the client then fetches the updated chunk manifest and downloads the changed chunks.

Conflict resolution: if two devices edit the same file while disconnected:
(1) Both upload their changes along with the base version number they started from.
(2) The metadata service detects the conflict (two updates based on the same version).
(3) One version wins (typically the later timestamp), and the other is saved as a “conflicted copy” with the device name and timestamp appended to the filename.
The user sees both files and resolves the conflict manually. This is simpler than OT/CRDT approaches and works well for file-level conflicts. For collaborative editing within a file (Google Docs-style), use OT or CRDT at the character level (covered in our Collaborative Editing guide).
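The cursor and the base-version check can be sketched together as an append-only change journal. Two things here are assumptions rather than anything the text specifies: the names (`MetadataLog`, `Change`, `conflicted_name`) are hypothetical, and the sketch lets the first commit to reach the server win, which is a simpler rule than the timestamp comparison mentioned above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    seq: int      # server-assigned, strictly increasing; doubles as the cursor
    path: str
    version: int
    device: str


def conflicted_name(path, device, when):
    """Append the device name and a timestamp string before the file extension."""
    root, ext = os.path.splitext(path)
    return f"{root} ({device}'s conflicted copy {when}){ext}"


class MetadataLog:
    """Illustrative metadata journal: every accepted change gets the next seq."""

    def __init__(self):
        self.journal = []  # list of Change, ordered by seq
        self.current = {}  # path -> latest version number

    def commit(self, path, base_version, device, when):
        """Accept an update; if its base version is stale, store a conflicted copy."""
        if self.current.get(path, 0) != base_version:
            # Another device already committed on top of base_version: keep both files.
            path = conflicted_name(path, device, when)
        version = self.current.get(path, 0) + 1
        self.current[path] = version
        change = Change(len(self.journal) + 1, path, version, device)
        self.journal.append(change)
        return change

    def changes_since(self, cursor):
        """Return (changes with seq > cursor, new cursor for the client to store)."""
        return self.journal[cursor:], len(self.journal)  # seq n lives at index n - 1
```

A reconnecting client calls `changes_since` with its stored cursor, applies the returned changes in order, and saves the new cursor. A client that is already up to date gets an empty list back.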
Sharing and Permissions
Sharing model: a user can share a file or folder with other users or via a public link. Permission levels: viewer (read-only), editor (read-write), and owner (full control, including sharing and deletion).

Shared folders: when User A shares a folder with User B, the folder appears in B's namespace. Changes by either user sync to both. The metadata service maintains an ACL (Access Control List) per file/folder: a list of (user_id, permission_level) pairs. Every file access checks the ACL.

Public links: a signed URL with an embedded token. The token encodes the file path and permission level. Anyone with the link can access the file without authentication. Links can be password-protected, set to expire, and restricted to view-only (no download). Link revocation: changing the token invalidates the old link immediately.

Team/Enterprise features: admin controls for preventing sharing outside the organization, requiring passwords on shared links, auditing all file access, and remote wipe (deleting synced files from a lost or stolen device).

Storage quotas: per-user or per-team storage limits are enforced at the metadata service. Uploads that would exceed the quota are rejected.
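A rough sketch of the two access paths described above: an ACL lookup for signed-in users and an HMAC-signed token for public links. The token layout (JSON payload plus signature), the `Permission` enum, and the function names are assumptions made for illustration. The sketch also assumes each link has its own secret, so that rotating the secret is what revokes the link.

```python
import base64
import hashlib
import hmac
import json
import time
from enum import IntEnum


class Permission(IntEnum):
    VIEWER = 1
    EDITOR = 2
    OWNER = 3


def can_access(acl, user_id, needed):
    """acl is a list of (user_id, Permission) pairs for one file or folder."""
    return any(uid == user_id and level >= needed for uid, level in acl)


def make_link_token(secret, path, permission, expires_at):
    """Encode path, permission, and expiry, then sign them with the link's secret."""
    payload = json.dumps(
        {"path": path, "perm": int(permission), "exp": expires_at}, sort_keys=True
    ).encode()
    sig = hmac.new(secret, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(sig).decode())


def verify_link_token(secret, token, now=None):
    """Return the decoded claims if the token is authentic and unexpired, else None."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        return None
    claims = json.loads(payload)
    if claims["exp"] < (time.time() if now is None else now):
        return None
    return claims
```

Verification needs no database lookup beyond fetching the link's secret, and any change to the path or permission inside the token breaks the signature. Password protection and view-only enforcement would be additional checks layered on top of this.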