Core Requirements
A file sharing platform allows users to upload, organize, and share files and folders. Key features: file upload/download, folder hierarchy, sharing with fine-grained permissions, real-time sync across devices, and version history. Scale: Google Drive stores 15 billion files. Dropbox syncs 1+ billion files per day. Challenges: efficient storage (deduplication), sync (detecting changes across devices), and real-time collaboration. This is a common system design interview question at Dropbox, Google, and Microsoft.
Storage Architecture
Files are stored in object storage (S3, GCS), not in a relational database: binary blobs don't belong in SQL. Only metadata lives in the database.

Schema:
- File: file_id, owner_id, name, mime_type, size_bytes, storage_key (S3 object key), content_hash (SHA-256 of file content), created_at, modified_at, is_trashed
- FileVersion: version_id, file_id, storage_key, content_hash, size_bytes, modified_at, modified_by
- FolderHierarchy: node_id, parent_id, name, type (FILE, FOLDER), owner_id, path (materialized path for fast tree queries)

Content-addressed storage: the storage_key is derived from the content_hash (e.g., hash[:2]/hash[2:4]/hash). Deduplication: if two users upload the same file (same SHA-256), only one copy is stored in S3; both File rows point to the same storage_key. This is "client-side deduplication" when the client sends the hash before uploading, or "server-side deduplication" when the server checks after upload. Storage savings: for office documents and photos, deduplication typically achieves 30-60% reduction.
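A minimal sketch of the content-addressed key scheme in Python. The two-level hash-prefix fan-out matches the hash[:2]/hash[2:4]/hash layout above; the helper name is illustrative:

```python
import hashlib

def storage_key(content: bytes) -> str:
    """Derive a content-addressed object key from the SHA-256 of the file.

    Fanning out on the first hash bytes (hash[:2]/hash[2:4]/hash) keeps any
    single key prefix from accumulating millions of objects.
    """
    h = hashlib.sha256(content).hexdigest()
    return f"{h[:2]}/{h[2:4]}/{h}"

# Two identical uploads derive the same key, so object storage holds one
# copy; only the metadata rows (File table) differ.
assert storage_key(b"quarterly-report v1") == storage_key(b"quarterly-report v1")
```

Because the key is a pure function of the content, the dedup check is just a key-existence lookup before upload.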
Upload Flow with Chunked Transfer
For large files (> 5 MB): split the file into 5 MB chunks and upload each chunk independently, so a failed transfer can resume from the last successful chunk. After all chunks are uploaded, the server assembles them (S3 multipart upload).

Resumable upload with deduplication:
1. The client computes chunk hashes before uploading.
2. The client sends the list of chunk hashes to the server.
3. The server returns which chunks are already present (dedup check).
4. The client uploads only the missing chunks.

This is Dropbox's "block-based sync": for a 100 MB file in which only 1 MB changed, only the modified chunks need to be uploaded.

Direct-to-S3 upload: the client gets a pre-signed S3 URL from the server and uploads directly to S3 without routing through your servers, avoiding server bandwidth and processing costs. The server receives a completion notification via an S3 event notification or a direct callback from the client. Integrity check: the server verifies that the SHA-256 of the assembled file matches the client-provided hash.
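The chunk-hash negotiation above can be sketched as follows. Function names are illustrative, and the in-memory set stands in for the server's chunk index:

```python
import hashlib

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB chunks, as in the text

def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Client side: split a file into fixed-size chunks and SHA-256 each one."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

def missing_chunks(client_hashes, known_hashes):
    """Server side (dedup check): indices of chunks the client must still upload.

    Chunks whose hashes the server already stores are skipped entirely.
    """
    return [i for i, h in enumerate(client_hashes) if h not in known_hashes]
```

A client resuming an interrupted upload, or re-uploading a file that changed in one place, sends `chunk_hashes(...)` and then transfers only the indices returned by `missing_chunks`.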
Sync Protocol
Sync detects changes on one device and propagates them to the others.

File system watcher: the desktop client watches for file system events (create, modify, delete, rename) using OS APIs: inotify on Linux, FSEvents on macOS, FileSystemWatcher on Windows.

Change detection: on each event, compute the file's hash and compare it to the last known server hash. If they differ, upload the changed chunks.

Notifying other devices: the server publishes the change event to a long-poll endpoint or WebSocket channel keyed by user_id. All connected clients receive the notification and download the updated file or chunks.

Conflict resolution: two devices modify the same file while offline. On reconnect, both submit changes. The server timestamps each change; the later change wins (last-writer-wins). The earlier version is saved as a conflict copy ("file (John's conflicted copy 2024-01-15).docx") and both are shown to the user.

Sync ordering: process changes in the order they happened, tracked by a monotonic version counter per file.
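A minimal sketch of the per-file version counter and conflict detection. The dict-based server state and the conflict-copy naming are illustrative, not a prescribed wire format:

```python
def sync_change(server_state, file_id, base_version, new_hash, device):
    """Apply a client edit; detect a conflict via the per-file version counter.

    base_version is the server version the client's edit was made against.
    If another device already committed a newer version, this edit is a
    conflict: the server keeps its current version as primary and the client
    saves its copy under a conflict-copy name (last-writer-wins).
    """
    current = server_state.get(file_id, {"version": 0, "hash": None})
    if base_version < current["version"]:
        return {
            "status": "conflict",
            "server_version": current["version"],
            "conflict_copy": f"file ({device}'s conflicted copy)",
        }
    server_state[file_id] = {"version": current["version"] + 1, "hash": new_hash}
    return {"status": "ok", "version": current["version"] + 1}
```

Two offline devices that both edited against version 0 illustrate the flow: the first commit succeeds and bumps the counter to 1; the second arrives with a stale base_version and is diverted to a conflict copy.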
Permissions and Sharing
Permission model: OWNER (full control, can delete), EDITOR (read + write), COMMENTER (read + comment), VIEWER (read only). Sharing: a user shares a file or folder with another user or group, or via a public link.

Schema:
- Permission: permission_id, node_id, grantee_id (user or group), access_level, granted_by, granted_at
- Link: link_id, node_id, token (random 32-byte URL-safe string), access_level, expires_at, view_count

Inherited permissions: when a folder is shared, all descendants inherit the folder's permissions. To check file F in folder path /A/B/C: check permissions on F, then C, then B, then A, in that order; the first match wins.

Materialized path: each node stores its full path (e.g., "/root/A/B/C"), so the ancestor chain can be checked with a single indexed query instead of a recursive tree walk.

Permission cache: cache the user→file permission decision in Redis for 5 minutes (most files are accessed repeatedly) and invalidate on permission changes. Public links: validate the token against Redis for a fast, database-free lookup, and apply rate limiting (e.g., 100 requests/min per link) in the link validation middleware.
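The ancestor-chain check above can be sketched in application code. The in-memory grant map and the helper name are illustrative; in the database this is the materialized-path prefix query:

```python
def effective_permission(grants, node_path):
    """Walk the ancestor chain from the most specific path upward.

    grants maps node path -> access level for one user, e.g.
    {"/A": "VIEWER", "/A/B/C": "EDITOR"}. The first grant found wins
    (mirrors "check F, then C, then B, then A"); no grant anywhere up
    the chain means access denied (returns None).
    """
    path = node_path
    while True:
        if path in grants:
            return grants[path]
        if path in ("", "/"):
            return None  # reached the root with no grant: access denied
        path = path.rsplit("/", 1)[0] or "/"
```

For example, with `{"/A": "VIEWER", "/A/B/C": "EDITOR"}`, file /A/B/C/F resolves to EDITOR (nearest ancestor grant), while /A/X falls back to the folder-level VIEWER grant.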
Frequently Asked Questions

How does Dropbox's block-based sync minimize upload bandwidth?

Dropbox divides each file into fixed-size blocks (typically 4 MB). When a file changes, only the modified blocks are re-uploaded, not the entire file. Example: for a 100 MB PowerPoint presentation in which only 2 slides changed, roughly 8 MB of modified blocks are uploaded instead of 100 MB. Algorithm: (1) the client computes the SHA-256 hash of each block of the current file; (2) the client sends the list of block hashes to the server; (3) the server checks which block hashes it already has (content-addressed storage: same hash means same content); (4) the server returns the list of missing blocks; (5) the client uploads only the missing blocks. The efficiency gain: for incremental edits to existing files, upload size drops by 90-99%; for a first upload, all blocks are new and the full file must be sent. The same mechanism enables deduplication: if two users upload the same file, each block is stored only once regardless of who uploaded it. Delta compression goes further: for very similar versions, the client computes a binary diff against the previous version and uploads only the delta.

How do you handle file sync conflicts when the same file is edited on two offline devices?

Conflict detection: when a device reconnects after offline editing, the client compares the file's local state (last_synced_at, local_hash) to the server's current version (server_hash, server_modified_at). A conflict exists if local_hash != server_hash AND server_modified_at > last_synced_at (the server has a newer version that the local changes were not based on). Last-writer-wins: the server keeps the newer modification as the primary version; the older version is saved as a conflict copy, e.g. "document (John's conflicted copy 2024-01-15).docx". Both files are synced to all devices and the user resolves the conflict manually. This is the Dropbox model: simple and transparent. Operational transformation / CRDTs: for text documents (e.g., Google Docs), apply OT or a CRDT to merge the two edits automatically; this is more complex but produces a single merged result. Folder rename conflicts: if two devices rename the same folder to different names, keep both names, one as the primary and one as a conflict rename. Git-style merge: for developer tools (VS Code sync, dotfiles), offer a three-way merge UI. Every strategy except last-writer-wins requires understanding the document type and its semantic content.

How do you implement efficient folder traversal for syncing an entire folder tree?

Problem: on first sync or after a long offline period, the client must reconcile its entire local folder tree with the server's folder tree. The naive approach, walking the server tree node by node (DFS) with one API call per node, costs too many round trips. Efficient approaches: (1) Delta sync API: the server maintains a cursor, an opaque token representing a point in the change history. The client sends GET /delta?cursor=C; the server returns all changes since cursor C (additions, deletions, renames) plus the new cursor, and the client applies them. The next sync sends the new cursor. This is O(changes since last sync), not O(total files); Dropbox's API uses exactly this approach. (2) Merkle tree hashes: each folder has a hash that is a function of its contents' hashes. The client sends its local root hash; if it matches the server's root hash, the tree is in sync and no work is needed. If it differs, recursively compare subtrees, descending only into subtrees whose hashes differ: O(changed nodes + depth) comparisons, not O(total nodes). The Merkle tree approach is efficient for large trees with few changes.

How do you design permission inheritance for shared folders?

A shared folder grants access to all files within it. To check whether user U can access file F, find F's permission, either a direct grant or one inherited from an ancestor folder, by walking the ancestor chain F → parent folder → grandparent → ... → root. The first grant found (starting from the most specific node) wins; if no grant is found anywhere up the chain, access is denied. Efficient check with a materialized path: store each node's full path (e.g., "/root/shared-folder/subfolder/file.pdf"). To check access: SELECT * FROM permissions WHERE :file_path LIKE (node_path || '%') AND grantee_id = :user ORDER BY LENGTH(node_path) DESC LIMIT 1. The longest matching path wins (most specific permission). Index (grantee_id, node_path) for efficient per-user permission queries. Override permissions: a subfolder or file can carry more restrictive permissions than its parent (sharing a folder but not a specific sensitive file within it) by storing an explicit file-level DENY. Check denies before grants: a DENY at the file level rejects access even if the parent folder has a GRANT.

How do you implement file version history efficiently?

Every save creates a new version. For 1M users, each with 1K files and 10 versions per file, that is 10 billion version records. Storage strategies: (1) Full copy per version: each version stores the complete file. Simplest, and any version can be restored without reconstruction. Space cost: 10 versions of a 10 MB file = 100 MB, acceptable at modern storage prices if old versions are tiered to cheaper storage (S3 Glacier). (2) Delta compression: store the diff between consecutive versions (binary diff, xdelta). Reconstruct version N by applying diffs forward from version 1, at a cost of O(N) diff applications; create a full snapshot every 10 versions to bound reconstruction cost. Storage savings: 80-95% for text-heavy files. (3) Block deduplication: if block-based sync is already in use, version history is nearly free. Each version stores only a new manifest (its list of block hashes); unchanged blocks (same hash) are shared across versions, and only changed blocks consume additional storage. This is the most efficient approach for large files. Version retention policy: the free tier keeps 30 days of versions; paid tiers keep 180 days or unlimited. Run a daily cleanup job to delete versions older than the policy window.
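The Merkle-tree comparison described for folder traversal can be sketched on a toy in-memory tree, where folders are dicts and files are bytes. This version recomputes hashes for clarity; a real implementation stores each folder's hash and updates it incrementally:

```python
import hashlib

def tree_hash(node):
    """Merkle hash: a file hashes its content; a folder hashes the sorted
    (name, child_hash) pairs of its children."""
    if isinstance(node, bytes):  # file leaf: raw content
        return hashlib.sha256(node).hexdigest()
    parts = "".join(f"{name}:{tree_hash(child)}" for name, child in sorted(node.items()))
    return hashlib.sha256(parts.encode()).hexdigest()

def changed_subtrees(local, remote, path=""):
    """Descend only into subtrees whose hashes differ; identical subtrees
    are pruned without visiting their contents."""
    if tree_hash(local) == tree_hash(remote):
        return []  # subtree in sync: no work needed
    if isinstance(local, bytes) or isinstance(remote, bytes):
        return [path or "/"]  # a differing leaf (or file/folder mismatch)
    diffs = []
    for name in sorted(set(local) | set(remote)):
        if name not in local or name not in remote:
            diffs.append(f"{path}/{name}")  # added or deleted entry
        else:
            diffs.extend(changed_subtrees(local[name], remote[name], f"{path}/{name}"))
    return diffs
```

With identical subtree "a" and one changed file under "b", the comparison never descends into "a" and reports only the path that actually differs.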