Requirements
Design a file storage system like Google Drive or Dropbox supporting:
- Upload and download files up to 50 GB
- File versioning and restore
- Sharing with granular permissions
- Real-time sync across devices
- 1 billion users, millions of concurrent uploads
Data Model
Four core tables cover files, versions, deduplication, and sharing:
File(file_id, owner_id, name, parent_folder_id, created_at, is_deleted)
FileVersion(version_id, file_id, chunk_manifest JSONB, size_bytes, created_at)
Chunk(chunk_id, sha256_hash, storage_path, ref_count)
SharePermission(file_id, grantee_id, permission_level ENUM('READ','WRITE','ADMIN'))
chunk_manifest is an ordered array of chunk_id values. Reconstructing a file means fetching chunks in order and concatenating them.
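To make the manifest concrete, here is a minimal sketch of content-addressed storage and reconstruction. The in-memory dict stands in for the Chunk table plus S3; `put_chunk` and `reconstruct` are hypothetical helper names, not part of the API above:

```python
import hashlib

# Hypothetical in-memory store standing in for the Chunk table + S3.
chunk_store = {}  # chunk_id (sha256 hex) -> raw bytes

def put_chunk(data: bytes) -> str:
    """Store a chunk under its content hash and return the chunk_id."""
    chunk_id = hashlib.sha256(data).hexdigest()
    chunk_store[chunk_id] = data  # idempotent: same bytes -> same key
    return chunk_id

def reconstruct(chunk_manifest: list) -> bytes:
    """Fetch chunks in manifest order and concatenate them."""
    return b"".join(chunk_store[cid] for cid in chunk_manifest)

manifest = [put_chunk(b"hello "), put_chunk(b"world")]
assert reconstruct(manifest) == b"hello world"
```

Because the key is the content hash, writing the same bytes twice is a no-op, which is the property deduplication relies on.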
Chunked Upload with Deduplication
Large files are split into 4 MB chunks on the client before upload begins:
- Client splits the file and computes SHA-256 for each chunk.
- Client calls POST /files/upload-session with the list of chunk hashes. The server responds with which hashes it already has (deduplication check).
- Client uploads only the missing chunks via PUT /chunks/{upload_id}/{chunk_index}.
- Server stores each chunk in S3 and inserts a row in the Chunk table, or increments ref_count if the hash already exists.
- Client calls POST /files/commit with the ordered list of chunk IDs to finalize the file version.
This design means two users uploading the same file only store one copy in object storage — deduplication by content hash.
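The client-side split-and-hash step can be sketched as follows (a simplification: `split_and_hash` is a hypothetical helper, and the test uses a tiny chunk size instead of 4 MB):

```python
import hashlib
import io

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, matching the design above

def split_and_hash(stream, chunk_size=CHUNK_SIZE):
    """Yield (sha256_hex, chunk_bytes) for each fixed-size chunk of a stream."""
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield hashlib.sha256(data).hexdigest(), data

# Two chunks with identical content produce the same hash -> stored once.
chunks = list(split_and_hash(io.BytesIO(b"a" * 8), chunk_size=4))
assert len(chunks) == 2
assert chunks[0][0] == chunks[1][0]
```

The hashes from this pass are exactly what the client sends to POST /files/upload-session for the deduplication check.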
Resumable Uploads
Network failures are common for large files. The server tracks progress via an UploadSession table:
UploadSession(upload_id, file_id, chunks_expected, chunks_received SET, expires_at)
The client can query GET /upload-session/{upload_id} at any time to learn which chunks were received, then resume by uploading only the missing ones.
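The resume computation itself is trivial set arithmetic over the session state. A sketch of what GET /upload-session/{upload_id} effectively returns (hypothetical function name):

```python
def missing_chunks(chunks_expected: int, chunks_received: set) -> list:
    """Chunk indexes the client still needs to upload for this session."""
    return [i for i in range(chunks_expected) if i not in chunks_received]

# After a dropped connection, the session records chunks 0 and 2 as received:
assert missing_chunks(4, {0, 2}) == [1, 3]
```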
Sync Protocol
To sync changes across devices, the server maintains a change_log table:
ChangeLog(change_id, user_id, file_id, change_type, version_id, created_at)
Clients poll GET /changes?since={last_change_id} or receive pushes via WebSocket. Only the changed chunks are transferred (delta sync). The client merges changes locally; conflicts are resolved by last-write-wins on the version timestamp, or flagged for manual resolution.
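Delta sync reduces to comparing the old and new chunk manifests and downloading only the positions that differ. A minimal sketch (hypothetical helper; real clients may also handle insertions via rolling hashes, which positional comparison does not catch):

```python
def changed_chunks(old_manifest: list, new_manifest: list) -> list:
    """Indexes of chunks that differ between two versions -- only these transfer."""
    changed = []
    for i, cid in enumerate(new_manifest):
        if i >= len(old_manifest) or old_manifest[i] != cid:
            changed.append(i)
    return changed

# One chunk edited, one appended: only chunks 1 and 3 are downloaded.
assert changed_chunks(["a", "b", "c"], ["a", "x", "c", "d"]) == [1, 3]
```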
Sharing and Permissions
The SharePermission table controls access. Public share links use a random opaque token stored in a ShareLink(token, file_id, permission_level, expires_at) table. Permission checks happen at the API gateway before any storage access.
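Share-link issuance and validation can be sketched like this (a dict stands in for the ShareLink table; `create_share_link` and `resolve` are hypothetical names):

```python
import secrets
import time

share_links = {}  # token -> (file_id, permission_level, expires_at)

def create_share_link(file_id, permission_level, ttl_seconds):
    """Mint a random opaque token; the token itself carries no information."""
    token = secrets.token_urlsafe(32)
    share_links[token] = (file_id, permission_level, time.time() + ttl_seconds)
    return token

def resolve(token):
    """Return (file_id, permission_level) for a valid token, else None."""
    entry = share_links.get(token)
    if entry is None or entry[2] < time.time():
        return None
    return entry[0], entry[1]

t = create_share_link("f1", "READ", ttl_seconds=3600)
assert resolve(t) == ("f1", "READ")
assert resolve("bogus") is None
```

Using `secrets` (not `random`) matters here: the token is the entire credential, so it must be unguessable.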
Versioning
Every commit creates a new FileVersion row. Old versions are retained for a configurable period (e.g., 30 days or last 100 versions). A background job soft-deletes expired versions and decrements ref_count on their chunks. When a chunk’s ref_count reaches zero, a GC job removes it from S3.
Storage Stack
- Object storage (S3): raw chunk bytes, addressed by hash
- PostgreSQL: files, versions, chunks, permissions — transactional metadata
- Elasticsearch: full-text search over file names and extracted text content
- Redis: active upload session state, recently accessed chunk manifests
Key APIs
POST /files/upload-session → {upload_id, missing_chunks[]}
PUT /chunks/{upload_id}/{chunk_idx} → 200 OK
POST /files/commit → {file_id, version_id}
GET /files/{file_id} → file metadata + download URL
GET /files/{file_id}/versions → list of FileVersion
GET /changes?since={change_id} → ChangeLog entries
Interview Tips
- Interviewers often ask: what happens if a chunk upload succeeds but commit fails? Answer: upload session is idempotent; client retries commit, server re-checks chunk presence.
- Deduplication works per-chunk, not per-file, so even partially similar files save bandwidth.
- For celebrity files (shared with millions), pre-warm CDN edge caches on commit.
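The idempotent-commit answer from the first tip can be sketched concretely. The server keys the result by upload_id, so a retried commit returns the original version instead of creating a duplicate (all names here are illustrative, not the real API):

```python
committed = {}  # upload_id -> (file_id, version_id)

def commit(upload_id, chunk_ids, chunk_store):
    """Finalize a version; a retried commit returns the same result."""
    if upload_id in committed:
        return committed[upload_id]  # idempotent replay of an earlier success
    missing = [c for c in chunk_ids if c not in chunk_store]
    if missing:
        raise ValueError("chunks not yet uploaded: %s" % missing)
    result = ("file-1", len(committed) + 1)  # hypothetical id assignment
    committed[upload_id] = result
    return result

store = {"a", "b"}
first = commit("u1", ["a", "b"], store)
assert commit("u1", ["a", "b"], store) == first  # retry is a no-op
```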