Requirements
Design a file storage system like Google Drive or Dropbox supporting:
- Upload and download files up to 50 GB
- File versioning and restore
- Sharing with granular permissions
- Real-time sync across devices
- 1 billion users, millions of concurrent uploads
Data Model
Four core tables handle files, versions, deduplication, and sharing:
File(file_id, owner_id, name, parent_folder_id, created_at, is_deleted)
FileVersion(version_id, file_id, chunk_manifest JSONB, size_bytes, created_at)
Chunk(chunk_id, sha256_hash, storage_path, ref_count)
SharePermission(file_id, grantee_id, permission_level ENUM('READ','WRITE','ADMIN'))
chunk_manifest is an ordered array of chunk_id values. Reconstructing a file means fetching chunks in order and concatenating them.
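As a minimal sketch of that reconstruction step, the snippet below joins chunks in manifest order. The in-memory `chunk_store` dict and the `reassemble` helper are illustrative assumptions standing in for reads from object storage, not part of the design above.

```python
from typing import Dict, List


def reassemble(chunk_manifest: List[str], chunk_store: Dict[str, bytes]) -> bytes:
    """Concatenate chunks in manifest order; fail loudly if any chunk is absent."""
    missing = [cid for cid in chunk_manifest if cid not in chunk_store]
    if missing:
        raise KeyError(f"missing chunks: {missing}")
    return b"".join(chunk_store[cid] for cid in chunk_manifest)


# Hypothetical store: chunk_id -> raw bytes (S3 in the real system).
store = {"c1": b"hello ", "c2": b"world"}

# A manifest may reference the same chunk more than once.
data = reassemble(["c1", "c2", "c1"], store)
```

Because the manifest is just a list of IDs, a repeated block inside one file costs a single stored chunk.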
Chunked Upload with Deduplication
Large files are split into 4 MB chunks on the client before upload begins:
- Client splits the file and computes SHA-256 for each chunk.
- Client calls POST /files/upload-session with the list of chunk hashes. Server responds with the hashes it is missing (missing_chunks), skipping any it already stores (deduplication check).
- Client uploads only the missing chunks via PUT /chunks/{upload_id}/{chunk_index}.
- Server stores each chunk in S3 and inserts a row in the Chunk table, or increments ref_count if the hash already exists.
- Client calls POST /files/commit with the ordered list of chunk IDs to finalize the file version.
Because chunks are addressed by content hash, two users uploading the same file store only one copy in object storage.
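The client-side hashing and the server-side deduplication check can be sketched roughly as follows. The function names `chunk_hashes` and `missing_hashes` and the in-memory `known` set are assumptions for illustration; a real server would look hashes up in the Chunk table.

```python
import hashlib
from typing import Iterable, List, Set

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, as in the text


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Split data into fixed-size chunks and return the SHA-256 hex digest of each."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def missing_hashes(client_hashes: Iterable[str], known: Set[str]) -> List[str]:
    """Return hashes the server does not have yet, de-duplicated, in first-seen order."""
    seen: Set[str] = set()
    result: List[str] = []
    for h in client_hashes:
        if h not in known and h not in seen:
            seen.add(h)
            result.append(h)
    return result


# Tiny chunk size so the example is easy to follow.
hashes = chunk_hashes(b"aaaabbbbaaaa", chunk_size=4)
known_on_server = {hashes[1]}            # server already stores the "bbbb" chunk
to_upload = missing_hashes(hashes, known_on_server)

# A 50 GB file at 4 MB per chunk yields this many chunk hashes.
chunks_for_max_file = (50 * 1024**3) // CHUNK_SIZE
```

Here only one chunk needs uploading: the first and third chunks are identical, and the second is already on the server.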
Resumable Uploads
Network failures are common for large files. The server tracks progress via an UploadSession table:
UploadSession(upload_id, file_id, chunks_expected, chunks_received SET, expires_at)
The client can query GET /upload-session/{upload_id} at any time to learn which chunks were received, then resume by uploading only the missing ones.
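A rough sketch of the resume logic, assuming chunk indexes run from 0 to chunks_expected - 1. The `UploadSession` class mirrors the table above but is held in memory here, and its `record` and `missing` methods are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class UploadSession:
    upload_id: str
    chunks_expected: int
    chunks_received: Set[int] = field(default_factory=set)

    def record(self, chunk_index: int) -> None:
        """Mark a chunk as received; repeating the call for the same index is harmless."""
        if not 0 <= chunk_index < self.chunks_expected:
            raise ValueError(f"chunk index {chunk_index} out of range")
        self.chunks_received.add(chunk_index)

    def missing(self) -> List[int]:
        """Indexes the client still has to upload, in ascending order."""
        return [i for i in range(self.chunks_expected) if i not in self.chunks_received]


session = UploadSession(upload_id="u-123", chunks_expected=5)
for idx in (0, 1, 3, 1):   # chunk 1 is retried after a network failure
    session.record(idx)

remaining = session.missing()
```

Storing received indexes as a set is what makes a retried PUT safe: the second write for the same index changes nothing.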
Sync Protocol
To sync changes across devices, the server maintains a ChangeLog table:
ChangeLog(change_id, user_id, file_id, change_type, version_id, created_at)
Clients poll GET /changes?since={last_change_id} or receive pushes via WebSocket. Only the changed chunks are transferred (delta sync). The client merges changes locally; conflicts are resolved by last-write-wins on the version timestamp, or flagged for manual resolution.
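The sketch below illustrates the three ideas in this section: fetching changes after a cursor, picking a winner by last-write-wins, and working out which chunks a device still needs. All function names are hypothetical, and using change_id as a tie-breaker for equal timestamps is an assumption rather than something the design specifies.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Change:
    change_id: int
    file_id: str
    version_id: str
    created_at: float


def changes_since(log: List[Change], last_change_id: int) -> List[Change]:
    """Entries with change_id greater than the client's cursor, oldest first."""
    return sorted((c for c in log if c.change_id > last_change_id),
                  key=lambda c: c.change_id)


def last_write_wins(changes: List[Change]) -> Dict[str, Change]:
    """Per file, keep the change with the latest timestamp (change_id breaks ties)."""
    winners: Dict[str, Change] = {}
    for c in changes:
        cur = winners.get(c.file_id)
        if cur is None or (c.created_at, c.change_id) > (cur.created_at, cur.change_id):
            winners[c.file_id] = c
    return winners


def chunks_to_fetch(new_manifest: List[str], local_chunks: Set[str]) -> List[str]:
    """Delta sync: only chunks the device does not already hold."""
    return [c for c in dict.fromkeys(new_manifest) if c not in local_chunks]


log = [
    Change(1, "f1", "v1", 100.0),
    Change(2, "f1", "v2", 200.0),
    Change(3, "f2", "v3", 150.0),
    Change(4, "f1", "v4", 180.0),   # arrived later but carries an older timestamp
]
pending = changes_since(log, last_change_id=1)
winners = last_write_wins(pending)
delta = chunks_to_fetch(["a", "b", "c", "b"], local_chunks={"a"})
```

Note that the winner for f1 is v2, not the most recently logged v4, which is exactly the case where flagging for manual resolution may be preferable.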
Sharing and Permissions
The SharePermission table controls access. Public share links use a random opaque token stored in a ShareLink(token, file_id, permission_level, expires_at) table. Permission checks happen at the API gateway before any storage access.
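One way to sketch share-link creation and checking, using Python's standard `secrets` module for the opaque token. The helper names, the 32-byte token length, and the assumption that permission levels are ordered READ < WRITE < ADMIN are illustrative choices, not part of the design above.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional

# Assumption: levels are hierarchical, so ADMIN implies WRITE implies READ.
LEVELS = {"READ": 1, "WRITE": 2, "ADMIN": 3}


@dataclass(frozen=True)
class ShareLink:
    token: str
    file_id: str
    permission_level: str
    expires_at: float


def create_share_link(file_id: str, permission_level: str,
                      ttl_seconds: float, now: Optional[float] = None) -> ShareLink:
    """Create a link with an unguessable URL-safe token."""
    if permission_level not in LEVELS:
        raise ValueError(f"unknown permission level: {permission_level}")
    issued_at = time.time() if now is None else now
    return ShareLink(secrets.token_urlsafe(32), file_id,
                     permission_level, issued_at + ttl_seconds)


def link_allows(link: ShareLink, file_id: str, required: str, now: float) -> bool:
    """True if the link targets this file, has not expired, and grants enough access."""
    return (link.file_id == file_id
            and now < link.expires_at
            and LEVELS[link.permission_level] >= LEVELS[required])


link = create_share_link("f1", "READ", ttl_seconds=3600, now=1000.0)
can_read = link_allows(link, "f1", "READ", now=1500.0)
can_write = link_allows(link, "f1", "WRITE", now=1500.0)
after_expiry = link_allows(link, "f1", "READ", now=5000.0)
```

Passing `now` explicitly keeps the check deterministic in tests; a gateway would supply the current time.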
Versioning
Every commit creates a new FileVersion row. Old versions are retained for a configurable period (e.g., 30 days or last 100 versions). A background job soft-deletes expired versions and decrements ref_count on their chunks. When a chunk’s ref_count reaches zero, a GC job removes it from S3.
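The reference-count bookkeeping might look like the sketch below. It assumes ref_count holds one reference per version that uses a chunk, regardless of how many times the chunk appears in that version's manifest; the design does not say which counting rule is used, so treat that as a labeled assumption. `release_version` is a hypothetical helper name.

```python
from typing import Dict, List


def release_version(chunk_manifest: List[str], ref_counts: Dict[str, int]) -> List[str]:
    """Drop one reference per distinct chunk in an expired version.

    Returns the chunk IDs whose count reached zero, i.e. candidates for the GC job.
    """
    collectable: List[str] = []
    # dict.fromkeys de-duplicates while preserving manifest order.
    for chunk_id in dict.fromkeys(chunk_manifest):
        if ref_counts.get(chunk_id, 0) <= 0:
            raise ValueError(f"ref_count underflow for chunk {chunk_id}")
        ref_counts[chunk_id] -= 1
        if ref_counts[chunk_id] == 0:
            collectable.append(chunk_id)
    return collectable


# Chunk "a" is shared with another version; chunk "b" is used only here.
ref_counts = {"a": 2, "b": 1}
to_delete = release_version(["a", "b", "a"], ref_counts)
```

In a real deployment the decrement and the zero check would need to run in one transaction so that a concurrent upload of the same hash cannot race with the GC job.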
Storage Stack
- Object storage (S3): raw chunk bytes, addressed by hash
- PostgreSQL: files, versions, chunks, permissions — transactional metadata
- Elasticsearch: full-text search over file names and extracted text content
- Redis: active upload session state, recently accessed chunk manifests
Key APIs
POST /files/upload-session → {upload_id, missing_chunks[]}
PUT /chunks/{upload_id}/{chunk_index} → 200 OK
POST /files/commit → {file_id, version_id}
GET /files/{file_id} → file metadata + download URL
GET /files/{file_id}/versions → list of FileVersion
GET /changes?since={change_id} → ChangeLog entries
Interview Tips
- Interviewers often ask: what happens if a chunk upload succeeds but commit fails? Answer: the commit call is idempotent, so the client retries it and the server re-checks that every chunk is present before creating the version.
- Deduplication works per-chunk, not per-file, so even partially similar files save bandwidth.
- For celebrity files (shared with millions), pre-warm CDN edge caches on commit.
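The retry-safe commit from the first tip could be sketched like this. Keying the result on upload_id is an assumption about how idempotency might be achieved; `CommitService` and its fields are hypothetical and keep state in memory purely for illustration.

```python
import uuid
from typing import Dict, List, Set


class CommitService:
    def __init__(self, stored_chunks: Set[str]) -> None:
        self.stored_chunks = stored_chunks
        self.committed: Dict[str, str] = {}   # upload_id -> version_id

    def commit(self, upload_id: str, chunk_ids: List[str]) -> str:
        """Create a version once per upload_id; retries return the same version_id."""
        if upload_id in self.committed:
            return self.committed[upload_id]
        missing = [c for c in chunk_ids if c not in self.stored_chunks]
        if missing:
            # Client should re-upload these chunks, then retry the commit.
            raise LookupError(f"chunks not stored yet: {missing}")
        version_id = uuid.uuid4().hex
        self.committed[upload_id] = version_id
        return version_id


svc = CommitService(stored_chunks={"a", "b"})
first = svc.commit("u1", ["a", "b"])
retry = svc.commit("u1", ["a", "b"])   # e.g. the first response was lost
```

A retried commit returns the original version_id instead of creating a duplicate FileVersion row.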