File Storage System (Google Drive / Dropbox) Low-Level Design

Requirements

Design a file storage system like Google Drive or Dropbox supporting:

  • Upload and download files up to 50 GB
  • File versioning and restore
  • Sharing with granular permissions
  • Real-time sync across devices
  • 1 billion users, millions of concurrent uploads

Data Model

Four core tables handle files, versions, deduplication, and sharing:

File(file_id, owner_id, name, parent_folder_id, created_at, is_deleted)
FileVersion(version_id, file_id, chunk_manifest JSONB, size_bytes, created_at)
Chunk(chunk_id, sha256_hash, storage_path, ref_count)
SharePermission(file_id, grantee_id, permission_level ENUM('READ','WRITE','ADMIN'))

chunk_manifest is an ordered array of chunk_id values. Reconstructing a file means fetching chunks in order and concatenating them.
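A minimal sketch of reconstruction, assuming an in-memory dict stands in for the chunk store (in production this would be fetches from object storage):

```python
from typing import Dict, List

# Hypothetical in-memory chunk store standing in for S3; keys are chunk IDs.
CHUNK_STORE: Dict[str, bytes] = {
    "c1": b"hello ",
    "c2": b"world",
}

def reconstruct_file(chunk_manifest: List[str]) -> bytes:
    """Fetch chunks in manifest order and concatenate them into the file."""
    return b"".join(CHUNK_STORE[chunk_id] for chunk_id in chunk_manifest)
```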

Chunked Upload with Deduplication

Large files are split into 4 MB chunks on the client before upload begins:

  1. Client splits the file and computes SHA-256 for each chunk.
  2. Client calls POST /files/upload-session with the list of chunk hashes. Server responds with which hashes it already has (deduplication check).
  3. Client uploads only the missing chunks via PUT /chunks/{upload_id}/{chunk_index}.
  4. Server stores each chunk in S3 and inserts a row in Chunk table, or increments ref_count if the hash already exists.
  5. Client calls POST /files/commit with the ordered list of chunk IDs to finalize the file version.

This design means two users uploading the same file only store one copy in object storage — deduplication by content hash.
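Steps 1 and 2 can be sketched as follows; `split_and_hash` is the client-side chunking, and `missing_chunks` models the server's deduplication check (the function names are illustrative, not a real API):

```python
import hashlib
from typing import List, Set, Tuple

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, as in the design above

def split_and_hash(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[Tuple[str, bytes]]:
    """Step 1: split the file into fixed-size chunks and compute SHA-256 per chunk."""
    chunks = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        chunks.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return chunks

def missing_chunks(client_hashes: List[str], server_has: Set[str]) -> List[str]:
    """Step 2: the server returns only the hashes it does not already store."""
    return [h for h in client_hashes if h not in server_has]
```

The client then uploads only what `missing_chunks` returns, which is where the bandwidth savings come from.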

Resumable Uploads

Network failures are common for large files. The server tracks progress via an UploadSession table:

UploadSession(upload_id, file_id, chunks_expected, chunks_received SET, expires_at)

The client can query GET /upload-session/{upload_id} at any time to learn which chunks were received, then resume by uploading only the missing ones.
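The resume logic on the client reduces to a set difference over chunk indices; a sketch, assuming the session response carries `chunks_expected` and the set of received indices:

```python
from typing import List, Set

def chunks_to_resume(chunks_expected: int, chunks_received: Set[int]) -> List[int]:
    """Given session state from GET /upload-session/{upload_id},
    return the chunk indices the client still needs to upload."""
    return [i for i in range(chunks_expected) if i not in chunks_received]
```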

Sync Protocol

To sync changes across devices, the server maintains a change_log table:

ChangeLog(change_id, user_id, file_id, change_type, version_id, created_at)

Clients poll GET /changes?since={last_change_id} or receive pushes via WebSocket. Only the changed chunks are transferred (delta sync). The client merges changes locally; conflicts are resolved by last-write-wins on the version timestamp, or flagged for manual resolution.
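Delta sync boils down to diffing the old and new chunk manifests and downloading only the difference; a minimal sketch:

```python
from typing import List

def changed_chunks(old_manifest: List[str], new_manifest: List[str]) -> List[str]:
    """Return chunk IDs present in the new version but not the old one --
    these are the only chunks the client must download during delta sync."""
    old = set(old_manifest)
    return [c for c in new_manifest if c not in old]
```

If only one 4 MB chunk of a large file changed, the client fetches exactly that chunk rather than re-downloading the whole file.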

Sharing and Permissions

The SharePermission table controls access. Public share links use a random opaque token stored in a ShareLink(token, file_id, permission_level, expires_at) table. Permission checks happen at the API gateway before any storage access.
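Creating a share link is mostly about generating an unguessable token; a sketch using the standard library (`create_share_link` and its return shape are illustrative assumptions, not the article's API):

```python
import secrets
import time

def create_share_link(file_id: str, permission_level: str, ttl_seconds: int) -> dict:
    """Generate a random opaque token for a public share link,
    matching the ShareLink(token, file_id, permission_level, expires_at) row."""
    return {
        "token": secrets.token_urlsafe(32),  # ~256 bits of entropy, URL-safe
        "file_id": file_id,
        "permission_level": permission_level,
        "expires_at": time.time() + ttl_seconds,
    }
```

Using `secrets` rather than `random` matters here: the token is the only credential protecting the file.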

Versioning

Every commit creates a new FileVersion row. Old versions are retained for a configurable period (e.g., 30 days or last 100 versions). A background job soft-deletes expired versions and decrements ref_count on their chunks. When a chunk’s ref_count reaches zero, a GC job removes it from S3.
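The reference-counted GC step can be sketched as follows, with `ref_counts` standing in for the Chunk table (in production the decrement would be a transactional UPDATE):

```python
from typing import Dict, List

def gc_version(chunk_manifest: List[str], ref_counts: Dict[str, int]) -> List[str]:
    """Decrement ref_count for every chunk of an expired version.
    Returns chunk IDs whose count hit zero -- candidates for S3 deletion."""
    dead = []
    for chunk_id in chunk_manifest:
        ref_counts[chunk_id] -= 1
        if ref_counts[chunk_id] == 0:
            dead.append(chunk_id)
    return dead
```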

Storage Stack

  • Object storage (S3): raw chunk bytes, addressed by hash
  • PostgreSQL: files, versions, chunks, permissions — transactional metadata
  • Elasticsearch: full-text search over file names and extracted text content
  • Redis: active upload session state, recently accessed chunk manifests

Key APIs

POST   /files/upload-session          → {upload_id, missing_chunks[]}
PUT    /chunks/{upload_id}/{chunk_idx} → 200 OK
POST   /files/commit                  → {file_id, version_id}
GET    /files/{file_id}               → file metadata + download URL
GET    /files/{file_id}/versions      → list of FileVersion
GET    /changes?since={change_id}     → ChangeLog entries

Interview Tips

  • Interviewers often ask: what happens if a chunk upload succeeds but commit fails? Answer: upload session is idempotent; client retries commit, server re-checks chunk presence.
  • Deduplication works per-chunk, not per-file, so even partially similar files save bandwidth.
  • For celebrity files (shared with millions), pre-warm CDN edge caches on commit.

