Requirements
Design a file storage system like Google Drive or Dropbox supporting:
- Upload and download files up to 50 GB
- File versioning and restore
- Sharing with granular permissions
- Real-time sync across devices
- 1 billion users, millions of concurrent uploads
Data Model
Four core tables cover files, versions, deduplication, and sharing:
File(file_id, owner_id, name, parent_folder_id, created_at, is_deleted)
FileVersion(version_id, file_id, chunk_manifest JSONB, size_bytes, created_at)
Chunk(chunk_id, sha256_hash, storage_path, ref_count)
SharePermission(file_id, grantee_id, permission_level ENUM('READ','WRITE','ADMIN'))
chunk_manifest is an ordered array of chunk_id values. Reconstructing a file means fetching chunks in order and concatenating them.
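To make the manifest concrete, here is a minimal sketch of content-addressed storage and reconstruction. The in-memory dict stands in for the Chunk table plus S3; `put_chunk` and `reconstruct` are hypothetical helper names, not part of the API above:

```python
import hashlib

# Hypothetical in-memory store standing in for the Chunk table + S3.
chunk_store = {}  # chunk_id (sha256 hex) -> raw bytes

def put_chunk(data: bytes) -> str:
    """Store a chunk under its content hash and return the chunk_id."""
    chunk_id = hashlib.sha256(data).hexdigest()
    chunk_store[chunk_id] = data  # idempotent: same bytes -> same key
    return chunk_id

def reconstruct(chunk_manifest: list) -> bytes:
    """Fetch chunks in manifest order and concatenate them."""
    return b"".join(chunk_store[cid] for cid in chunk_manifest)

manifest = [put_chunk(b"hello "), put_chunk(b"world")]
assert reconstruct(manifest) == b"hello world"
```

Because the key is the content hash, writing the same bytes twice is a no-op, which is the property deduplication relies on.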
Chunked Upload with Deduplication
Large files are split into 4 MB chunks on the client before upload begins:
- Client splits the file and computes SHA-256 for each chunk.
- Client calls POST /files/upload-session with the list of chunk hashes. The server responds with which hashes it already has (deduplication check).
- Client uploads only the missing chunks via PUT /chunks/{upload_id}/{chunk_index}.
- Server stores each chunk in S3 and inserts a row in the Chunk table, or increments ref_count if the hash already exists.
- Client calls POST /files/commit with the ordered list of chunk IDs to finalize the file version.
This design means two users uploading the same file only store one copy in object storage — deduplication by content hash.
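The client-side split-and-hash step can be sketched as follows (a simplification: `split_and_hash` is a hypothetical helper, and the test uses a tiny chunk size instead of 4 MB):

```python
import hashlib
import io

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, matching the design above

def split_and_hash(stream, chunk_size=CHUNK_SIZE):
    """Yield (sha256_hex, chunk_bytes) for each fixed-size chunk of a stream."""
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield hashlib.sha256(data).hexdigest(), data

# Two chunks with identical content produce the same hash -> stored once.
chunks = list(split_and_hash(io.BytesIO(b"a" * 8), chunk_size=4))
assert len(chunks) == 2
assert chunks[0][0] == chunks[1][0]
```

The hashes from this pass are exactly what the client sends to POST /files/upload-session for the deduplication check.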
Resumable Uploads
Network failures are common for large files. The server tracks progress via an UploadSession table:
UploadSession(upload_id, file_id, chunks_expected, chunks_received SET, expires_at)
The client can query GET /upload-session/{upload_id} at any time to learn which chunks were received, then resume by uploading only the missing ones.
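The resume computation itself is trivial set arithmetic over the session state. A sketch of what GET /upload-session/{upload_id} effectively returns (hypothetical function name):

```python
def missing_chunks(chunks_expected: int, chunks_received: set) -> list:
    """Chunk indexes the client still needs to upload for this session."""
    return [i for i in range(chunks_expected) if i not in chunks_received]

# After a dropped connection, the session records chunks 0 and 2 as received:
assert missing_chunks(4, {0, 2}) == [1, 3]
```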
Sync Protocol
To sync changes across devices, the server maintains a change_log table:
ChangeLog(change_id, user_id, file_id, change_type, version_id, created_at)
Clients poll GET /changes?since={last_change_id} or receive pushes via WebSocket. Only the changed chunks are transferred (delta sync). The client merges changes locally; conflicts are resolved by last-write-wins on the version timestamp, or flagged for manual resolution.
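Delta sync reduces to comparing the old and new chunk manifests and downloading only the positions that differ. A minimal sketch (hypothetical helper; real clients may also handle insertions via rolling hashes, which positional comparison does not catch):

```python
def changed_chunks(old_manifest: list, new_manifest: list) -> list:
    """Indexes of chunks that differ between two versions -- only these transfer."""
    changed = []
    for i, cid in enumerate(new_manifest):
        if i >= len(old_manifest) or old_manifest[i] != cid:
            changed.append(i)
    return changed

# One chunk edited, one appended: only chunks 1 and 3 are downloaded.
assert changed_chunks(["a", "b", "c"], ["a", "x", "c", "d"]) == [1, 3]
```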
Sharing and Permissions
The SharePermission table controls access. Public share links use a random opaque token stored in a ShareLink(token, file_id, permission_level, expires_at) table. Permission checks happen at the API gateway before any storage access.
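Share-link issuance and validation can be sketched like this (a dict stands in for the ShareLink table; `create_share_link` and `resolve` are hypothetical names):

```python
import secrets
import time

share_links = {}  # token -> (file_id, permission_level, expires_at)

def create_share_link(file_id, permission_level, ttl_seconds):
    """Mint a random opaque token; the token itself carries no information."""
    token = secrets.token_urlsafe(32)
    share_links[token] = (file_id, permission_level, time.time() + ttl_seconds)
    return token

def resolve(token):
    """Return (file_id, permission_level) for a valid token, else None."""
    entry = share_links.get(token)
    if entry is None or entry[2] < time.time():
        return None
    return entry[0], entry[1]

t = create_share_link("f1", "READ", ttl_seconds=3600)
assert resolve(t) == ("f1", "READ")
assert resolve("bogus") is None
```

Using `secrets` (not `random`) matters here: the token is the entire credential, so it must be unguessable.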
Versioning
Every commit creates a new FileVersion row. Old versions are retained for a configurable period (e.g., 30 days or last 100 versions). A background job soft-deletes expired versions and decrements ref_count on their chunks. When a chunk’s ref_count reaches zero, a GC job removes it from S3.
Storage Stack
- Object storage (S3): raw chunk bytes, addressed by hash
- PostgreSQL: files, versions, chunks, permissions — transactional metadata
- Elasticsearch: full-text search over file names and extracted text content
- Redis: active upload session state, recently accessed chunk manifests
Key APIs
POST /files/upload-session → {upload_id, missing_chunks[]}
PUT /chunks/{upload_id}/{chunk_idx} → 200 OK
POST /files/commit → {file_id, version_id}
GET /files/{file_id} → file metadata + download URL
GET /files/{file_id}/versions → list of FileVersion
GET /changes?since={change_id} → ChangeLog entries
Interview Tips
- Interviewers often ask: what happens if a chunk upload succeeds but commit fails? Answer: upload session is idempotent; client retries commit, server re-checks chunk presence.
- Deduplication works per-chunk, not per-file, so even partially similar files save bandwidth.
- For celebrity files (shared with millions), pre-warm CDN edge caches on commit.
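The idempotent-commit answer from the first tip can be sketched concretely. The server keys the result by upload_id, so a retried commit returns the original version instead of creating a duplicate (all names here are illustrative, not the real API):

```python
committed = {}  # upload_id -> (file_id, version_id)

def commit(upload_id, chunk_ids, chunk_store):
    """Finalize a version; a retried commit returns the same result."""
    if upload_id in committed:
        return committed[upload_id]  # idempotent replay of an earlier success
    missing = [c for c in chunk_ids if c not in chunk_store]
    if missing:
        raise ValueError("chunks not yet uploaded: %s" % missing)
    result = ("file-1", len(committed) + 1)  # hypothetical id assignment
    committed[upload_id] = result
    return result

store = {"a", "b"}
first = commit("u1", ["a", "b"], store)
assert commit("u1", ["a", "b"], store) == first  # retry is a no-op
```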