File Storage System: Low-Level Design – Tech Interview Dot Org

A file storage system (like Dropbox or Google Drive) stores user files, provides sync across devices, and handles concurrent edits. The core challenges are: efficient storage of large files (chunking for deduplication and resumable uploads), conflict resolution when two clients edit the same file simultaneously, and efficient delta sync (only transferring changed bytes, not entire files).

Chunking Files

Large files are split into fixed-size chunks (4 MB is typical) before storage. Each chunk is hashed with SHA-256, and the hash serves as its content-addressable key. This gives three benefits. Deduplication: if two users upload the same 4 MB chunk (common when many users store the same video file or OS image), it is stored once. Resumable uploads: if an upload fails, the client resumes from the last successfully uploaded chunk. Delta sync: when a file changes, only the modified chunks are re-uploaded, not the entire file.
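As a rough illustration of the chunk-and-hash step, here is a minimal sketch. The function names (iter_chunks, chunk_hashes) and the tiny 4-byte chunk size in the demo are assumptions made for readability; they do not come from any particular product.

```python
import hashlib
from typing import Iterator, List

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, matching the "typical" size in the text


def iter_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return the ordered SHA-256 hex digests, one per chunk."""
    return [hashlib.sha256(c).hexdigest() for c in iter_chunks(data, chunk_size)]


# Demo with a 4-byte chunk size so the behaviour is easy to follow.
file_a = b"AAAA" + b"BBBB" + b"CC"   # three chunks, the last one short
file_b = b"AAAA" + b"ZZZZ"           # shares its first chunk with file_a
hashes_a = chunk_hashes(file_a, chunk_size=4)
hashes_b = chunk_hashes(file_b, chunk_size=4)

# Chunks with identical content get identical keys, so they need storing once.
shared = set(hashes_a) & set(hashes_b)
```

In this demo the two files share exactly one chunk key, which is the deduplication opportunity the text describes.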

Metadata about the file is stored separately: file_id, user_id, filename, size, chunk_hashes (ordered list), created_at, modified_at. The storage system stores chunks in object storage (S3, GCS) keyed by their hash. To reconstruct the file: fetch each chunk by hash in order and concatenate.
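The split between metadata and chunk storage might be sketched as follows. ChunkStore is a hypothetical in-memory stand-in for an object store such as S3 or GCS, and FileMetadata carries only a subset of the fields listed above (timestamps are omitted for brevity).

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


class ChunkStore:
    """In-memory stand-in for an object store keyed by content hash."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        self._blobs.setdefault(key, chunk)  # identical content is stored once
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


@dataclass
class FileMetadata:
    file_id: str
    user_id: str
    filename: str
    size: int
    chunk_hashes: List[str] = field(default_factory=list)  # ordered


def store_file(store: ChunkStore, file_id: str, user_id: str,
               filename: str, data: bytes, chunk_size: int) -> FileMetadata:
    """Write each chunk to the store and return the metadata record."""
    hashes = [store.put(data[i:i + chunk_size])
              for i in range(0, len(data), chunk_size)]
    return FileMetadata(file_id, user_id, filename, len(data), hashes)


def reconstruct(store: ChunkStore, meta: FileMetadata) -> bytes:
    """Fetch each chunk by hash, in order, and concatenate."""
    return b"".join(store.get(h) for h in meta.chunk_hashes)


store = ChunkStore()
meta = store_file(store, "f1", "u1", "notes.txt", b"abcdabcdxy", chunk_size=4)
restored = reconstruct(store, meta)
```

The demo file has three chunks but only two distinct ones, so the store holds two blobs while the metadata still lists three hashes in order.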

Delta Sync

When a file is modified, only changed chunks need to be re-uploaded. The client computes the new file’s chunk hashes and compares them against the server’s stored chunk hashes for that file. Only chunks whose hashes differ are uploaded. For a 100 MB document where the user overwrites 10 KB in the middle, this reduces the upload from 100 MB to about 4 MB (one changed chunk, or two if the edit straddles a chunk boundary). That saving assumes the edit does not change the file’s length. With fixed-size chunking, inserting even one byte shifts every later chunk boundary, so every chunk after the insertion point gets a new hash. Some systems avoid this by using Content-Defined Chunking (CDC) rather than fixed-size chunking: CDC splits at content-dependent boundaries, so an insertion disturbs only the chunks near the edit.
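The comparison step, and the boundary-shift weakness of fixed-size chunks, can be seen in a small sketch. The helper names and the 4-byte chunk size are illustrative assumptions; the comparison here is positional, which is one simple way to read "chunks whose hashes differ".

```python
import hashlib
from typing import List


def chunk_hashes(data: bytes, chunk_size: int) -> List[str]:
    """Ordered SHA-256 hex digests of fixed-size chunks."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def changed_chunk_indexes(old_hashes: List[str], new_hashes: List[str]) -> List[int]:
    """Indexes in the new file whose chunk differs from the old file at that position."""
    return [i for i, h in enumerate(new_hashes)
            if i >= len(old_hashes) or old_hashes[i] != h]


old = b"aaaabbbbccccdddd"
old_hashes = chunk_hashes(old, 4)

# Case 1: overwrite one byte in place. Only the chunk containing it changes.
overwritten = b"aaaabXbbccccdddd"
in_place_changes = changed_chunk_indexes(old_hashes, chunk_hashes(overwritten, 4))

# Case 2: insert one byte at the front. Every boundary shifts, so every
# chunk differs and the whole file would be re-uploaded.
inserted = b"X" + old
insertion_changes = changed_chunk_indexes(old_hashes, chunk_hashes(inserted, 4))
```

The in-place edit touches a single chunk index, while the one-byte insertion marks all five chunks of the new file as changed.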

Conflict Resolution

When two clients edit the same file concurrently (for example, both edit offline and then both sync), each client produces a different version of the file. Resolution strategies:

(1) Last-write-wins: the later upload overwrites the earlier one. Simple, but it silently loses work.
(2) Conflict copy: both versions are kept, and the conflicting version is saved under a new name such as “file (conflicted copy).txt”. The user must merge manually. Dropbox uses this approach.
(3) Operational Transform / CRDT: for text documents, edits are merged at the character level (Google Docs uses Operational Transformation).

Conflict detection: with each upload, the client sends the file version (a timestamp or vector clock) it had when it last synced. If the server’s version has changed since then, another client has written in between and a conflict exists.
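A minimal sketch of version-based conflict detection with the conflict-copy strategy follows. It assumes an integer version counter rather than a timestamp or vector clock, and the MetadataServer class, the commit method, and the naming scheme are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ServerFile:
    version: int
    chunk_hashes: List[str]


def conflict_copy_name(path: str) -> str:
    """Insert ' (conflicted copy)' before the extension (illustrative naming only)."""
    root, ext = os.path.splitext(path)
    return f"{root} (conflicted copy){ext}"


class MetadataServer:
    def __init__(self) -> None:
        self.files: Dict[str, ServerFile] = {}

    def commit(self, path: str, base_version: int,
               chunk_hashes: List[str]) -> Tuple[str, str, int]:
        """Return (status, path_written, version_written)."""
        current = self.files.get(path)
        current_version = current.version if current else 0
        if base_version == current_version:
            # Nobody else wrote since this client last synced: fast-forward.
            self.files[path] = ServerFile(current_version + 1, chunk_hashes)
            return ("ok", path, current_version + 1)
        # Stale base version: keep both by writing a conflict copy.
        # A real system would also make this name unique per user and date.
        copy_path = conflict_copy_name(path)
        self.files[copy_path] = ServerFile(1, chunk_hashes)
        return ("conflict", copy_path, 1)


server = MetadataServer()
created = server.commit("report.txt", 0, ["h1"])    # initial upload
# Clients A and B both sync version 1, edit offline, then upload.
result_a = server.commit("report.txt", 1, ["h2"])   # A syncs first
result_b = server.commit("report.txt", 1, ["h3"])   # B's base is now stale
```

Client A's upload advances the file to version 2, while client B's upload is detected as stale and lands in a separate conflict copy, so neither edit is lost.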

Metadata Store

File metadata (filename, path, version, chunk list) is stored in a relational database. Schema:

files (file_id, user_id, path, parent_folder_id, version, is_deleted, created_at, modified_at)
file_chunks (file_id, chunk_sequence, chunk_hash)
file_versions (file_id, version_number, chunk_hashes, modified_at)

Each file edit creates a new file_versions record. This enables version history (“restore to yesterday’s version”) without storing redundant chunks: only the chunk_hash list changes between versions, and unchanged chunks are shared across versions via content-addressable storage.
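One way to realize this schema is sketched below using SQLite from the Python standard library. The column types, the JSON encoding of chunk_hashes, and the helper names save_version and hashes_for_version are assumptions; the text specifies only the table and column names.

```python
import json
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE files (
    file_id          INTEGER PRIMARY KEY,
    user_id          INTEGER NOT NULL,
    path             TEXT NOT NULL,
    parent_folder_id INTEGER,
    version          INTEGER NOT NULL DEFAULT 0,
    is_deleted       INTEGER NOT NULL DEFAULT 0,
    created_at       TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    modified_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE file_chunks (
    file_id        INTEGER NOT NULL REFERENCES files(file_id),
    chunk_sequence INTEGER NOT NULL,
    chunk_hash     TEXT NOT NULL,
    PRIMARY KEY (file_id, chunk_sequence)
);
CREATE TABLE file_versions (
    file_id        INTEGER NOT NULL REFERENCES files(file_id),
    version_number INTEGER NOT NULL,
    chunk_hashes   TEXT NOT NULL,  -- JSON-encoded ordered list (assumption)
    modified_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (file_id, version_number)
);
"""


def save_version(conn: sqlite3.Connection, file_id: int, hashes: List[str]) -> int:
    """Record a new version and make it the file's current chunk list."""
    row = conn.execute(
        "SELECT COALESCE(MAX(version_number), 0) FROM file_versions WHERE file_id = ?",
        (file_id,)).fetchone()
    new_version = row[0] + 1
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO file_versions (file_id, version_number, chunk_hashes) VALUES (?, ?, ?)",
            (file_id, new_version, json.dumps(hashes)))
        conn.execute("DELETE FROM file_chunks WHERE file_id = ?", (file_id,))
        conn.executemany(
            "INSERT INTO file_chunks (file_id, chunk_sequence, chunk_hash) VALUES (?, ?, ?)",
            [(file_id, seq, h) for seq, h in enumerate(hashes)])
        conn.execute(
            "UPDATE files SET version = ?, modified_at = CURRENT_TIMESTAMP WHERE file_id = ?",
            (new_version, file_id))
    return new_version


def hashes_for_version(conn: sqlite3.Connection, file_id: int, version: int) -> List[str]:
    """Return the ordered chunk hashes recorded for one version."""
    row = conn.execute(
        "SELECT chunk_hashes FROM file_versions WHERE file_id = ? AND version_number = ?",
        (file_id, version)).fetchone()
    return json.loads(row[0])


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO files (file_id, user_id, path) VALUES (1, 42, '/docs/report.txt')")
v1 = save_version(conn, 1, ["h_a", "h_b"])
v2 = save_version(conn, 1, ["h_a", "h_c"])   # chunk h_a is shared by both versions
older = hashes_for_version(conn, 1, v1)      # what a restore would read
```

Restoring an old version only requires reading its hash list; the chunk h_a is referenced by both versions but stored once in the chunk store.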

Upload Flow

The upload involves the client, the API server, and object storage:

(1) The client splits the file into chunks and computes their hashes.
(2) The client sends the chunk hash list to the API server (“which of these chunks do I need to upload?”).
(3) The server returns the list of missing chunk hashes (hashes not yet in object storage).
(4) The client uploads only the missing chunks directly to object storage via pre-signed S3 URLs, bypassing the API server and reducing its load.
(5) The client notifies the API server “upload complete” with the full chunk hash list.
(6) The API server creates or updates the file metadata record.

This flow minimizes API server load (only metadata, no file bytes), enables deduplication before upload (checking hashes first avoids uploading already-stored chunks), and supports resumable uploads naturally.
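The six steps can be simulated in memory as a rough sketch. ObjectStore and ApiServer are hypothetical stand-ins, and the direct store.put call takes the place of an HTTP PUT to a pre-signed URL; no real S3 API is used here.

```python
import hashlib
from typing import Dict, List


class ObjectStore:
    """Stand-in for S3/GCS: blobs keyed by their SHA-256 hex digest."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("chunk content does not match its key")
        self.blobs[key] = data


class ApiServer:
    """Handles metadata only; file bytes never pass through it."""

    def __init__(self, store: ObjectStore) -> None:
        self.store = store
        self.files: Dict[str, List[str]] = {}

    def missing_chunks(self, hashes: List[str]) -> List[str]:
        # Step 3: report which hashes are not yet in object storage.
        return [h for h in dict.fromkeys(hashes) if h not in self.store.blobs]

    def commit(self, path: str, hashes: List[str]) -> None:
        # Step 6: record metadata once every chunk is present.
        absent = [h for h in hashes if h not in self.store.blobs]
        if absent:
            raise ValueError(f"{len(absent)} chunk(s) not uploaded yet")
        self.files[path] = list(hashes)


def upload(api: ApiServer, store: ObjectStore, path: str,
           data: bytes, chunk_size: int) -> int:
    """Run the flow for one file; return how many chunks were actually sent."""
    # Step 1: split and hash.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
    by_hash = dict(zip(hashes, chunks))
    # Steps 2 and 3: ask the server which chunks it lacks.
    missing = api.missing_chunks(hashes)
    # Step 4: send only those chunks straight to object storage.
    for h in missing:
        store.put(h, by_hash[h])
    # Steps 5 and 6: report completion with the full ordered hash list.
    api.commit(path, hashes)
    return len(missing)


store = ObjectStore()
api = ApiServer(store)
first = upload(api, store, "/a.bin", b"aaaabbbbcccc", 4)      # new file
second = upload(api, store, "/a.bin", b"aaaaXXXXcccc", 4)     # one chunk edited
third = upload(api, store, "/copy.bin", b"aaaabbbbcccc", 4)   # duplicate content
```

A resumed upload follows the same path: the client repeats step 2, and chunks that arrived before the failure no longer appear in the missing list.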