System Design Interview: File Storage and Sync Service (Dropbox)

Designing a file storage and sync service like Dropbox combines chunked file storage, delta sync, conflict resolution, and offline-first architecture. It’s a comprehensive system design problem that tests your understanding of distributed storage and client-server synchronization.

Core Requirements

Functional: Upload/download files of any size. Sync files across multiple devices in real-time. Share files/folders with other users. Version history (at least 30 days). Offline support with sync on reconnect.

Non-functional: 500M users, 1B files. Average file size 1MB (photos, documents). 10M active syncs simultaneously. Upload/download: maximize throughput. Sync latency < 5 seconds after file change. Bandwidth efficiency: only transfer changed bytes (delta sync).

Chunked File Storage

Problem: large files (10GB videos) can't be uploaded atomically
  - Connection drops midway → must restart from beginning
  - Can't detect which part of file changed

Solution: split files into 4MB chunks

File upload flow:
  1. Client splits file into 4MB chunks
  2. Hash each chunk: SHA-256(chunk_data) → chunk_id
  3. Query server: which chunk_ids are already uploaded?
     POST /chunks/check {chunk_ids: ["abc123...", "def456..."]}
     → Server returns: {existing: ["abc123..."], missing: ["def456..."]}
  4. Upload only missing chunks (deduplication!)
     POST /chunks/upload {chunk_id: "def456...", data: binary}
     → Stored in S3 with key: chunks/{chunk_id}
  5. Create/update file metadata record:
     POST /files {name: "photo.jpg", chunks: ["abc123", "def456", "ghi789"], size: 12MB}
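The chunking and dedup-check steps above can be sketched in a few lines. A minimal sketch, assuming a hypothetical client helper (the 4MB chunk size and SHA-256 chunk IDs follow the flow above; `missing_chunks` models the server's `/chunks/check` response):

```python
import hashlib
from typing import BinaryIO

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB, as in step 1

def chunk_file(f: BinaryIO) -> list[tuple[str, bytes]]:
    """Split a file into fixed-size chunks, each keyed by SHA-256 of its data."""
    chunks = []
    while True:
        data = f.read(CHUNK_SIZE)
        if not data:
            break
        chunks.append((hashlib.sha256(data).hexdigest(), data))
    return chunks

def missing_chunks(chunk_ids: list[str], existing: set[str]) -> list[str]:
    """Steps 3-4: only chunks the server doesn't already hold get uploaded."""
    return [cid for cid in chunk_ids if cid not in existing]
```

Because the chunk ID is derived from the content, re-uploading an unchanged chunk is impossible by construction; the server's check is a simple set lookup.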

Deduplication benefits:
  If two users upload the same file → only one copy stored in S3
  If user edits a 1GB file (changes 1 paragraph):
    Only 1-2 changed chunks uploaded (4-8MB) vs re-uploading 1GB

Database Design

files table:
  id            UUID PRIMARY KEY
  owner_id      UUID
  name          TEXT
  parent_folder_id UUID NULL
  size_bytes    BIGINT
  mime_type     TEXT
  chunk_ids     TEXT[]  -- ordered list of SHA-256 chunk hashes
  version       INT DEFAULT 1
  is_deleted    BOOLEAN DEFAULT FALSE
  created_at    TIMESTAMPTZ DEFAULT NOW()
  modified_at   TIMESTAMPTZ DEFAULT NOW()

chunks table:
  id            CHAR(64) PRIMARY KEY  -- SHA-256 hash
  size_bytes    INT
  storage_path  TEXT  -- S3 key: chunks/{id}
  ref_count     INT DEFAULT 0  -- how many files reference this chunk

file_versions table (version history):
  id            UUID PRIMARY KEY
  file_id       UUID REFERENCES files(id)
  version_num   INT
  chunk_ids     TEXT[]
  modified_at   TIMESTAMPTZ
  modified_by   UUID

shares table:
  id            UUID PRIMARY KEY
  resource_id   UUID  -- file or folder
  resource_type TEXT  -- file / folder
  owner_id      UUID
  recipient_id  UUID  -- NULL for link sharing
  share_token   TEXT  -- for public link
  permission    TEXT  -- view / edit
  expires_at    TIMESTAMPTZ NULL

Sync Protocol

Multi-device sync challenge:
  Device A edits file.txt → uploads changes → server has version 5
  Device B (offline) edits same file.txt → comes online with local version 4

Sync engine: per-file version numbers + a monotonic server event log (logical timestamps)

Client state:
  Local DB (SQLite): {file_id, name, chunk_ids, server_version, local_version}
  Sync cursor: last_seen_server_event_id

Sync flow (client connects or file changes):
  1. Client → GET /events?since={cursor}
     Server returns: events since cursor
     [{event_id: 100, type: "file_updated", file_id: X, version: 5}, ...]

  2. For each event:
     If local file not modified since last sync:
       Apply server change → update local file
     If local file modified (conflict!):
       → Fork: keep both versions
         Local "file.txt" → "file (conflicted copy 2024-04-16).txt"
         Download server version → save as "file.txt"
         Both versions visible to user

  3. Upload local changes:
     For each locally modified file:
       Upload changed chunks → update server file record
       Server assigns new event_id → client updates cursor
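Step 2 of the sync flow reduces to one comparison: if the local file hasn't diverged from its last-synced server version, fast-forward; otherwise fork. A minimal sketch (the `LocalFile` shape and the conflicted-copy naming are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class LocalFile:
    name: str
    server_version: int  # last version synced from the server
    local_version: int   # bumped on every local edit

def conflict_name(name: str, day: str) -> str:
    """Rename a locally modified file when it loses a sync conflict."""
    stem, sep, ext = name.rpartition(".")
    if sep:
        return f"{stem} (conflicted copy {day}).{ext}"
    return f"{name} (conflicted copy {day})"

def apply_server_event(f: LocalFile, event_version: int,
                       day: str = "2024-04-16") -> list[str]:
    """Return the file names present locally after applying the event."""
    if f.local_version == f.server_version:
        # No local edits since last sync: fast-forward to the server version.
        f.server_version = f.local_version = event_version
        return [f.name]
    # Conflict: keep local edits under a conflicted-copy name; the original
    # name now holds the downloaded server version.
    return [conflict_name(f.name, day), f.name]
```

Forking never discards data, which is why it is the default for opaque binary files where automatic merging is impossible.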

Real-Time Notification (Change Events)

When file changes on any device, all other devices must be notified:

Server-Sent Events (SSE) connection per device:
  Client → GET /events/stream (long-lived HTTP connection)
  Server pushes: "data: {event_id: 101, type: 'file_updated', file_id: X}\n\n"

Event fan-out:
  File update → DB write → Publish to Redis Pub/Sub channel: user:{user_id}:events
  SSE broadcaster (subscribed to Redis) → push to all connected devices for that user

Connection management:
  Multiple devices per user → SSE broadcaster tracks device connections
  Device disconnect → SSE connection closes → broadcaster drops its entry; no durable session state to clean up
  Device reconnect → GET /events?since={last_cursor} to catch up on missed events

Scale:
  10M active syncs × 1 SSE connection = 10M persistent connections
  Dedicated SSE server cluster (Go/nginx, highly concurrent)
  Redis Pub/Sub: fanout per user_id channel
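The fan-out above is a per-user broadcast. A minimal sketch, with an in-memory dict standing in for the Redis `user:{user_id}:events` channel and plain lists standing in for each device's SSE stream (a real broadcaster would subscribe to Redis and write to live HTTP connections):

```python
from collections import defaultdict

class Broadcaster:
    def __init__(self) -> None:
        # user_id -> one queue per connected device
        self.devices: dict[str, list[list]] = defaultdict(list)

    def connect(self, user_id: str) -> list:
        """Register a device; the returned list models its SSE stream."""
        stream: list = []
        self.devices[user_id].append(stream)
        return stream

    def publish(self, user_id: str, event: dict) -> None:
        """Fan one event out to every device the user has connected."""
        for stream in self.devices[user_id]:
            stream.append(event)
```

Keying the channel by user_id means the publisher never needs to know how many devices are online; the broadcaster layer absorbs that fan-out.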

Bandwidth Optimization

Delta encoding for text files:
  Instead of uploading entire 4MB chunk when 1 line changes:
  Compute diff (rsync algorithm): binary diff of old chunk vs new chunk
  Upload only the diff (often < 1KB for small edits)
  Server reconstructs new chunk: old_chunk + delta → new chunk

Rsync algorithm (simplified):
  1. Server sends checksums for each 1KB block of the old version
  2. Client slides a window over the new version, computing checksums
  3. For each matching block: send reference to server block (tiny)
  4. For each non-matching section: send raw bytes
  Net: often 99%+ reduction in bytes uploaded for document edits
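The matching loop above can be sketched as follows. This is a toy version: the weak checksum here is a plain byte sum recomputed per position, whereas real rsync uses a rolling Adler-style checksum updated in O(1) per byte plus a strong hash to rule out collisions; the 1KB block size matches step 1:

```python
BLOCK = 1024  # server-side block size from step 1

def signatures(old: bytes) -> dict[int, int]:
    """Server side: weak checksum -> block index, for each block of the old version."""
    return {sum(old[i:i + BLOCK]): i // BLOCK
            for i in range(0, len(old), BLOCK)}

def delta(new: bytes, sigs: dict[int, int]) -> list:
    """Client side: emit ('ref', block) for matches, raw bytes for the rest."""
    out, literal, i = [], bytearray(), 0
    while i < len(new):
        window = new[i:i + BLOCK]
        block = sigs.get(sum(window)) if len(window) == BLOCK else None
        if block is not None:
            if literal:
                out.append(bytes(literal))
                literal = bytearray()
            out.append(("ref", block))  # tiny reference to a server block
            i += BLOCK
        else:
            literal.append(new[i])  # no match: this byte goes out raw
            i += 1
    if literal:
        out.append(bytes(literal))
    return out
```

Inserting three bytes into a 2KB file yields two block references plus a 3-byte literal, which is the source of the 99%+ savings for small document edits.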

Compression: gzip/brotli all chunks before upload (text: 70% reduction)

Bandwidth throttling:
  Background sync: limit to 30% of available bandwidth
  Detect user activity: slow sync when user is actively using network

Version History and Restore

Dropbox keeps 30-day version history (180 days for Pro):

Every file update:
  INSERT INTO file_versions (file_id, version_num, chunk_ids, modified_at)
  → Old chunk data stays in S3 (referenced by version records)

Restore to previous version:
  User requests version 3 of file.txt:
    SELECT chunk_ids FROM file_versions WHERE file_id=? AND version_num=3
    → Reconstruct file from those chunks (already in S3)
    → Client downloads chunks, reassembles file
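Restore is just ordered concatenation of the version's chunks. A one-function sketch, where `fetch_chunk` stands in for an S3 GET of `chunks/{chunk_id}`:

```python
from typing import Callable

def restore(chunk_ids: list[str],
            fetch_chunk: Callable[[str], bytes]) -> bytes:
    """Reassemble a file version from its ordered chunk list."""
    return b"".join(fetch_chunk(cid) for cid in chunk_ids)
```

No chunk data is ever copied at version-creation time; a version is only a list of hashes, which is why 30-day history is cheap to keep.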

Chunk garbage collection:
  Chunks referenced by no current file AND by no version newer than 30 days → delete from S3
  Background job: scan chunks with ref_count=0 AND no active version references
  Soft delete: mark for deletion, actually delete after 7-day grace period
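The GC predicate above combines two conditions: no live file references the chunk, and no version record within the retention window does. A minimal sketch (the in-memory representation of the `files` and `file_versions` references is illustrative):

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)

def collectible(chunk_id: str,
                live_chunks: set[str],                      # chunk_ids of current files
                version_refs: list[tuple[str, datetime]],   # (chunk_id, modified_at)
                now: datetime) -> bool:
    """True only if neither a current file nor a recent version needs the chunk."""
    if chunk_id in live_chunks:
        return False
    return not any(cid == chunk_id and now - ts < RETENTION
                   for cid, ts in version_refs)
```

Candidates that pass this check would then be soft-deleted and purged after the 7-day grace period described above.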

Interview Discussion Points

  • Why SHA-256 chunk IDs instead of random IDs? Content-addressable storage: if two files share the same chunk (identical file sections), they point to the same chunk in S3 — stored once. The hash IS the identity of the content. Random IDs would store duplicates. Collision probability for SHA-256: astronomically low (2^-128 per pair).
  • How do you handle very large files (100GB video)? S3 multipart upload: chunk the file, upload parts in parallel (up to 10,000 parts × 5MB = 50GB max; for larger files, increase part size). S3 handles reassembly. Client tracks which parts uploaded — resumes after interruption. Streaming: client doesn’t buffer the whole file; reads and uploads 4MB at a time.
  • Conflict resolution strategy: Most cloud storage (Dropbox, iCloud) uses “last write wins” with a conflict copy — no merge attempt. Google Docs achieves true merge via Operational Transformation (OT) or CRDT because it’s a single document type with defined merge semantics. For arbitrary binary files, OT doesn’t apply.

