Low Level Design: File Sync Service

Overview

A file sync service keeps files on multiple devices in sync: a user creates a file on their laptop and it appears on their phone and desktop automatically, with changes propagated as they happen and conflicts surfaced cleanly rather than resolved silently. Dropbox, iCloud Drive, and OneDrive are the canonical examples. Building a correct and efficient sync service requires careful attention to chunked storage, delta transfers, conflict detection, and a device registry that tracks what each device has already seen.

Core Requirements

  • Files up to several GB can be uploaded and synced across devices.
  • Only changed portions of a file (deltas) are transferred on update to minimize bandwidth.
  • Conflict detection: when two devices modify the same file offline, the service creates a conflict copy rather than silently losing data.
  • Each device maintains a local cache and syncs on reconnect.
  • Metadata operations (rename, move, delete) are fast and do not require transferring file data.
  • Files are deduplicated at the chunk level across all users.

Data Model

CREATE TABLE users (
    id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    email       VARCHAR(255) NOT NULL UNIQUE,
    quota_bytes BIGINT UNSIGNED NOT NULL DEFAULT 5368709120,  -- 5 GB default
    used_bytes  BIGINT UNSIGNED NOT NULL DEFAULT 0,
    created_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE devices (
    id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id     BIGINT UNSIGNED NOT NULL,
    device_key  VARCHAR(128) NOT NULL UNIQUE,  -- UUID set on first registration
    name        VARCHAR(255),
    platform    VARCHAR(64),
    last_seen   DATETIME,
    cursor      BIGINT UNSIGNED NOT NULL DEFAULT 0,  -- last sync event ID seen
    INDEX idx_user (user_id)
);

CREATE TABLE files (
    id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id         BIGINT UNSIGNED NOT NULL,
    parent_id       BIGINT UNSIGNED,                -- NULL for root
    name            VARCHAR(1024) NOT NULL,
    is_folder       TINYINT(1) NOT NULL DEFAULT 0,
    size_bytes      BIGINT UNSIGNED NOT NULL DEFAULT 0,
    content_hash    CHAR(64),                        -- SHA-256 over the ordered chunk-hash list (see finish_upload)
    rev             BIGINT UNSIGNED NOT NULL DEFAULT 1,
    server_modified DATETIME(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
    client_modified DATETIME(3),
    is_deleted      TINYINT(1) NOT NULL DEFAULT 0,
    INDEX idx_user_parent (user_id, parent_id),
    INDEX idx_hash (content_hash)
);

CREATE TABLE file_chunks (
    file_id     BIGINT UNSIGNED NOT NULL,
    chunk_index INT UNSIGNED NOT NULL,
    chunk_hash  CHAR(64) NOT NULL,   -- SHA-256 of chunk content
    chunk_size  INT UNSIGNED NOT NULL,
    PRIMARY KEY (file_id, chunk_index),
    INDEX idx_chunk_hash (chunk_hash)
);

CREATE TABLE chunks (
    hash        CHAR(64) PRIMARY KEY,
    size_bytes  INT UNSIGNED NOT NULL,
    storage_key VARCHAR(512) NOT NULL,  -- S3 key or blob storage path
    ref_count   INT UNSIGNED NOT NULL DEFAULT 0,
    created_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE sync_events (
    id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id     BIGINT UNSIGNED NOT NULL,
    file_id     BIGINT UNSIGNED NOT NULL,
    event_type  ENUM('create','update','delete','move','rename') NOT NULL,
    rev         BIGINT UNSIGNED NOT NULL,
    created_at  DATETIME(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
    INDEX idx_user_id (user_id, id)
);

Chunked Storage and Delta Sync

Uploading a 2 GB file on every change is prohibitively expensive in both bandwidth and time. Instead, files are split into variable-size chunks using content-defined chunking (CDC). CDC uses a rolling hash (e.g., Rabin fingerprint) to find chunk boundaries based on the content itself rather than fixed offsets. This means that inserting text near the start of a file does not shift all subsequent chunk boundaries — most chunks remain unchanged. Chunk sizes typically target 4–8 MB with a minimum of 64 KB and a maximum of 16 MB.
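A toy chunker illustrates the idea. This sketch uses a simple shift-based rolling hash rather than a true Rabin fingerprint, and scales the sizes down to a ~1 KB average for illustration (production targets the 4–8 MB range described above); all constants and names are illustrative, not from a real library.

```python
import hashlib

MASK = (1 << 10) - 1   # boundary when (state & MASK) == 0 -> ~1 KB average chunk
MIN_CHUNK = 256
MAX_CHUNK = 4096

def cdc_split(data: bytes):
    """Split data into content-defined chunks as (sha256_hex, bytes) pairs."""
    chunks, start, state = [], 0, 0
    for i, b in enumerate(data):
        # Bytes older than ~32 positions shift out of the 32-bit state,
        # so boundaries depend on recent content, not absolute file offset.
        state = ((state << 1) + b) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (state & MASK) == 0) or length >= MAX_CHUNK:
            chunk = data[start:i + 1]
            chunks.append((hashlib.sha256(chunk).hexdigest(), chunk))
            start, state = i + 1, 0
    if start < len(data):
        tail = data[start:]
        chunks.append((hashlib.sha256(tail).hexdigest(), tail))
    return chunks
```

Inserting bytes near the front of a file changes the chunks around the edit, but the rolling state resynchronizes on the unchanged suffix, so downstream chunk hashes survive and never need re-uploading.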

On upload, the client:

  1. Computes the rolling hash and splits the file into chunks.
  2. SHA-256 hashes each chunk.
  3. Sends the list of chunk hashes to the server in a start_upload request.
  4. The server compares the list against the chunks table and returns a list of hashes that are not yet stored (missing chunks).
  5. The client uploads only the missing chunks.
  6. The client calls finish_upload with the ordered list of all chunk hashes. The server writes the file_chunks rows and computes the file's content_hash as the hash of the chunk hash list.
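The six steps above can be sketched end to end with an in-memory stand-in for the server's chunks and file_chunks tables (class and function names here are illustrative, not a real API):

```python
import hashlib

class SyncServer:
    """In-memory stand-in for the server side of the upload handshake."""

    def __init__(self):
        self.chunk_store = {}   # chunk_hash -> bytes   (the chunks table)
        self.file_chunks = {}   # file_id -> ordered chunk-hash list

    def start_upload(self, chunk_hashes):
        # Step 4: report which declared chunks are not yet stored.
        return [h for h in chunk_hashes if h not in self.chunk_store]

    def upload_chunk(self, chunk: bytes):
        # Step 5: store a chunk under its content address.
        self.chunk_store[hashlib.sha256(chunk).hexdigest()] = chunk

    def finish_upload(self, file_id, chunk_hashes):
        # Step 6: commit the manifest; every chunk must already be present.
        assert all(h in self.chunk_store for h in chunk_hashes)
        self.file_chunks[file_id] = list(chunk_hashes)
        # content_hash = SHA-256 over the concatenated chunk-hash list.
        return hashlib.sha256("".join(chunk_hashes).encode()).hexdigest()

def client_upload(server, file_id, chunks):
    """Steps 1-6 from the client's perspective; returns the content_hash."""
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
    missing = set(server.start_upload(hashes))
    for h, c in zip(hashes, chunks):
        if h in missing:            # transfer only what the server lacks
            server.upload_chunk(c)
    return server.finish_upload(file_id, hashes)
```

Uploading a second file that shares chunks with an existing one transfers only the chunks the server has never seen, which is the dedup property the handshake exists to provide.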

On download, the client requests the chunk list for a file. It already has chunks it downloaded previously (stored in a local chunk cache keyed by hash). It downloads only the missing chunks and reassembles the file locally.

Device Cursor and Polling

Each device tracks its position in the global sync_events table via a cursor (the ID of the last event it processed). When a device comes online, it calls GET /sync?cursor=N. The server returns all events with id > N for that user, up to a page limit. The device processes each event in order, downloading changed chunks as needed, and advances its cursor. Long-polling or WebSocket push can replace polling for lower latency.
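The cursor loop can be sketched as follows, with an in-memory list standing in for a query like `SELECT ... FROM sync_events WHERE user_id = ? AND id > ? ORDER BY id LIMIT ?` (function and field names are illustrative):

```python
def sync_device(events, cursor, page_limit=100):
    """Drain all events past `cursor` page by page; return (cursor, applied)."""
    applied = []
    while True:
        # Stand-in for: SELECT ... WHERE id > cursor ORDER BY id LIMIT n
        page = sorted((e for e in events if e["id"] > cursor),
                      key=lambda e: e["id"])[:page_limit]
        if not page:
            return cursor, applied
        for event in page:
            applied.append(event["event_type"])  # download chunks, apply locally
            cursor = event["id"]                 # advance only after applying
```

Advancing the cursor only after an event is applied makes the loop safe to interrupt: on reconnect the device resumes from the last applied event and at worst re-processes one page, which is harmless because applying an event is idempotent.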

Conflict Resolution

A conflict occurs when two devices modify the same file while both are offline, then sync. Detection algorithm:

  1. Device A uploads a new revision of file F. Server records rev=5 for file F.
  2. Device B comes online and attempts to upload its own modification of F. It sends its known base revision (rev=4).
  3. The server sees that the current server rev (5) does not match the client's base rev (4), meaning another device modified the file in between.
  4. The server creates a conflict copy: a new file named "filename (Device B's conflicted copy 2026-04-17).ext", containing B's content. B's upload is stored as this conflict copy rather than overwriting A's version.
  5. Both files are synced to all devices. The user sees both and resolves manually.

For plain text files, the service can optionally offer a three-way merge using the common ancestor (the snapshot at the base rev). If the merge produces no conflicts, it is applied automatically and the conflict copy is not created. Binary files always produce conflict copies.

Metadata Operations

Rename and move operations update only the files table (name, parent_id) and emit a sync event. No chunk data is touched. Delete sets is_deleted=1 and emits a delete event. Chunk reference counts are decremented asynchronously by a garbage collector that runs periodically; a chunk is physically deleted from object storage only when its ref_count reaches zero.
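A sketch of that garbage-collection pass over in-memory stand-ins for the tables; `delete_blob` is a hypothetical object-storage call, not a real API:

```python
def gc_pass(files, file_chunks, chunks, delete_blob):
    """Release chunk refs held by deleted files, then reap unreferenced chunks."""
    for file_id, f in files.items():
        if f.get("is_deleted") and file_id in file_chunks:
            # Drop the file's manifest and decrement each chunk's ref_count.
            for h in file_chunks.pop(file_id):
                chunks[h]["ref_count"] -= 1
    # Physically delete chunks nothing references any more.
    for h in [h for h, meta in chunks.items() if meta["ref_count"] == 0]:
        delete_blob(chunks[h]["storage_key"])  # remove from object storage
        del chunks[h]
```

Because the pass is idempotent once a file's manifest has been popped, it can run periodically without coordinating with foreground uploads beyond the 24-hour grace period described under failure handling.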

Chunk Deduplication

The chunks table is global across all users. If user A and user B both store a file containing the same 4 MB chunk, that chunk is stored in object storage exactly once. The ref_count column tracks how many file_chunks rows reference it. Deduplication is content-addressed by SHA-256 hash, which is collision-resistant for practical purposes. This can provide 20–40% storage savings for common file types like office documents and source code.

Bandwidth-Efficient Sync Protocol

  • Chunk upload deduplication: The start_upload / check hashes / upload missing / finish_upload handshake ensures zero bytes are transferred for chunks already on the server.
  • Compression: Chunks are compressed (zstd) before upload and stored compressed in object storage. The compression ratio is best for text-heavy files; already-compressed formats (JPEG, MP4, ZIP) are stored uncompressed after a fast entropy check.
  • Parallel chunk uploads: The client uploads multiple missing chunks in parallel (up to 4 concurrent uploads) to saturate available bandwidth.
  • Resumable uploads: Each chunk upload is independent and idempotent. If the connection drops mid-upload, the client re-runs the start_upload handshake; the server skips chunks it already has and the client retransmits only the remaining ones.
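The "fast entropy check" mentioned for compression can be a Shannon-entropy estimate over a sample of the chunk: bytes that are already near 8 bits/byte of entropy (JPEG, MP4, ZIP) will not compress further. The 7.5 bits/byte threshold below is an illustrative choice, not a standard value:

```python
import math
from collections import Counter

def should_compress(sample: bytes, threshold_bits=7.5) -> bool:
    """True if byte entropy suggests zstd will meaningfully shrink the chunk."""
    if not sample:
        return False
    n = len(sample)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(sample).values())
    return entropy < threshold_bits  # near 8 bits/byte => already compressed
```

In practice the check runs over a small prefix of the chunk rather than the whole thing, since a few kilobytes is enough to distinguish text from compressed media.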

Failure Handling and Edge Cases

  • Partial upload: If the client crashes after uploading some chunks but before calling finish_upload, the orphaned chunks remain in object storage. A background job finds chunks with ref_count=0 that are older than 24 hours and deletes them from storage and the chunks table.
  • Clock skew: client_modified timestamps come from the client device clock, which may be wrong. The server always uses server_modified for ordering decisions. client_modified is metadata only, never used for conflict detection.
  • Quota enforcement: Quota is checked before accepting a finish_upload call. The check compares (current used_bytes + new_file_size – old_file_size) against quota_bytes. If over quota, the upload is rejected with an insufficient-storage error (HTTP 507 is the conventional status). used_bytes is updated atomically using a database transaction.
  • Hash collision (theoretical): On finish_upload, the server verifies that each chunk's stored size matches the declared size. A length mismatch alongside a hash match indicates a problem; the server rejects the upload and forces a fresh upload without deduplication for those chunks.
  • Large folder trees: Fetching the full folder tree on initial sync can be slow for users with hundreds of thousands of files. Use pagination on the tree endpoint and build the local index incrementally.
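The quota arithmetic from the bullet above, as a sketch; in production this check and the used_bytes update run inside the same transaction that commits finish_upload (names are illustrative):

```python
def charge_quota(user, new_file_size, old_file_size=0):
    """Admit the upload only if the projected usage fits within quota."""
    projected = user["used_bytes"] + new_file_size - old_file_size
    if projected > user["quota_bytes"]:
        return False            # reject: client sees an over-quota error
    user["used_bytes"] = projected
    return True
```

Subtracting old_file_size matters: replacing a 2 GB file with a 2.1 GB version should only charge the 100 MB delta against the quota, not the full new size.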

Scalability Considerations

  • Object storage: All chunk data lives in S3 or equivalent. The database stores only metadata. S3 scales to petabytes without operational overhead.
  • Presigned URLs: Rather than proxying chunk uploads through the application server, issue S3 presigned PUT URLs directly to the client. This removes the application tier from the data path entirely, dramatically reducing server bandwidth costs and latency.
  • Event table partitioning: The sync_events table grows without bound. Partition by user_id range or by time. Archive events older than 90 days to cold storage; devices that have been offline longer than 90 days must perform a full re-sync rather than a delta sync.
  • Metadata database sharding: Shard by user_id. All files, chunk references, and events for a user live on the same shard, keeping cross-shard joins unnecessary for the common case.
  • CDN for downloads: Serve chunk downloads through a CDN. Chunks are immutable (content-addressed), so they cache perfectly at edge nodes. Cache-Control: immutable, max-age=31536000 is correct and safe for all chunks.

Summary

A file sync service is a distributed content-addressed storage system layered with a per-user event log. Content-defined chunking and global chunk deduplication make the storage layer efficient; the start_upload handshake makes transfers bandwidth-efficient; the device cursor model makes sync resumable and correct across reconnects. Conflict resolution via conflict copies is conservative but safe — it never loses data. Scaling is achieved by keeping chunk data in object storage (accessed via presigned URLs) and sharding metadata by user, making each shard an independently scalable unit.
