File Version Control System Low-Level Design

File Version Control System — Low-Level Design

A file version control system tracks the history of changes to files, supports branching, and can restore previous versions. Simplified version: Google Drive “version history” or Dropbox file versions. This design is asked at Dropbox, Google, and document collaboration platforms.

Core Data Model

File
  id              BIGSERIAL PK
  owner_id        BIGINT NOT NULL
  name            TEXT NOT NULL
  current_version_id BIGINT          -- FK to FileVersion (denormalized latest)
  deleted_at      TIMESTAMPTZ
  created_at      TIMESTAMPTZ

FileVersion
  id              BIGSERIAL PK
  file_id         BIGINT FK NOT NULL
  version_number  INT NOT NULL       -- monotonically increasing per file
  author_id       BIGINT NOT NULL
  s3_key          TEXT NOT NULL      -- where the content lives in S3
  content_hash    TEXT NOT NULL      -- SHA-256 of file content
  size_bytes      BIGINT NOT NULL
  commit_message  TEXT               -- optional description
  created_at      TIMESTAMPTZ NOT NULL
  UNIQUE (file_id, version_number)

-- Index for version history query
CREATE INDEX idx_versions_file ON FileVersion(file_id, version_number DESC);

Creating a New Version

def upload_new_version(file_id, author_id, file_bytes, commit_message=None):
    content_hash = sha256(file_bytes).hexdigest()

    # Check for duplicate content
    existing = db.execute("""
        SELECT id FROM FileVersion
        WHERE file_id=%(fid)s AND content_hash=%(hash)s
        ORDER BY version_number DESC LIMIT 1
    """, {'fid': file_id, 'hash': content_hash}).first()

    if existing:
        # Content identical to a previous version — no-op or log intent without storing
        return {'version_id': existing.id, 'created': False}

    with db.transaction():
        # Assign next version number
        next_ver = db.execute("""
            SELECT COALESCE(MAX(version_number), 0) + 1
            FROM FileVersion WHERE file_id=%(fid)s
        """, {'fid': file_id}).scalar()

        s3_key = f'versions/{file_id}/{next_ver}/{content_hash[:8]}'
        s3.put_object(Bucket='files-bucket', Key=s3_key, Body=file_bytes)

        version = db.insert(FileVersion, {
            'file_id': file_id,
            'version_number': next_ver,
            'author_id': author_id,
            's3_key': s3_key,
            'content_hash': content_hash,
            'size_bytes': len(file_bytes),
            'commit_message': commit_message,
        })

        # Update file's current version pointer
        db.execute("""
            UPDATE File SET current_version_id=%(vid)s
            WHERE id=%(fid)s
        """, {'vid': version.id, 'fid': file_id})

    return {'version_id': version.id, 'version_number': next_ver, 'created': True}

Retrieving Version History

def get_version_history(file_id, limit=50, cursor=None):
    cursor_clause = 'AND version_number < %(cursor)s' if cursor else ''
    versions = db.execute(f"""
        SELECT fv.*, u.display_name as author_name
        FROM FileVersion fv
        JOIN User u ON fv.author_id = u.id
        WHERE fv.file_id=%(fid)s {cursor_clause}
        ORDER BY fv.version_number DESC
        LIMIT %(limit)s
    """, {'fid': file_id, 'limit': limit, 'cursor': cursor})
    return versions

def restore_version(file_id, version_number, restoring_user_id):
    """Restore creates a NEW version with the old content — never mutates history."""
    old_version = db.execute("""
        SELECT * FROM FileVersion
        WHERE file_id=%(fid)s AND version_number=%(ver)s
    """, {'fid': file_id, 'ver': version_number}).first()

    if not old_version:
        raise NotFound()

    # Re-use the S3 object; create new version pointing to same s3_key
    with db.transaction():
        next_ver = db.execute("""
            SELECT MAX(version_number) + 1 FROM FileVersion WHERE file_id=%(fid)s
        """, {'fid': file_id}).scalar()

        new_version = db.insert(FileVersion, {
            'file_id': file_id,
            'version_number': next_ver,
            'author_id': restoring_user_id,
            's3_key': old_version.s3_key,  # reuse S3 object
            'content_hash': old_version.content_hash,
            'size_bytes': old_version.size_bytes,
            'commit_message': f'Restored from version {version_number}',
        })
        db.execute("UPDATE File SET current_version_id=%(vid)s WHERE id=%(fid)s",
                   {'vid': new_version.id, 'fid': file_id})

Delta Storage (Reducing S3 Costs)

-- For text files: store diffs instead of full content
-- For binary files (videos, images): store full content each version

def store_version_delta(old_s3_key, new_content_bytes):
    old_content = s3.get_object(Bucket='files-bucket', Key=old_s3_key)['Body'].read()

    # Compute binary delta
    delta = bsdiff4.diff(old_content, new_content_bytes)

    # If delta is larger than full content (e.g., image edits), store full copy
    if len(delta) > len(new_content_bytes) * 0.9:
        return None, new_content_bytes  # Store full

    return delta, None  # Store delta

-- FileVersion additions for delta storage:
  delta_base_version_id BIGINT  -- which version to apply delta against
  storage_type TEXT DEFAULT 'full'  -- 'full' or 'delta'

Key Interview Points

  • Restore creates a new version, never mutates history: Immutable version history is the core invariant. Restoration means “create version N+1 with the content of version K”, not “overwrite version N”.
  • Content-hash deduplication: If a user saves a file with identical content to a previous version, skip creating a new version. Content hash (SHA-256) detects this without comparing bytes.
  • S3 key strategy: Never use the version number alone as the S3 key — if two files have version 1, keys collide. Use file_id/version_number/hash_prefix or file_id/uuid for uniqueness.
  • Version number vs ID for cursor pagination: Use version_number (not the auto-increment ID) as the cursor for version history pagination — it’s semantically meaningful and monotonically increasing per file.

{“@context”:”https://schema.org”,”@type”:”FAQPage”,”mainEntity”:[{“@type”:”Question”,”name”:”Why should file restoration create a new version instead of reverting?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Immutable version history is the core invariant of any version control system. If restoration modifies an existing version, the history becomes untrustworthy — users cannot see what the file looked like at a given point in time. Creating a new version (N+1 with the content of version K) preserves the complete audit trail: "Alice edited to version 3, Bob reverted to version 1 content at version 4." This makes the history linear and append-only. The same principle applies to payment records, audit logs, and any system where historical correctness is important.”}},{“@type”:”Question”,”name”:”How does content-hash deduplication save storage?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Before creating a new version, compute SHA-256(file_bytes) and check if any existing FileVersion for the same file has the same hash. If a match is found, skip creating a new S3 object — point the new version record to the existing S3 key. This saves storage when users save a file without changes (e.g., hit save without editing). It also saves storage when a user reverts to a previous version — restoration reuses the original S3 object rather than uploading a second copy of the same bytes. Hash computation is cheap (~10ms for a 10MB file); S3 PUT cost ($0.005/1K requests) adds up across millions of saves.”}},{“@type”:”Question”,”name”:”What is the S3 key naming strategy for versioned files?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Never use just the version_number as the S3 key (e.g., file-123/v1) — two different files both have version 1, so keys would collide. Use file_id + version_number + content_hash prefix: uploads/{file_id}/{version_number}/{hash[:8]}. The file_id provides uniqueness across files. The version_number makes the key human-readable and sorted. The hash prefix ensures that if a version row is re-created (after a partial failure), it does not conflict with the previous attempt’s S3 object. For file restoration: point the new version record to the old S3 key directly — no new upload needed.”}},{“@type”:”Question”,”name”:”How do you implement version diffing for text files?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”For text files, compute a unified diff between adjacent versions using the Myers diff algorithm (git’s core algorithm, available as the diff-match-patch library in most languages). Store the delta in S3 instead of the full file content when the delta is smaller than the full file (typically true for incremental edits). On retrieval: start from the nearest full snapshot and apply deltas forward. Limit the delta chain length to prevent slow reconstruction — every 10 versions, store a full snapshot as an anchor point. For binary files (images, PDFs), delta compression is rarely effective; store full copies.”}},{“@type”:”Question”,”name”:”How do you enforce version retention limits per user plan?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Free plan: keep last 30 versions. Pro plan: keep last 500 versions. Enterprise: unlimited. Store the plan’s version_limit on the Tenant or User record. After each new version is created, check: SELECT COUNT(*) FROM FileVersion WHERE file_id=:fid. If count > limit, identify the oldest versions beyond the limit (ORDER BY version_number ASC LIMIT excess_count) and schedule them for deletion. Mark for deletion first (soft delete), verify the S3 object is not referenced by other versions (content_hash deduplication means one S3 object can be referenced by multiple FileVersion rows), then delete S3 object and the FileVersion row.”}}]}

File version control and document history design is discussed in Google system design interview questions.

File version control and document collaboration design is covered in Atlassian system design interview preparation.

File versioning and S3 storage system design is discussed in Amazon system design interview guide.

Scroll to Top