File Version Control System — Low-Level Design
A file version control system tracks the history of changes to files, supports restoring previous versions, and (in fuller designs) branching. Think of a simplified Google Drive “version history” or Dropbox file versions. This design comes up in interviews at Dropbox, Google, and document-collaboration platforms.
Core Data Model
File

```sql
id                 BIGSERIAL PK
owner_id           BIGINT NOT NULL
name               TEXT NOT NULL
current_version_id BIGINT        -- FK to FileVersion (denormalized pointer to the latest version)
deleted_at         TIMESTAMPTZ   -- soft delete
created_at         TIMESTAMPTZ
```

FileVersion

```sql
id              BIGSERIAL PK
file_id         BIGINT NOT NULL      -- FK to File
version_number  INT NOT NULL         -- monotonically increasing per file
author_id       BIGINT NOT NULL
s3_key          TEXT NOT NULL        -- where the content lives in S3
content_hash    TEXT NOT NULL        -- SHA-256 of file content
size_bytes      BIGINT NOT NULL
commit_message  TEXT                 -- optional description
created_at      TIMESTAMPTZ NOT NULL
UNIQUE (file_id, version_number)

-- Index for the version history query
CREATE INDEX idx_versions_file ON FileVersion (file_id, version_number DESC);
```
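The schema above can be exercised end to end with an in-memory SQLite stand-in (SQLite types approximate the Postgres ones; the point is the `UNIQUE (file_id, version_number)` invariant, which is what catches concurrent version-number races). Column and table names follow the model above; the inserted values are made up for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for the Postgres schema above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE File (
    id                 INTEGER PRIMARY KEY,
    owner_id           INTEGER NOT NULL,
    name               TEXT NOT NULL,
    current_version_id INTEGER,
    deleted_at         TEXT,
    created_at         TEXT
);
CREATE TABLE FileVersion (
    id             INTEGER PRIMARY KEY,
    file_id        INTEGER NOT NULL REFERENCES File(id),
    version_number INTEGER NOT NULL,
    author_id      INTEGER NOT NULL,
    s3_key         TEXT NOT NULL,
    content_hash   TEXT NOT NULL,
    size_bytes     INTEGER NOT NULL,
    commit_message TEXT,
    created_at     TEXT NOT NULL,
    UNIQUE (file_id, version_number)
);
CREATE INDEX idx_versions_file ON FileVersion (file_id, version_number DESC);
""")

conn.execute("INSERT INTO File (id, owner_id, name) VALUES (1, 10, 'spec.md')")
conn.execute("""INSERT INTO FileVersion
    (file_id, version_number, author_id, s3_key, content_hash, size_bytes, created_at)
    VALUES (1, 1, 10, 'versions/1/1/abcd1234', 'abcd...', 42, '2024-01-01')""")

# A second insert with the same (file_id, version_number) violates the
# constraint — this is what makes a concurrent "MAX(version_number) + 1"
# race detectable rather than silently corrupting history.
try:
    conn.execute("""INSERT INTO FileVersion
        (file_id, version_number, author_id, s3_key, content_hash, size_bytes, created_at)
        VALUES (1, 1, 11, 'versions/1/1/ffff0000', 'ffff...', 50, '2024-01-02')""")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The losing writer gets an `IntegrityError` and can retry with a fresh `MAX(version_number) + 1` read.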
Creating a New Version
```python
def upload_new_version(file_id, author_id, file_bytes, commit_message=None):
    content_hash = sha256(file_bytes).hexdigest()

    # Dedup check: if the content is byte-identical to an existing version,
    # skip the upload entirely (the hash comparison avoids comparing bytes).
    existing = db.execute("""
        SELECT id FROM FileVersion
        WHERE file_id=%(fid)s AND content_hash=%(hash)s
        ORDER BY version_number DESC LIMIT 1
    """, {'fid': file_id, 'hash': content_hash}).first()
    if existing:
        # Content identical to a previous version — no-op
        return {'version_id': existing.id, 'created': False}

    with db.transaction():
        # Assign the next version number. Two concurrent uploads can both read
        # the same MAX; the UNIQUE (file_id, version_number) constraint makes
        # the loser fail, and the caller can retry.
        next_ver = db.execute("""
            SELECT COALESCE(MAX(version_number), 0) + 1
            FROM FileVersion WHERE file_id=%(fid)s
        """, {'fid': file_id}).scalar()

        # Upload to S3 before the DB insert; if the transaction rolls back,
        # the orphaned object is cheap and can be garbage-collected later.
        s3_key = f'versions/{file_id}/{next_ver}/{content_hash[:8]}'
        s3.put_object(Bucket='files-bucket', Key=s3_key, Body=file_bytes)

        version = db.insert(FileVersion, {
            'file_id': file_id,
            'version_number': next_ver,
            'author_id': author_id,
            's3_key': s3_key,
            'content_hash': content_hash,
            'size_bytes': len(file_bytes),
            'commit_message': commit_message,
        })

        # Update the file's denormalized current-version pointer
        db.execute("""
            UPDATE File SET current_version_id=%(vid)s
            WHERE id=%(fid)s
        """, {'vid': version.id, 'fid': file_id})

    return {'version_id': version.id, 'version_number': next_ver, 'created': True}
```
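The dedup-then-append flow can be sketched without a database or S3 — a minimal in-memory version list is enough to show that identical bytes hash to the same SHA-256 and produce a no-op (the `upload` helper and its dict layout are illustrative, not the production API):

```python
import hashlib

def upload(versions, file_bytes, author_id):
    """In-memory sketch of the dedup-then-append flow (no DB, no S3)."""
    content_hash = hashlib.sha256(file_bytes).hexdigest()
    for v in reversed(versions):  # newest first, mirroring the SQL ORDER BY
        if v['content_hash'] == content_hash:
            return {'version_number': v['version_number'], 'created': False}
    versions.append({
        'version_number': len(versions) + 1,
        'author_id': author_id,
        'content_hash': content_hash,
        'size_bytes': len(file_bytes),
    })
    return {'version_number': len(versions), 'created': True}

versions = []
upload(versions, b'draft 1', author_id=10)           # version 1, created
upload(versions, b'draft 2', author_id=10)           # version 2, created
result = upload(versions, b'draft 2', author_id=11)  # identical bytes -> no-op
```

The third call matches version 2's hash, so no new version is created even though a different user saved the file.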
Retrieving Version History
```python
def get_version_history(file_id, limit=50, cursor=None):
    # cursor is the version_number of the last row on the previous page;
    # the clause is a fixed string, so the f-string does not open an
    # injection hole (cursor itself is still bound as a parameter).
    cursor_clause = 'AND fv.version_number < %(cursor)s' if cursor else ''
    versions = db.execute(f"""
        SELECT fv.*, u.display_name AS author_name
        FROM FileVersion fv
        JOIN "User" u ON fv.author_id = u.id
        WHERE fv.file_id=%(fid)s {cursor_clause}
        ORDER BY fv.version_number DESC
        LIMIT %(limit)s
    """, {'fid': file_id, 'limit': limit, 'cursor': cursor})
    return versions
```
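The keyset (cursor) pagination above can be mimicked in pure Python — a rough sketch, with `paginate_history` standing in for the SQL query and the cursor carrying the last `version_number` the client has seen:

```python
def paginate_history(versions, limit=2, cursor=None):
    """Keyset pagination over version_number, newest first."""
    rows = sorted(versions, key=lambda v: v['version_number'], reverse=True)
    if cursor is not None:
        # Equivalent of: AND version_number < %(cursor)s
        rows = [v for v in rows if v['version_number'] < cursor]
    page = rows[:limit]
    # A full page means there may be more; a short page means we're done.
    next_cursor = page[-1]['version_number'] if len(page) == limit else None
    return page, next_cursor

history = [{'version_number': n} for n in range(1, 6)]      # versions 1..5
page1, c1 = paginate_history(history, limit=2)              # versions 5, 4
page2, c2 = paginate_history(history, limit=2, cursor=c1)   # versions 3, 2
```

Unlike OFFSET pagination, inserting a new version between page fetches never shifts or duplicates rows, because each page is anchored to a stable `version_number`.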
Restoring a Version

```python
def restore_version(file_id, version_number, restoring_user_id):
    """Restore creates a NEW version with the old content — never mutates history."""
    old_version = db.execute("""
        SELECT * FROM FileVersion
        WHERE file_id=%(fid)s AND version_number=%(ver)s
    """, {'fid': file_id, 'ver': version_number}).first()
    if not old_version:
        raise NotFound()

    with db.transaction():
        next_ver = db.execute("""
            SELECT MAX(version_number) + 1 FROM FileVersion WHERE file_id=%(fid)s
        """, {'fid': file_id}).scalar()

        # Reuse the existing S3 object: no bytes are re-uploaded, the new
        # version row simply points at the same key.
        new_version = db.insert(FileVersion, {
            'file_id': file_id,
            'version_number': next_ver,
            'author_id': restoring_user_id,
            's3_key': old_version.s3_key,          # reuse S3 object
            'content_hash': old_version.content_hash,
            'size_bytes': old_version.size_bytes,
            'commit_message': f'Restored from version {version_number}',
        })
        db.execute("UPDATE File SET current_version_id=%(vid)s WHERE id=%(fid)s",
                   {'vid': new_version.id, 'fid': file_id})

    return {'version_id': new_version.id, 'version_number': next_ver}
```
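The append-only restore invariant is easy to demonstrate in miniature — a toy in-memory history (the `restore` helper and dict shapes are illustrative) where restoring version 1 produces version 4 and leaves versions 1–3 untouched:

```python
def restore(versions, version_number, user_id):
    """Toy append-only restore: version N+1 gets version K's content metadata."""
    old = next((v for v in versions if v['version_number'] == version_number), None)
    if old is None:
        raise KeyError(version_number)
    versions.append({
        'version_number': versions[-1]['version_number'] + 1,
        'author_id': user_id,
        'content_hash': old['content_hash'],  # same bytes, same S3 object
        'commit_message': f'Restored from version {version_number}',
    })
    return versions[-1]

history = [
    {'version_number': 1, 'author_id': 10, 'content_hash': 'aaa'},
    {'version_number': 2, 'author_id': 10, 'content_hash': 'bbb'},
    {'version_number': 3, 'author_id': 11, 'content_hash': 'ccc'},
]
restored = restore(history, 1, user_id=12)
```

After the call, the history reads: edits to versions 1–3, then a version 4 carrying version 1's content — the full audit trail survives.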
Delta Storage (Reducing S3 Costs)
For text files, store diffs instead of full content. For binary files (videos, images), deltas rarely shrink anything, so store the full content each version.

```python
import bsdiff4  # third-party binary-diff library

def store_version_delta(old_s3_key, new_content_bytes):
    old_content = s3.get_object(Bucket='files-bucket', Key=old_s3_key)['Body'].read()
    # Compute a binary delta against the previous version's content.
    delta = bsdiff4.diff(old_content, new_content_bytes)
    # If the delta is nearly as large as the full content (common for image
    # edits, where most bytes change), store the full copy instead.
    if len(delta) > len(new_content_bytes) * 0.9:
        return None, new_content_bytes  # store full
    return delta, None                  # store delta
```

FileVersion additions for delta storage:

```sql
delta_base_version_id BIGINT              -- which version to apply the delta against
storage_type          TEXT DEFAULT 'full' -- 'full' or 'delta'
```
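Reading a delta-stored version means walking back to the nearest full snapshot and applying deltas forward. A rough sketch of that chain, using a toy opcode-based text delta (via the standard library's `difflib.SequenceMatcher`, not `bsdiff4`) and an assumed anchor policy of one full snapshot every 3 versions:

```python
import difflib

def make_delta(old: str, new: str):
    """Toy text delta: a list of ('copy', i1, i2) / ('insert', text) ops."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=old, b=new).get_opcodes():
        if tag == 'equal':
            ops.append(('copy', i1, i2))
        elif tag in ('replace', 'insert'):
            ops.append(('insert', new[j1:j2]))
        # 'delete' emits nothing: the old range is simply not copied
    return ops

def apply_delta(old: str, ops):
    out = []
    for op in ops:
        out.append(old[op[1]:op[2]] if op[0] == 'copy' else op[1])
    return ''.join(out)

SNAPSHOT_EVERY = 3  # assumed anchor interval; bounds the delta chain length

def reconstruct(store, version_number):
    """Walk back to the nearest full snapshot, then apply deltas forward."""
    base = version_number
    while store[base]['type'] == 'delta':
        base -= 1
    content = store[base]['data']
    for v in range(base + 1, version_number + 1):
        content = apply_delta(content, store[v]['data'])
    return content

# Build a store: v1 full, v2-v3 deltas, v4 full (next anchor).
texts = {1: 'hello world', 2: 'hello brave world',
         3: 'hello brave new world', 4: 'goodbye world'}
store, prev = {}, None
for ver, text in sorted(texts.items()):
    if prev is None or ver % SNAPSHOT_EVERY == 1:
        store[ver] = {'type': 'full', 'data': text}
    else:
        store[ver] = {'type': 'delta', 'data': make_delta(prev, text)}
    prev = text
```

The anchor snapshots cap reconstruction cost at `SNAPSHOT_EVERY - 1` delta applications, regardless of how long the history grows.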
Key Interview Points
- Restore creates a new version, never mutates history: Immutable version history is the core invariant. Restoration means “create version N+1 with the content of version K”, not “overwrite version N”.
- Content-hash deduplication: If a user saves a file with identical content to a previous version, skip creating a new version. Content hash (SHA-256) detects this without comparing bytes.
- S3 key strategy: Never use the version number alone as the S3 key — if two files have version 1, keys collide. Use file_id/version_number/hash_prefix or file_id/uuid for uniqueness.
- Version number vs ID for cursor pagination: Use version_number (not the auto-increment ID) as the cursor for version history pagination — it’s semantically meaningful and monotonically increasing per file.
FAQ

Why should file restoration create a new version instead of reverting?
Immutable version history is the core invariant of any version control system. If restoration modifies an existing version, the history becomes untrustworthy — users cannot see what the file looked like at a given point in time. Creating a new version (N+1 with the content of version K) preserves the complete audit trail: “Alice edited to version 3, Bob reverted to version 1 content at version 4.” This keeps the history linear and append-only. The same principle applies to payment records, audit logs, and any system where historical correctness matters.

How does content-hash deduplication save storage?
Before creating a new version, compute SHA-256(file_bytes) and check whether any existing FileVersion for the same file has the same hash. If a match is found, skip creating a new S3 object — point the version record to the existing S3 key. This saves storage when users save a file without changes (e.g., hit save without editing), and again on restore, which reuses the original S3 object rather than uploading a second copy of the same bytes. Hash computation is cheap (~10 ms for a 10 MB file); S3 PUT cost ($0.005 per 1,000 requests) adds up across millions of saves.

What is the S3 key naming strategy for versioned files?
Never use just the version_number as the S3 key (e.g., file-123/v1) — two different files both have a version 1, so keys would collide. Use file_id + version_number + content-hash prefix: versions/{file_id}/{version_number}/{hash[:8]}. The file_id provides uniqueness across files, the version_number makes the key human-readable and sorted, and the hash prefix ensures that if a version row is re-created after a partial failure, it does not conflict with the previous attempt’s S3 object. For restoration, point the new version record to the old S3 key directly — no new upload needed.

How do you implement version diffing for text files?
For text files, compute a diff between adjacent versions (e.g., with the Myers diff algorithm — git’s core algorithm, available via libraries such as diff-match-patch). Store the delta in S3 instead of the full file content when the delta is smaller than the full file (typically true for incremental edits). On retrieval, start from the nearest full snapshot and apply deltas forward. Limit the delta chain length to prevent slow reconstruction — e.g., every 10 versions, store a full snapshot as an anchor point. For binary files (images, PDFs), delta compression is rarely effective; store full copies.

How do you enforce version retention limits per user plan?
Example policy: free plan keeps the last 30 versions, pro keeps the last 500, enterprise is unlimited. Store the plan’s version_limit on the Tenant or User record. After each new version is created, count the versions for the file; if the count exceeds the limit, identify the oldest versions beyond the limit (ORDER BY version_number ASC LIMIT excess_count) and schedule them for deletion. Mark them for deletion first (soft delete), verify the S3 object is not referenced by other versions (content-hash deduplication means one S3 object can back multiple FileVersion rows), then delete the S3 object and the FileVersion row.
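The retention check — keep the newest `limit` versions, but only delete an S3 object once no surviving version references it — can be sketched in memory (the `prune_versions` helper and its dict layout are illustrative, not a real API):

```python
def prune_versions(versions, limit):
    """Return (kept, s3_keys_safe_to_delete) after enforcing a retention limit.

    A key is safe to delete only if no surviving version references it:
    content-hash dedup means several versions can share one S3 object.
    """
    ordered = sorted(versions, key=lambda v: v['version_number'])
    if len(ordered) <= limit:
        return ordered, []
    kept = ordered[-limit:]
    excess = ordered[:-limit]
    surviving_keys = {v['s3_key'] for v in kept}
    deletable = [v['s3_key'] for v in excess if v['s3_key'] not in surviving_keys]
    return kept, deletable

history = [
    {'version_number': 1, 's3_key': 'versions/1/1/aaaa'},
    {'version_number': 2, 's3_key': 'versions/1/2/bbbb'},
    {'version_number': 3, 's3_key': 'versions/1/1/aaaa'},  # restore of v1: shared object
    {'version_number': 4, 's3_key': 'versions/1/4/cccc'},
]
kept, deletable = prune_versions(history, limit=2)
```

Version 1 falls outside the limit, but its S3 object survives because version 3 (a restore) still points at the same key — only version 2's object is safe to delete.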