Document Versioning System Low-Level Design: Snapshot-Delta Storage, Efficient Reconstruction, and Integrity Checks

Document Versioning System: Low-Level Design

A document versioning system preserves every historical state of a document so users can view, compare, and restore previous versions. The design challenge is balancing storage efficiency (storing full snapshots vs. deltas) against read performance (reconstructing a version at any point in time). This design covers the snapshot-with-delta storage model, three-way merge for concurrent edits, and an efficient diffing strategy.

Core Data Model

CREATE TABLE Document (
    doc_id         BIGSERIAL PRIMARY KEY,
    title          VARCHAR(500) NOT NULL,
    owner_id       BIGINT NOT NULL,
    current_version_id BIGINT,          -- FK set after first version insert
    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE DocumentVersion (
    version_id     BIGSERIAL PRIMARY KEY,
    doc_id         BIGINT NOT NULL REFERENCES Document(doc_id) ON DELETE CASCADE,
    version_number INT NOT NULL,        -- 1, 2, 3 ... monotonically increasing
    author_id      BIGINT NOT NULL,
    commit_message VARCHAR(500),
    content_type   VARCHAR(20) NOT NULL DEFAULT 'snapshot',  -- snapshot, delta
    content        TEXT,               -- full content if snapshot; NULL if delta stored separately
    parent_version_id BIGINT REFERENCES DocumentVersion(version_id),
    checksum       VARCHAR(64) NOT NULL,  -- SHA-256 of full content (for integrity)
    size_bytes     INT NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (doc_id, version_number)
);

CREATE TABLE VersionDelta (
    delta_id       BIGSERIAL PRIMARY KEY,
    version_id     BIGINT NOT NULL REFERENCES DocumentVersion(version_id) ON DELETE CASCADE,
    patch          TEXT NOT NULL,       -- unified diff format patch
    UNIQUE (version_id)
);

CREATE INDEX ON DocumentVersion(doc_id, version_number DESC);
CREATE INDEX ON DocumentVersion(doc_id, content_type);

Write: Snapshot-with-Delta Storage

import hashlib, difflib
from typing import Optional

SNAPSHOT_INTERVAL = 10  # store full snapshot every N versions

def save_version(doc_id: int, new_content: str, author_id: int,
                 commit_message: str = None) -> int:
    """
    Save a new version. Every Nth version is a snapshot; others are stored as deltas.
    Returns the new version_id.
    """
    latest = db.fetchone("""
        SELECT version_id, version_number, content, content_type
        FROM DocumentVersion
        WHERE doc_id=%s ORDER BY version_number DESC LIMIT 1
    """, (doc_id,))

    new_version_number = (latest['version_number'] + 1) if latest else 1
    checksum = hashlib.sha256(new_content.encode()).hexdigest()
    size_bytes = len(new_content.encode('utf-8'))

    # Store snapshot every SNAPSHOT_INTERVAL versions, or for the first version
    is_snapshot = (new_version_number == 1 or new_version_number % SNAPSHOT_INTERVAL == 0)

    if is_snapshot:
        version_id = db.fetchone("""
            INSERT INTO DocumentVersion
                (doc_id, version_number, author_id, commit_message,
                 content_type, content, parent_version_id, checksum, size_bytes)
            VALUES (%s,%s,%s,%s,'snapshot',%s,%s,%s,%s)
            RETURNING version_id
        """, (doc_id, new_version_number, author_id, commit_message,
              new_content, latest['version_id'] if latest else None,
              checksum, size_bytes))['version_id']
    else:
        # Reconstruct current content for diffing
        current_content = _reconstruct_version(doc_id, latest['version_id'])
        patch = _make_patch(current_content, new_content)

        version_id = db.fetchone("""
            INSERT INTO DocumentVersion
                (doc_id, version_number, author_id, commit_message,
                 content_type, content, parent_version_id, checksum, size_bytes)
            VALUES (%s,%s,%s,%s,'delta',NULL,%s,%s,%s)
            RETURNING version_id
        """, (doc_id, new_version_number, author_id, commit_message,
              latest['version_id'], checksum, size_bytes))['version_id']

        db.execute("""
            INSERT INTO VersionDelta (version_id, patch) VALUES (%s, %s)
        """, (version_id, patch))

    # Update document's current version pointer
    db.execute("""
        UPDATE Document SET current_version_id=%s, updated_at=NOW()
        WHERE doc_id=%s
    """, (version_id, doc_id))

    return version_id

def _make_patch(old: str, new: str) -> str:
    old_lines = old.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    return ''.join(difflib.unified_diff(old_lines, new_lines, lineterm=''))

def _apply_patch(base: str, patch: str) -> str:
    """Apply a unified diff patch to base content."""
    import patch as patch_lib  # python-patch library
    p = patch_lib.fromstring(patch.encode())
    lines = base.splitlines(keepends=True)
    patched = p.apply(lines)
    return ''.join(patched)

Read: Version Reconstruction

def _reconstruct_version(doc_id: int, target_version_id: int) -> str:
    """
    Reconstruct the content of any version efficiently.
    Finds the nearest snapshot ancestor, then applies deltas forward.
    """
    # Walk back through parent chain to find nearest snapshot
    chain = []  # [(version_id, content_type, patch or content), ...]
    vid = target_version_id
    snapshot_content = None

    while vid:
        row = db.fetchone("""
            SELECT v.version_id, v.content_type, v.content, v.parent_version_id,
                   d.patch
            FROM DocumentVersion v
            LEFT JOIN VersionDelta d ON d.version_id=v.version_id
            WHERE v.version_id=%s
        """, (vid,))
        if not row:
            break
        if row['content_type'] == 'snapshot':
            snapshot_content = row['content']
            break
        chain.append(row)
        vid = row['parent_version_id']

    if snapshot_content is None:
        raise VersionReconstructError(f"No snapshot ancestor found for version {target_version_id}")

    # Apply deltas in order (chain is in reverse order — reverse it)
    content = snapshot_content
    for row in reversed(chain):
        content = _apply_patch(content, row['patch'])

    return content

def get_version(doc_id: int, version_number: int) -> dict:
    version = db.fetchone("""
        SELECT version_id, version_number, author_id, commit_message, created_at, checksum
        FROM DocumentVersion WHERE doc_id=%s AND version_number=%s
    """, (doc_id, version_number))
    if not version:
        raise VersionNotFoundError(f"Version {version_number} not found")
    content = _reconstruct_version(doc_id, version['version_id'])
    # Verify integrity
    actual_checksum = hashlib.sha256(content.encode()).hexdigest()
    if actual_checksum != version['checksum']:
        raise IntegrityError(f"Version {version_number} content checksum mismatch")
    version['content'] = content
    return version

Key Design Decisions

Snapshot every 10 versions: with only deltas, reconstructing version N requires replaying all N-1 patches from the first snapshot — O(N) reconstruction time. A snapshot every 10 versions caps reconstruction at 9 delta applications (worst case). The storage cost: full snapshots add ~10× the average delta size every 10 versions, typically a 30–50% total storage increase over delta-only — a worthwhile trade for O(1) bounded reconstruction.
SHA-256 checksum for integrity: every version stores a checksum of the full reconstructed content. On read, verify the reconstructed content matches the checksum — this detects both storage corruption and bugs in the patch apply logic.
Unified diff format: standard unified diff (difflib.unified_diff) is human-readable, compact, and supported by many libraries. For binary or structured documents (JSON, XML), use a type-specific differ (e.g., jsondiff for JSON) that produces semantic patches rather than line-level patches.

{“@context”:”https://schema.org”,”@type”:”FAQPage”,”mainEntity”:[{“@type”:”Question”,”name”:”When should you store full snapshots versus only deltas for document versions?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Delta-only storage minimizes disk usage but makes reconstruction expensive: reading version N requires applying N-1 patches from the beginning. For a document with 500 versions, that is 499 patch applications per read — unacceptable for interactive use. Snapshot-only storage makes reads O(1) but stores the full document on every save — wasteful for large documents with small edits. The hybrid approach (snapshot every N versions, deltas in between) bounds reconstruction to at most N-1 delta applications. Choose N based on typical document size and edit size: if documents are 10KB and average edits change 100 bytes (1%), storing a snapshot every 10 versions means 9 patches of ~100 bytes each vs. one 10KB snapshot — roughly 900 bytes vs. 10KB per 10 versions, a 10× storage saving. For large documents (1MB+) where even 1% edits are substantial, consider snapshot every 5 versions.”}},{“@type”:”Question”,”name”:”How do you implement a visual diff (side-by-side comparison) between two versions?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Visual diff requires: (1) reconstruct both versions’ full content; (2) compute a line-by-line or word-level diff; (3) render with unchanged, added, and removed sections highlighted. The unified diff format (difflib.unified_diff) produces hunks of context + removed lines (-) + added lines (+), which map naturally to a side-by-side view. For richer word-level diffing within changed lines: use SequenceMatcher on the tokenized words of each changed line. Implementation: for each line in a changed hunk, compare tokens between the old and new version to highlight individual word additions/deletions within the line change. Libraries: Python difflib (built-in), js diff (JavaScript), google-diff-match-patch (word/character level). For very large documents: compute diff lazily — only diff the visible portion of the comparison, loading more diffs as the user scrolls.”}},{“@type”:”Question”,”name”:”How do you handle concurrent edits to the same document?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Two users editing the same document simultaneously create a conflict: both save from the same parent version, and both consider their version "current." Resolution approaches: (1) Last-Write-Wins: the second save overwrites the first. Simplest; first writer’s changes are silently lost. Acceptable for documents with infrequent concurrent edits. (2) Fork-and-merge: both saves create new versions branching from the common parent. The system attempts a three-way merge (parent + change A + change B). If the changes affect different sections, the merge succeeds automatically. If they overlap, a merge conflict is raised and the user must resolve it manually. (3) Operational Transform / CRDT: real-time collaboration where every keystroke is transmitted and applied immediately — Google Docs model. The most complex but eliminates merge conflicts by construction. For most document systems without real-time collaboration, fork-and-merge with automatic merge is the right balance.”}},{“@type”:”Question”,”name”:”How do you implement a rollback to a previous version?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Rollback creates a new version whose content equals a prior version’s content — it does not delete the intermediate versions. Never delete versions to "roll back"; always append. Implementation: get_version(doc_id, target_version_number) reconstructs the content, then save_version(doc_id, reconstructed_content, author_id, "Rollback to v{target}") creates a new version at the head. The version history remains intact: v1, v2, v3, v4 (rollback to v2). An audit viewer can see the full history including the rollback event. Store rollback_to_version in the commit message or in a metadata column. This approach is also safer for compliance: audit logs must not be modified, and deleting intermediate versions to clean up would look like tampering.”}},{“@type”:”Question”,”name”:”How do you handle very large documents (10MB+) efficiently in the versioning system?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”A 10MB document stored as a snapshot on every 10th version uses 10MB × (total_versions/10) of storage — a 1,000-version document accumulates 1GB in snapshots alone. Optimizations: (1) Content-addressable storage: instead of storing the full document in DocumentVersion, store a hash and keep unique content blobs in an S3-backed content store. Two versions with identical content share the same blob (deduplication). (2) Chunk-level deltas: split the document into fixed-size chunks (e.g., 4KB). Only store deltas for changed chunks. A 10MB document with a 100-byte change generates a delta of one 4KB chunk, not a diff of the full 10MB. (3) Compression: compress snapshot content with zlib or zstd before storage (typical text compression ratio: 5–10×). Combined: a 10MB document compresses to ~1MB, and 1,000 versions with 1% average change rate uses ~1MB (snapshots) + ~100KB (deltas) = ~1.1MB rather than 1GB.”}}]}

Document versioning and revision history system design is discussed in Google system design interview questions.

Document versioning and collaborative editing design is covered in Atlassian system design interview preparation.

Document versioning and file history system design is discussed in Amazon system design interview guide.