Question 1

When should you store full snapshots versus only deltas for document versions?

Accepted Answer

Delta-only storage minimizes disk usage but makes reconstruction expensive: reading version N requires applying N-1 patches from the beginning. For a document with 500 versions, that is 499 patch applications per read — unacceptable for interactive use. Snapshot-only storage makes reads O(1) but stores the full document on every save — wasteful for large documents with small edits. The hybrid approach (snapshot every N versions, deltas in between) bounds reconstruction to at most N-1 delta applications. Choose N based on typical document size and edit size: if documents are 10KB and average edits change 100 bytes (1%), storing a snapshot every 10 versions means 9 patches of ~100 bytes each vs. one 10KB snapshot — roughly 900 bytes vs. 10KB per 10 versions, a 10× storage saving. For large documents (1MB+) where even 1% edits are substantial, consider snapshot every 5 versions.

Question 2

How do you implement a visual diff (side-by-side comparison) between two versions?

Accepted Answer

Visual diff requires: (1) reconstruct both versions' full content; (2) compute a line-by-line or word-level diff; (3) render with unchanged, added, and removed sections highlighted. The unified diff format (difflib.unified_diff) produces hunks of context + removed lines (-) + added lines (+), which map naturally to a side-by-side view. For richer word-level diffing within changed lines: use SequenceMatcher on the tokenized words of each changed line. Implementation: for each line in a changed hunk, compare tokens between the old and new version to highlight individual word additions/deletions within the line change. Libraries: Python difflib (built-in), js diff (JavaScript), google-diff-match-patch (word/character level). For very large documents: compute diff lazily — only diff the visible portion of the comparison, loading more diffs as the user scrolls.

Question 3

How do you handle concurrent edits to the same document?

Accepted Answer

Two users editing the same document simultaneously create a conflict: both save from the same parent version, and both consider their version "current." Resolution approaches: (1) Last-Write-Wins: the second save overwrites the first. Simplest; first writer's changes are silently lost. Acceptable for documents with infrequent concurrent edits. (2) Fork-and-merge: both saves create new versions branching from the common parent. The system attempts a three-way merge (parent + change A + change B). If the changes affect different sections, the merge succeeds automatically. If they overlap, a merge conflict is raised and the user must resolve it manually. (3) Operational Transform / CRDT: real-time collaboration where every keystroke is transmitted and applied immediately — Google Docs model. The most complex but eliminates merge conflicts by construction. For most document systems without real-time collaboration, fork-and-merge with automatic merge is the right balance.

Question 4

How do you implement a rollback to a previous version?

Accepted Answer

Rollback creates a new version whose content equals a prior version's content — it does not delete the intermediate versions. Never delete versions to "roll back"; always append. Implementation: get_version(doc_id, target_version_number) reconstructs the content, then save_version(doc_id, reconstructed_content, author_id, "Rollback to v{target}") creates a new version at the head. The version history remains intact: v1, v2, v3, v4 (rollback to v2). An audit viewer can see the full history including the rollback event. Store rollback_to_version in the commit message or in a metadata column. This approach is also safer for compliance: audit logs must not be modified, and deleting intermediate versions to clean up would look like tampering.

Question 5

How do you handle very large documents (10MB+) efficiently in the versioning system?

Accepted Answer

A 10MB document stored as a snapshot on every 10th version uses 10MB × (total_versions/10) of storage — a 1,000-version document accumulates 1GB in snapshots alone. Optimizations: (1) Content-addressable storage: instead of storing the full document in DocumentVersion, store a hash and keep unique content blobs in an S3-backed content store. Two versions with identical content share the same blob (deduplication). (2) Chunk-level deltas: split the document into fixed-size chunks (e.g., 4KB). Only store deltas for changed chunks. A 10MB document with a 100-byte change generates a delta of one 4KB chunk, not a diff of the full 10MB. (3) Compression: compress snapshot content with zlib or zstd before storage (typical text compression ratio: 5–10×). Combined: a 10MB document compresses to ~1MB, and 1,000 versions with 1% average change rate uses ~1MB (snapshots) + ~100KB (deltas) = ~1.1MB rather than 1GB.

Document Versioning System Low-Level Design: Snapshot-Delta Storage, Efficient Reconstruction, and Integrity Checks

Document Versioning System: Low-Level Design

Core Data Model

Write: Snapshot-with-Delta Storage

Read: Version Reconstruction

Key Design Decisions