Content versioning is a foundational capability for any system that manages documents, configuration, or structured data that must be auditable and recoverable. Whether you are building a wiki, a CMS, a database migration tool, or a collaborative editing platform, a versioning service solves the same core problems: how to store revision history efficiently, how to compute diffs, and how to restore any prior state. This post covers the full design.
Requirements
Functional Requirements
- Record every change to a document as an immutable revision
- Store the diff between consecutive revisions rather than full snapshots for most revisions
- Allow comparison between any two arbitrary revisions
- Restore a document to any prior revision (point-in-time restore)
- Return the full revision history list with author, timestamp, and change summary
- Support large documents up to 10MB
Non-Functional Requirements
- Write latency under 100ms for creating a new revision
- Restore latency under 500ms for any revision within the last 30 days
- Storage cost must grow sublinearly with revision count for typical editing patterns
Data Model
The revisions table is append-only. Columns: revision_id (UUID), doc_id, revision_number (monotonically increasing per doc), author_id, created_at, change_summary (short text), storage_type (snapshot or diff), storage_ref (pointer to S3 object or inline blob for small diffs). No row is ever updated or deleted during normal operation.
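The schema above can be sketched as a single append-only table. This is a minimal SQLite illustration with simplified column types (UUIDs and timestamps as TEXT), not production DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE revisions (
        revision_id     TEXT PRIMARY KEY,     -- UUID
        doc_id          TEXT NOT NULL,
        revision_number INTEGER NOT NULL,     -- monotonically increasing per doc
        author_id       TEXT NOT NULL,
        created_at      TEXT NOT NULL,
        change_summary  TEXT,
        storage_type    TEXT NOT NULL
                        CHECK (storage_type IN ('snapshot', 'diff')),
        storage_ref     TEXT NOT NULL,        -- S3 key, or inline blob for small diffs
        UNIQUE (doc_id, revision_number)      -- at most one row per revision number
    )
""")
```

The UNIQUE (doc_id, revision_number) constraint is what makes "append-only" enforceable: a second writer trying to claim the same revision number fails at insert time.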
The snapshot policy controls how often full snapshots are stored versus diffs. A configurable snapshot_interval (default: every 50 revisions) ensures that restoring any revision requires applying at most 49 diffs to the nearest prior snapshot. This bounds restore time regardless of total revision count.
Large content blobs are stored in object storage (S3). The storage_ref column holds the S3 key. Small diffs (under 4KB) are stored inline in the database. The content is compressed (zstd) before storage.
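The blob-placement decision can be sketched as follows. Here zlib stands in for zstd (which is a third-party dependency), and upload_to_s3 is a hypothetical caller-supplied function:

```python
import zlib

INLINE_LIMIT = 4 * 1024  # diffs under 4KB live inline in the database row

def store_blob(payload: bytes, upload_to_s3):
    """Compress, then choose inline vs. object storage.

    upload_to_s3 is a hypothetical callable that puts the compressed
    bytes in S3 and returns the object key; the real service would
    compress with zstd rather than zlib.
    """
    compressed = zlib.compress(payload)
    if len(compressed) < INLINE_LIMIT:
        return ("inline", compressed)          # store in storage_ref directly
    return ("s3", upload_to_s3(compressed))    # store the S3 key in storage_ref
```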
Core Algorithms
Diff Storage
For text documents, diffs are computed with the Myers diff algorithm, which produces a minimal edit script (a list of insert and delete operations). The diff is serialized in a standard patch format and compressed. For structured content such as JSON documents, use JSON Patch (RFC 6902), which records the exact field-level operations applied; opaque binary content diffs poorly and is better served by full snapshots or a dedicated binary delta format.
At write time: fetch the previous revision content, compute the diff, store the diff. If the diff size exceeds 80% of the full content size, store a full snapshot instead. This prevents pathological cases (large rewrites) from creating diffs larger than the content itself.
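The write-path decision can be sketched with Python's standard difflib (whose matcher is Ratcliff/Obershelp-style rather than Myers, but serves the same purpose here); names are illustrative:

```python
import difflib

SNAPSHOT_FALLBACK_RATIO = 0.8  # the 80% rule described above

def encode_revision(prev_content: str, new_content: str):
    """Return ('diff', patch) normally, or ('snapshot', content) for large rewrites."""
    patch = "".join(difflib.unified_diff(
        prev_content.splitlines(keepends=True),
        new_content.splitlines(keepends=True),
        fromfile="prev", tofile="new",
    ))
    # A near-total rewrite produces a patch bigger than the content itself;
    # fall back to a full snapshot in that case.
    if len(patch) > SNAPSHOT_FALLBACK_RATIO * len(new_content):
        return ("snapshot", new_content)
    return ("diff", patch)
```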
Point-in-Time Restore
To restore revision R: find the nearest snapshot at or before R (call it S). Fetch the snapshot content from S3. Fetch all diffs from revision S+1 to R in order. Apply each diff sequentially. Return the resulting content. The snapshot interval of 50 ensures at most 49 diff applications. Each diff application is O(content_size), so total restore complexity is O(50 * content_size) in the worst case.
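A runnable sketch of the replay loop. Here a delta is represented as SequenceMatcher opcodes plus the inserted text, a stand-in for the serialized patch format described above:

```python
import difflib

def make_delta(old: str, new: str):
    """Encode new relative to old as (tag, old_start, old_end, inserted_text) tuples."""
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [(tag, i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in sm.get_opcodes()]

def apply_delta(old: str, delta) -> str:
    out = []
    for tag, i1, i2, inserted in delta:
        if tag == "equal":
            out.append(old[i1:i2])        # copy unchanged span from the old content
        elif tag in ("replace", "insert"):
            out.append(inserted)          # take the new text
        # 'delete' contributes nothing to the output
    return "".join(out)

def restore(snapshot: str, deltas) -> str:
    """Replay diffs S+1..R in order on the nearest prior snapshot S."""
    content = snapshot
    for delta in deltas:                  # at most snapshot_interval - 1 iterations
        content = apply_delta(content, delta)
    return content
```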
Arbitrary Revision Comparison
To compare revisions A and B: reconstruct full content at A, reconstruct full content at B, compute a diff between A and B for display. Reconstruction uses the restore algorithm above. Both reconstructions can run in parallel. The display diff is computed on the fly and not stored.
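Because the two reconstructions are independent, they can run concurrently. A sketch, assuming a reconstruct(revision_id) helper that implements the restore algorithm above:

```python
import difflib
from concurrent.futures import ThreadPoolExecutor

def compare(reconstruct, rev_a, rev_b) -> str:
    """Reconstruct both revisions in parallel, then diff on the fly for display."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(reconstruct, rev_a)
        future_b = pool.submit(reconstruct, rev_b)
        content_a, content_b = future_a.result(), future_b.result()
    # The display diff is computed per request and never stored.
    return "".join(difflib.unified_diff(
        content_a.splitlines(keepends=True),
        content_b.splitlines(keepends=True),
        fromfile=f"revision {rev_a}", tofile=f"revision {rev_b}",
    ))
```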
API Design
- POST /docs/{doc_id}/revisions — Create a new revision; body contains full new content and optional change_summary; service computes diff internally
- GET /docs/{doc_id}/revisions — List revisions with metadata (no content); supports pagination
- GET /docs/{doc_id}/revisions/{revision_id} — Return full reconstructed content for a specific revision
- POST /docs/{doc_id}/restore — Restore document to a specified revision_id; creates a new revision (does not overwrite history)
- GET /docs/{doc_id}/diff?from={rev_a}&to={rev_b} — Return a human-readable diff between two revisions
Scalability
Revision writes are cheap: a database insert plus an S3 put. The hot path is reading recent revisions. Recent revision content can be cached in Redis with a TTL of 10 minutes, keyed by revision_id. Cache hit rates are high because most reads are for the latest revision or a recent comparison.
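The caching behavior is easy to sketch with an in-process stand-in for Redis (a real deployment would use Redis SETEX with a 600-second TTL):

```python
import time

class RevisionCache:
    """In-process stand-in for the Redis cache: revision_id -> content, 10-minute TTL."""

    def __init__(self, ttl_seconds: float = 600.0):
        self.ttl = ttl_seconds
        self._store = {}  # revision_id -> (expires_at, content)

    def get(self, revision_id):
        entry = self._store.get(revision_id)
        if entry is None:
            return None
        expires_at, content = entry
        if time.monotonic() >= expires_at:
            del self._store[revision_id]   # lazily evict expired entries on read
            return None
        return content

    def put(self, revision_id, content):
        self._store[revision_id] = (time.monotonic() + self.ttl, content)
```

Keying by revision_id (immutable content) means entries never need invalidation, only expiry, which is what makes a plain TTL sufficient here.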
Old revisions are migrated to cheaper S3 storage tiers after 90 days. The revisions table is partitioned by doc_id for efficient per-document queries. A background compaction job can optionally collapse old diff chains into periodic snapshots to improve cold restore performance.
Interview Talking Points
Key design decisions to discuss: why append-only storage is critical for auditability and simplicity, how the snapshot interval bounds restore time and storage cost, why you store diffs rather than snapshots for most revisions, and when you fall back to a full snapshot (large rewrites). Interviewers often ask how you handle concurrent edits — the answer is optimistic locking using revision_number as an expected version in the write request.
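The concurrent-edit answer can be sketched as a compare-and-insert; latest_revision_number and insert_revision are hypothetical data-access helpers over the revisions table described earlier:

```python
class RevisionConflict(Exception):
    """Raised when another writer committed a revision first."""

def create_revision(db, doc_id: str, expected_revision_number: int, new_content: str):
    """Optimistic locking: the client sends the revision_number it last read.

    db.latest_revision_number and db.insert_revision are hypothetical
    helpers; if another writer got there first, reject rather than
    silently overwrite.
    """
    current = db.latest_revision_number(doc_id)
    if current != expected_revision_number:
        raise RevisionConflict(
            f"expected revision {expected_revision_number}, found {current}")
    return db.insert_revision(doc_id, current + 1, new_content)
```

Note the check-then-insert is not atomic by itself; in production the uniqueness constraint on (doc_id, revision_number) in the append-only table is what actually rejects the losing writer.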