Object Types
A version control system stores three object types in a content-addressed object store:
- Blob: raw file content. The blob has no filename — it is purely the bytes of the file.
- Tree: a directory listing. Each entry maps a filename to a blob SHA (for files) or a tree SHA (for subdirectories), plus a file mode.
- Commit: a snapshot of the root tree plus metadata: tree_sha, parent_commit_sha[] (empty for the initial commit, two for merge commits), author, committer, message, and timestamp.
This three-object model allows efficient representation: if two commits share 95% of their files unchanged, those files' blobs and their containing trees are shared — only the changed objects are new.
Content Addressing
Each object's SHA-1 is computed from its content. The object is stored at .git/objects/{sha[:2]}/{sha[2:]}. Deduplication is automatic: committing the same file twice produces the same SHA and stores only one blob. Content addressing also provides integrity checking — if an object's content does not match its SHA, it has been corrupted.
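The hashing scheme can be sketched in a few lines. Git prepends a "{type} {size}\0" header before hashing, so the sketch below reproduces real Git blob SHAs; the function names are illustrative, not Git's own:

```python
import hashlib
import zlib

def hash_object(data: bytes, obj_type: str = "blob") -> str:
    """Git-style object SHA-1: hash of "{type} {size}\\0" header + content."""
    header = f"{obj_type} {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()

def object_path(sha: str) -> str:
    """Loose-object path: the first two hex digits become the directory."""
    return f".git/objects/{sha[:2]}/{sha[2:]}"

def store_object(data: bytes, obj_type: str = "blob") -> tuple[str, bytes]:
    """Return (sha, zlib-compressed payload); writing to disk is omitted."""
    header = f"{obj_type} {len(data)}\0".encode()
    return hash_object(data, obj_type), zlib.compress(header + data)
```

Because the SHA is a pure function of the content, hashing the same bytes twice yields the same path, which is exactly why deduplication falls out for free.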
Branch Model
A branch is simply a mutable text file containing a commit SHA. .git/refs/heads/main contains the SHA of the latest commit on main. HEAD is a pointer to the current branch (or directly to a commit SHA in detached HEAD state). This makes branch creation O(1) — just write a new file with a commit SHA.
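Because a branch is just a file holding a SHA, the core ref operations are a few filesystem calls. A minimal sketch (helper names are illustrative):

```python
from pathlib import Path

def create_branch(repo: Path, name: str, commit_sha: str) -> None:
    """O(1) branch creation: write the commit SHA into a ref file."""
    ref = repo / ".git" / "refs" / "heads" / name
    ref.parent.mkdir(parents=True, exist_ok=True)
    ref.write_text(commit_sha + "\n")

def set_head(repo: Path, branch: str) -> None:
    """Attached HEAD is a symbolic ref naming the current branch."""
    (repo / ".git" / "HEAD").write_text(f"ref: refs/heads/{branch}\n")

def resolve_head(repo: Path) -> str:
    """Follow HEAD to a commit SHA (symbolic ref, or detached SHA)."""
    head = (repo / ".git" / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        return (repo / ".git" / head[5:]).read_text().strip()
    return head  # detached HEAD stores the SHA directly
```

Note that no history is copied anywhere: the commit DAG is immutable, and only these tiny pointer files move.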
Two merge strategies:
- Fast-forward: if the target branch's tip is an ancestor of the source branch's tip, simply advance the branch pointer to the source tip. No merge commit is created.
- Three-way merge: common ancestor is found; a merge commit with two parents is created.
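Choosing between the two strategies reduces to an ancestry query on the commit DAG. A sketch, assuming history is available as a commit-to-parents mapping:

```python
def is_ancestor(parents: dict[str, list[str]], a: str, b: str) -> bool:
    """True if commit `a` is an ancestor of (or equal to) commit `b`."""
    stack, seen = [b], set()
    while stack:
        c = stack.pop()
        if c == a:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return False

def choose_strategy(parents, target_tip, source_tip):
    """Fast-forward when the target tip is already in the source's history."""
    if is_ancestor(parents, target_tip, source_tip):
        return ("fast-forward", source_tip)  # just move the branch pointer
    return ("three-way", None)  # divergent history: needs a merge commit
```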
Three-Way Merge Algorithm
Given branches B and C to merge, find their common ancestor A (lowest common ancestor in the commit DAG). Compute diffs A→B and A→C. Apply both diffs to A:
- If only one branch changed a region: take that branch's change
- If both branches changed the same region identically: take the change once
- If both branches changed the same region differently: conflict — mark the region with conflict markers and require manual resolution
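The three rules above map directly onto a per-region decision function. A minimal sketch that treats one already-aligned region of text at a time (real merges first align regions with a diff):

```python
def merge_region(base: str, ours: str, theirs: str) -> str:
    """Three-way decision for a single region of a file."""
    if ours == theirs:
        return ours        # both sides agree (identical change, or no change)
    if ours == base:
        return theirs      # only their side changed this region
    if theirs == base:
        return ours        # only our side changed this region
    # Both sides changed the region differently: emit conflict markers.
    return ("<<<<<<< ours\n" + ours +
            "=======\n" + theirs +
            ">>>>>>> theirs\n")
```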
The recursive merge strategy handles criss-cross merges (where there is more than one possible common ancestor) by first merging the multiple ancestors into a virtual base. The ort strategy is an optimized reimplementation of recursive that avoids checking out the virtual base.
Pack Files and Delta Compression
As a repository accumulates history, loose objects are consolidated into pack files. Pack file construction:
- Select objects to pack (all loose objects, or objects reachable from certain refs)
- For similar objects (e.g., two versions of the same file), compute a binary delta: store only the delta rather than the full content of both
- Compress each object (or delta) in the pack with zlib — objects are deflated individually so any one can be inflated without reading the whole pack
- Write a pack index file for O(log n) object lookup by SHA
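The delta step stores a new version as copy/insert instructions against a base version. A toy sketch using difflib as the matcher (real pack deltas use a binary copy/insert encoding, but the shape is the same):

```python
import difflib

def compute_delta(base: bytes, target: bytes) -> list:
    """Encode `target` as copy/insert instructions against `base`."""
    ops = []
    sm = difflib.SequenceMatcher(a=base, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # reuse a run of base bytes
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))   # literal new bytes
        # "delete": base bytes absent from target — nothing to emit
    return ops

def apply_delta(base: bytes, ops: list) -> bytes:
    """Reconstruct the target from the base plus the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)
```

When two versions share most of their content, the "insert" payloads are tiny and the "copy" instructions are a few bytes each — that asymmetry is where the 10x savings come from.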
Delta compression can reduce repository size by 10x or more for repositories with large binary files that change incrementally.
Garbage Collection
Objects that cannot be reached from any ref (branch, tag, or reflog entry) are garbage. GC identifies these objects and deletes them after a grace period (2 weeks by default, to protect objects referenced by in-flight operations). GC also packs loose objects and consolidates multiple pack files. It runs automatically on a heuristic (after a certain number of loose objects accumulate) or explicitly via git gc.
Shallow Clone and Grafts
git clone --depth N fetches only the last N commits per branch. The server sends commits without their full ancestry. The client stores a shallow file listing commit SHAs whose parents were not fetched. These commits are treated as root commits locally. Unshallowing (deepening) fetches additional history from the server on demand.
Reflog
The reflog records every movement of HEAD and branch pointers with a timestamp and reason. It is a local safety net: if a branch pointer is accidentally reset or deleted, the previous commit SHA is still in the reflog for the grace period. git reflog shows this history; git reset --hard HEAD@{3} restores to a previous state. Reflogs are not pushed to remotes — they are local only.
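The reflog is just an append-only list of (old, new, when, why) entries, with HEAD@{n} counting back from the newest entry. A minimal in-memory sketch (class and method names are illustrative):

```python
import time

class Reflog:
    """Append-only log of ref movements; HEAD@{n} indexes from the newest."""

    def __init__(self):
        self.entries = []  # (old_sha, new_sha, timestamp, reason)

    def record(self, old_sha: str, new_sha: str, reason: str) -> None:
        """Log one pointer movement (commit, reset, checkout, ...)."""
        self.entries.append((old_sha, new_sha, time.time(), reason))

    def resolve(self, n: int) -> str:
        """HEAD@{0} is the current position, HEAD@{1} the one before, ..."""
        _, new_sha, _, _ = self.entries[-(n + 1)]
        return new_sha
```

A hard reset that "loses" a commit only moves the pointer; the old SHA survives as the new_sha of an earlier entry, which is what git reset --hard HEAD@{n} recovers.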
Hooks
Git hooks are scripts in .git/hooks/ that run at specific lifecycle events. Common hooks: pre-commit (run linters, tests before commit — non-zero exit aborts the commit), commit-msg (validate commit message format), pre-push (run tests before push — non-zero exit aborts the push), post-receive (server-side: trigger CI, update issue trackers after a push is accepted). Hooks are not version-controlled by default — teams distribute them via a setup script or a hook manager like Husky.
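Hooks are commonly shell scripts, but any executable works. A sketch of a pre-commit hook in Python — the TODO check is an arbitrary example policy, and the helper names are not part of any Git API (only the git diff --cached invocation is real):

```python
#!/usr/bin/env python3
"""Minimal pre-commit hook sketch: reject commits that stage TODO markers.
Install by saving as .git/hooks/pre-commit and marking it executable."""
import subprocess
import sys

def staged_files() -> list:
    """List added/copied/modified paths staged for this commit."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True)
    return [f for f in out.stdout.splitlines() if f]

def check(files: list) -> list:
    """Return the subset of files containing a TODO marker."""
    marker = "TO" + "DO"  # split so this script passes its own check
    bad = []
    for path in files:
        try:
            text = open(path, encoding="utf-8", errors="ignore").read()
        except OSError:
            continue
        if marker in text:
            bad.append(path)
    return bad

def main() -> int:
    offenders = check(staged_files())
    for path in offenders:
        print(f"pre-commit: {path} contains a TODO marker", file=sys.stderr)
    return 1 if offenders else 0  # non-zero exit aborts the commit

# As an installed hook, the entry point would be: sys.exit(main())
```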
Interview FAQ
Q: How would you design the object store for a version control system like Git?
A: Use a content-addressable store keyed by SHA-256 hash of the object's content. Store four object types: blobs (file contents), trees (directory listings with mode, name, and child hash), commits (tree pointer, parent commit hashes, author metadata, message), and tags. Write objects immutably to a flat-file store partitioned by the first two hex digits of the hash (e.g., .git/objects/ab/cd1234…) to limit directory fan-out. Pack loose objects periodically into packfiles with delta compression to reduce storage. A content-addressable design gives deduplication and integrity verification for free: if two files have identical content, they share one blob object.
Q: What branching model data structures allow efficient branch creation and deletion in a VCS?
A: Represent branches as mutable named pointers (refs) to a commit hash stored in a lightweight file or a key-value ref database. Because a branch is just a 40-byte SHA pointer, creation and deletion are O(1) — no data is copied. Maintain a special HEAD ref that either holds a branch name (attached HEAD) or a direct commit hash (detached HEAD). For remote refs, namespace them under refs/remotes/<remote>/. Use a packed-refs file for repos with thousands of branches to avoid per-file syscall overhead. The DAG of commit objects itself is immutable; only the ref pointers move, making branch operations cheap regardless of history depth.
Q: Walk through the three-way merge algorithm used when merging two branches.
A: Find the merge base — the most recent common ancestor commit — using a lowest-common-ancestor traversal of the commit DAG (BFS from both branch tips, intersecting visited sets). For each file, perform a three-way diff: compare base→ours and base→theirs independently. If only one side changed a region, take that change. If both sides changed the same region differently, emit a conflict marker. If both sides made the identical change, accept it without conflict. At the object level, generate a new tree object reflecting the merged file contents, then create a merge commit with two parent pointers. For renames, use similarity scoring (edit distance on content) to detect renames before diffing, preventing spurious conflicts.
Q: How do you handle concurrent pushes to the same branch in a distributed VCS server?
A: Use optimistic concurrency with a compare-and-swap on the ref. When a client pushes, it sends the current (expected) tip hash and the new tip hash. The server atomically updates the ref only if the current value matches the expected value — reject with a 'non-fast-forward' error otherwise. Implement this via a per-ref advisory lock (flock on the ref file, or a distributed lock in Redis/ZooKeeper for clustered servers) held only for the duration of the CAS write. For high-throughput hosting services, queue pushes to the same branch and process them serially to avoid lock contention. Store incoming pack objects before acquiring the lock so the critical section is just the ref pointer swap, keeping lock hold time under a millisecond.