Question 1

How does Git store data internally?

Accepted Answer

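Git stores content-addressed objects that form a DAG: (1) Blob -- file content only (no filename), identified by the SHA-1 hash of a short header plus the content. Identical files share the same blob. (2) Tree -- directory listing mapping filenames and file modes to blob/tree hashes. (3) Commit -- snapshot pointing to a tree, plus parent commit(s), author, committer, and message. (4) Tag -- an annotated tag object that names another object (usually a commit) and carries a tagger and message; a lightweight tag is just a ref. Branches are just pointers: main is a ref holding a commit hash, and an accepted push moves that ref on the server. Storage: new objects are written as zlib-compressed loose files and later packed into packfiles using delta compression, where similar objects are stored as deltas against a base. As a rough illustration, a repo with 100K commits and 10 GB of uncompressed history might pack down to around 500 MB; the actual ratio depends heavily on the content. On the hosting side, GitHub keeps each repository's packfiles on replicated storage, with three replicas per repository.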
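To make the content addressing concrete, here is a minimal sketch of how an object ID is derived, assuming the classic SHA-1 object format. The helper names `git_object_id` and `tree_body` are made up for this example, and the tree serializer only handles a flat directory of regular files (mode 100644).

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """Return the SHA-1 of '<kind> <size>' + NUL + body, which is how Git names an object."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: dict[str, str]) -> bytes:
    """Serialize {filename: blob_id} as tree entries: '<mode> <name>' + NUL + 20 raw hash bytes.

    Hypothetical helper: flat directory, regular files only, entries sorted by name bytes.
    """
    out = b""
    for name in sorted(entries, key=lambda n: n.encode()):
        out += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(entries[name])
    return out


# Two files with identical content hash to the same blob, so the blob is stored once.
readme_id = git_object_id("blob", b"hello world\n")
copy_id = git_object_id("blob", b"hello world\n")
tree_id = git_object_id("tree", tree_body({"README": readme_id, "COPY": copy_id}))

print(readme_id)
print(readme_id == copy_id)
print(tree_id)
```

The blob ID printed here should match what `git hash-object` reports for a file containing the same bytes. The filename lives only in the tree entry, which is why renaming a file does not create a new blob.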

Question 2

How do pull requests and code review work architecturally?

Accepted Answer

A PR proposes merging changes from a source branch into a target branch. The diff is computed between the merge base (the most recent common ancestor of the two branches) and the source tip, so it shows only the changes the PR introduces and not unrelated commits that have landed on the target. Code review: reviewers comment on specific lines, and each comment is anchored to a file path, line number, and commit SHA -- because the anchor includes the commit, a comment still points at the code it was written against after new commits are pushed, and it is marked outdated if those lines have since changed. Review states: approved, changes requested, commented. Branch protection can enforce: a minimum number of approvals, all required CI checks passing, no force pushes, and re-approval after new commits. Merge strategies: merge commit (preserves the branch history), squash (combines the PR into one commit), rebase (replays the PR's commits on top of the target). GitHub computes diffs server-side using Git's diff algorithms and adds syntax highlighting for the web UI.
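The merge-base idea can be sketched on a toy commit graph. This is an illustration only, not how Git or GitHub implement it: the `parents` mapping and the function names are assumptions for the example, and the approach (intersect the two ancestor sets, then drop anything reachable from another common ancestor) is chosen for clarity over speed.

```python
def ancestors(parents: dict[str, list[str]], commit: str) -> set[str]:
    """Return the commit itself plus everything reachable by following parent links."""
    seen: set[str] = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def merge_bases(parents: dict[str, list[str]], a: str, b: str) -> set[str]:
    """Return common ancestors of a and b that are not ancestors of another common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != other and c in ancestors(parents, other) for other in common)
    }


# Toy history:  A <- B <- C   (target branch tip: C)
#                    B <- D <- E   (source branch tip: E)
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}

print(merge_bases(parents, "C", "E"))
```

In this history the PR diff would be computed from B to E, so commit C on the target does not show up in the review. A criss-cross history can yield more than one merge base, which is why the function returns a set.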

Question 3

How does GitHub Actions CI/CD work?

Accepted Answer

Workflows are defined in YAML files under .github/workflows/ and trigger on events such as push, pull_request, and schedule. When a PR is opened, matching workflows run as jobs on GitHub-hosted or self-hosted runners. Typical steps: check out code, install dependencies, run tests, build, deploy. Status checks report back to the PR (pending/success/failure). Branch protection can require specific checks to pass before merging. External CI systems (Jenkins, CircleCI) integrate via the commit status API or the Checks API; the Checks API supports richer reports with optional inline code annotations. Webhooks: for each subscribed event (push, PR, comment), GitHub sends an HTTP POST to the configured URLs -- enabling Slack notifications, deployment automation, and custom bots. GitHub Apps provide fine-grained permissions and installation-scoped tokens for structured integrations.
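The interaction between status checks and branch protection can be modelled in a few lines. This is a simplified sketch, not GitHub's actual data model: the `PullRequest` class, the `can_merge` function, and the check names are all hypothetical, and a required check that has not reported yet is treated as not passing.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    approvals: int
    # check name -> "pending" | "success" | "failure" (names are illustrative)
    checks: dict[str, str] = field(default_factory=dict)


def can_merge(pr: PullRequest, required_checks: list[str], min_approvals: int = 1) -> bool:
    """Allow the merge only with enough approvals and every required check reporting success."""
    if pr.approvals < min_approvals:
        return False
    # A required check that is missing, pending, or failed blocks the merge.
    return all(pr.checks.get(name) == "success" for name in required_checks)


pr = PullRequest(approvals=2, checks={"unit-tests": "success", "lint": "pending"})
print(can_merge(pr, required_checks=["unit-tests", "lint"]))  # lint is still pending

pr.checks["lint"] = "success"
print(can_merge(pr, required_checks=["unit-tests", "lint"]))
```

Only the checks listed as required gate the merge; an optional check can fail without blocking it in this model.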

Question 4

How does GitHub handle very large repositories?

Accepted Answer

Git struggles with repositories containing millions of files or very long histories. Solutions: (1) Sparse checkout -- only populate the working tree with the needed directories. A developer working on /frontend does not need /backend files locally. On its own this limits what is checked out, not what is downloaded, so it is usually paired with partial clone. (2) Partial clone -- omit some objects (for example, file contents) at clone time and download them on demand instead of fetching the full history. Initial clone is fast; missing objects are fetched when needed (e.g., checking out an old commit). (3) Virtual filesystem (VFS for Git, formerly GVFS, by Microsoft) -- intercepts file access at the OS level, fetching from the server on demand. The local checkout appears complete but most files are not actually downloaded. Microsoft's later tool, Scalar, moved away from virtualization and builds on sparse checkout and partial clone instead. (4) Git LFS (Large File Storage) -- store large binary files (images, models, datasets) in separate storage, replacing them with lightweight pointers in the repo. The actual binaries are downloaded on checkout. GitHub supports sparse checkout, partial clone, and Git LFS, which is how large monorepos are typically hosted there.
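To show what a "lightweight pointer" looks like, here is a small sketch that builds a Git LFS pointer file for some content, following the three-line layout of the Git LFS v1 pointer format (version, oid, size). The function name `lfs_pointer` is made up for this example; in real use the Git LFS client generates these files.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text Git commits in place of a large file tracked by Git LFS."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# Stand-in for a large binary; a real model or dataset file would be read from disk.
fake_binary = b"\x00" * 1_000_000
pointer = lfs_pointer(fake_binary)

print(pointer)
print(len(pointer.encode()), "bytes stored in Git instead of", len(fake_binary))
```

The repository history only ever holds the pointer, which is a little over a hundred bytes, while the binary itself lives in the LFS store and is looked up by its SHA-256 on checkout.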

System Design: Design GitHub — Code Hosting, Git Internals, Pull Requests, CI/CD, Code Review, Repository Scaling

Git Storage Internals

Repository Hosting Architecture

Pull Requests and Code Review

CI/CD Integration

Search and Discovery