System Design: Design GitHub — Code Hosting, Git Internals, Pull Requests, CI/CD, Code Review, Repository Scaling

GitHub hosts more than 200 million repositories for more than 100 million developers. Designing a code hosting platform tests your understanding of distributed version control (Git internals), collaborative workflows (pull requests, code review), CI/CD integration, and storage that scales to billions of Git objects. This guide follows the architecture from repository creation to merge. It is an unusual system design question because it combines storage, collaboration, and developer tooling.

Git Storage Internals

Git stores data as content-addressed objects that form a DAG (directed acyclic graph). There are four object types:

(1) Blob: the content of a file. Its ID is the SHA-1 hash of a short header (the word “blob”, the size in bytes, and a NUL byte) followed by the content, so two files with identical content share one blob regardless of name or path.
(2) Tree: a directory listing. Maps each name and file mode to a blob ID (a file) or another tree ID (a subdirectory).
(3) Commit: a snapshot. Points to a root tree (the project state at that commit) and to its parent commit(s) (history), and records author, committer, timestamps, and message.
(4) Tag: an annotated tag object names another object, usually a commit that marks a release, and records a tagger and message. Lightweight tags are plain references with no object of their own.

A repository is a collection of these objects plus references (branches, tags) that point to commits. A branch is just a pointer: “main” holds a commit ID, and a push moves that pointer.

Storage: objects created locally start as individual zlib-compressed “loose” files, and Git periodically repacks them into packfiles, which are compressed archives that use delta compression. Similar objects (for example, successive versions of the same file) are stored as deltas against a base object. As an illustrative figure, a repository with 100,000 commits whose uncompressed history totals 10 GB might pack down to roughly 500 MB; the actual ratio depends heavily on the content. GitHub replicates repository data across multiple storage servers and datacenters.
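Content addressing is easy to reproduce outside Git. The sketch below hashes a blob and a one-entry tree the way Git does for SHA-1 repositories: the ID is the SHA-1 of a header ("<type> <size>" plus a NUL byte) followed by the object body. The helper names git_object_id and tree_entry are illustrative and not part of any Git API.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object the way Git does: SHA-1 over '<type> <size>\\0' + body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_entry(mode: str, name: str, object_id: str) -> bytes:
    """Encode one tree entry: '<mode> <name>\\0' followed by the raw 20-byte ID."""
    return f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(object_id)


# Identical content always hashes to the same blob ID, whatever the file is called.
blob_id = git_object_id("blob", b"hello\n")
same_blob_id = git_object_id("blob", b"hello\n")

# A tree holding a single regular file (mode 100644) named hello.txt.
# Real trees list several entries sorted by name; one entry needs no sorting.
tree_id = git_object_id("tree", tree_entry("100644", "hello.txt", blob_id))

print(blob_id)  # ce013625030ba8dba906f756967f9e9ca394464a
```

Running printf 'hello\n' | git hash-object --stdin in any SHA-1 repository should print the same blob ID, which is a quick way to check the sketch against real Git.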

Repository Hosting Architecture

When a user pushes code:

(1) The Git client sends a pack of new objects over the Git smart HTTP protocol or SSH.
(2) The receiving server checks permissions (does this user have push access?) and validates the objects (well-formed, fully connected, valid reference names).
(3) The objects are written to the repository storage backend.
(4) References are updated atomically: the branch pointer moves to the new commit only if it still points at the commit the client expected.
(5) Webhooks fire (CI/CD, notifications, integrations).

Storage backend: GitHub's engineering blog describes its backend (DGit, later renamed Spokes) as application-level replication and not a distributed filesystem in the style of HDFS. Each repository lives as an ordinary Git repository on local disk on three file servers, and a write is accepted only after a majority of replicas has committed it. Reads for hot repositories (popular open-source projects with many clones) can be served from any replica.

Large monorepos: Git struggles with repositories that contain millions of files. Mitigations include sparse checkout (materialize only the needed directories in the working tree), partial clone (fetch objects on demand instead of downloading the full history up front), and a virtual filesystem (Microsoft's VFS for Git, formerly GVFS, which intercepts file access and fetches content from the server on demand). Microsoft's later Scalar tool drops the virtual filesystem and builds on partial clone and sparse checkout instead. GitHub supports partial clone and sparse checkout for large repositories.
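The permission check and the atomic reference update in the push steps above can be sketched as a compare-and-swap: the client states which commit it believes the branch points at, and the server moves the pointer only if that is still true. RefStore and handle_push are hypothetical names for this illustration, the store is an in-memory dictionary, and None stands in for "the ref does not exist yet"; a real server also validates the received objects first.

```python
import threading


class RefStore:
    """In-memory stand-in for one repository's references (hypothetical)."""

    def __init__(self):
        self._refs = {}
        self._lock = threading.Lock()

    def get(self, name):
        with self._lock:
            return self._refs.get(name)

    def compare_and_swap(self, name, expected_old, new):
        """Move `name` to `new` only if it still points at `expected_old`."""
        with self._lock:
            if self._refs.get(name) != expected_old:
                return False
            self._refs[name] = new
            return True


def handle_push(refs, writers, user, ref_name, old_id, new_id):
    """Apply one ref update from a push and return a short status string."""
    if user not in writers:
        return "rejected: no push access"
    if not refs.compare_and_swap(ref_name, old_id, new_id):
        return "rejected: stale ref, fetch first"
    return "ok"


refs = RefStore()
writers = {"alice"}
first = handle_push(refs, writers, "alice", "refs/heads/main", None, "a1")
denied = handle_push(refs, writers, "bob", "refs/heads/main", "a1", "b2")
stale = handle_push(refs, writers, "alice", "refs/heads/main", None, "c3")
second = handle_push(refs, writers, "alice", "refs/heads/main", "a1", "b2")
```

The stale case is what prevents two concurrent pushes from silently overwriting each other: the second writer is told to fetch and retry.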

Pull Requests and Code Review

A pull request (PR) proposes merging changes from one branch into another.

Data model: pr_id, source_branch, target_branch, author, title, description, status (open/merged/closed), reviewers, comments, checks (CI status), created_at, merged_at.

Diff computation: GitHub diffs the merge base (the most recent common ancestor of source and target) against the source branch tip. This shows only the changes the PR introduces, not unrelated changes that have landed on the target branch since the branches diverged. The diff is computed server-side with Git's diff algorithms, and the web UI adds syntax highlighting and annotations. For very large diffs, GitHub truncates the display; the complete diff can still be fetched as a raw .diff or .patch file.

Code review: reviewers comment on specific lines. Each comment records the file path, the position in the diff, and the commit SHA. Anchoring a comment to a specific version keeps it meaningful after the code changes: when a later push alters those lines, the comment is marked outdated instead of pointing at the wrong code. Review states: approved, changes requested, commented.

Branch protection rules can enforce a minimum number of approvals, passing CI checks, no force pushes, and dismissal of stale approvals when new commits are pushed.

Merge strategies: merge commit (preserves the full branch history), squash merge (combines all of the PR's commits into one commit on the target), and rebase merge (replays the PR's commits on top of the target).
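To make the merge base concrete, here is a minimal sketch over a toy commit graph. The graph is a plain dictionary from commit name to parent names, and ancestors and merge_bases are illustrative helpers, not Git's actual implementation (which is considerably more optimized). A merge base is a common ancestor that is not itself an ancestor of another common ancestor.

```python
def ancestors(parents, start):
    """Return every commit reachable from `start`, including `start` itself."""
    seen = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def merge_bases(parents, a, b):
    """Return the best common ancestors of commits `a` and `b`."""
    common = ancestors(parents, a) & ancestors(parents, b)
    # Drop any common ancestor that is a proper ancestor of another one.
    dominated = set()
    for commit in common:
        dominated |= ancestors(parents, commit) - {commit}
    return common - dominated


# History:  A <- B <- C   (target branch tip: C)
#                B <- D <- E   (source branch tip: E)
history = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}

base = merge_bases(history, "C", "E")
# Commits the PR introduces: reachable from the source tip but not the target tip.
pr_commits = ancestors(history, "E") - ancestors(history, "C")
```

Here base is {"B"} and pr_commits is {"D", "E"}, so commit C on the target never appears in the PR. The function returns a set because criss-cross merges can produce more than one merge base.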

CI/CD Integration

GitHub Actions: CI/CD workflows are defined in YAML files under .github/workflows/ and triggered by events such as push, pull_request, schedule, and workflow_dispatch. When a PR is opened or updated:

(1) GitHub fires a pull_request event.
(2) Matching workflows start on GitHub-hosted or self-hosted runners.
(3) Each job runs its steps: check out code, install dependencies, run tests, build, deploy.
(4) Results report back to the PR as checks: queued or in progress while running, then a conclusion such as success or failure.

The PR page shows the status of each check, and branch protection can require specific checks to pass before merging.

Reporting from external CI: systems such as Jenkins and CircleCI report results through the commit status API (states: pending, success, failure, error) or, when integrated as a GitHub App, through the Checks API. Each check run has a name, a status, a conclusion, and optional annotations (inline messages that point to specific lines, such as a failing test).

Webhooks: for each event a webhook subscribes to (push, PR, issue, comment, and so on), GitHub sends an HTTP POST with a JSON payload to the configured URL. This enables external CI triggers, Slack notifications, deployment automation, and custom bots.

GitHub Apps: a more structured integration model than personal tokens or OAuth apps. Apps have fine-grained permissions and short-lived, installation-scoped tokens, receive webhooks for the events they subscribe to, and act as the app itself and not as a user. Used by: Dependabot, CodeQL, and thousands of marketplace integrations.
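The rule that branch protection can require checks to pass before merging reduces to a small aggregation over check runs. The sketch below is one possible policy, not GitHub's implementation: CheckRun and combined_state are hypothetical names, and treating the neutral and skipped conclusions as passing is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption for this sketch: these conclusions count as passing.
PASSING = {"success", "neutral", "skipped"}


@dataclass
class CheckRun:
    name: str
    status: str  # "queued", "in_progress", or "completed"
    conclusion: Optional[str] = None  # set once status is "completed"


def combined_state(required, runs):
    """Collapse the required checks into 'success', 'pending', or 'failure'."""
    by_name = {run.name: run for run in runs}
    state = "success"
    for name in required:
        run = by_name.get(name)
        if run is None or run.status != "completed":
            state = "pending"  # keep scanning: a failure elsewhere outranks pending
        elif run.conclusion not in PASSING:
            return "failure"
    return state


runs = [
    CheckRun("test", "completed", "success"),
    CheckRun("lint", "in_progress"),
    CheckRun("docs", "completed", "failure"),  # not required, so it cannot block
]
state = combined_state(["test", "lint"], runs)
can_merge = state == "success"
```

A required check that has not reported at all counts as pending, so a PR cannot merge just because a workflow never started.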

Search and Discovery

Code search: GitHub indexes source code from public repositories, plus private repositories for the users and organizations that have access to them, for full-text search. The code index covers file contents and file paths, and results can be filtered by repository, organization, and language; repository names and commit messages are searchable through separate search types. Search syntax: language:python "def train_model" finds Python files that contain that exact string. The index is built by a custom search engine designed for code: it matches substrings and regular expressions and extracts symbols such as function definitions, where a general-purpose text engine would tokenize on words.

Repository discovery: trending repositories (based on star velocity), topic pages (curated collections by technology), and personalized recommendations (based on starred repos and contribution history).

GitHub Copilot integration: the AI coding assistant uses repository context (open files, recent edits, and the broader codebase) to generate code suggestions. For editor completions this is a read-only integration: Copilot reads code and proposes changes, and nothing reaches the repository until a developer accepts and commits them.

Issues and Projects: lightweight project management. Issues track bugs and features. Projects (tables and kanban boards) organize issues into workflows. These are stored as database records (not Git objects) with full-text search indexing.
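One common way to index code for substring search is an inverted index over character trigrams: a query is broken into trigrams, the posting lists are intersected to find candidate files, and each candidate is then verified against the real content. The sketch below illustrates that general technique; it is not a description of GitHub's engine, and CodeIndex and its methods are hypothetical names.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class CodeIndex:
    """Toy trigram index. Re-adding a changed file would leave stale postings."""

    def __init__(self):
        self._docs = {}  # path -> (language, content)
        self._postings = defaultdict(set)  # trigram -> paths containing it

    def add(self, path, language, content):
        self._docs[path] = (language, content)
        for gram in trigrams(content):
            self._postings[gram].add(path)

    def search(self, phrase, language=None):
        grams = trigrams(phrase)
        if grams:
            # Only files containing every trigram of the phrase can match.
            candidates = set.intersection(
                *(self._postings.get(gram, set()) for gram in grams)
            )
        else:
            candidates = set(self._docs)  # phrase under 3 characters: scan all
        # Verify candidates, since sharing trigrams does not guarantee a match.
        return sorted(
            path
            for path in candidates
            if phrase in self._docs[path][1]
            and (language is None or self._docs[path][0] == language)
        )


index = CodeIndex()
index.add("train.py", "python", "def train_model(data):\n    return fit(data)\n")
index.add("train.go", "go", "// def train_model is the Python name\nfunc TrainModel() {}\n")
index.add("util.py", "python", "def load(path):\n    return open(path).read()\n")

python_hits = index.search("def train_model", language="python")
all_hits = index.search("def train_model")
```

With these three files, python_hits contains only train.py, while all_hits also includes train.go because the phrase appears in a Go comment. The language filter plays the role of the language:python qualifier in the query syntax above.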
