System Design Interview: Code Repository and CI/CD (GitHub / GitLab)

GitHub hosts 300M repositories and processes 2B+ Git operations daily. Designing a code hosting platform combines distributed storage (Git objects), real-time collaboration (pull requests, code review), and event-driven automation (CI/CD pipelines). This question appears at GitHub, GitLab, Atlassian (Bitbucket), and at large tech companies building internal developer platforms.

Requirements

Functional: Host Git repositories (push/pull/clone). Create and review pull requests. Trigger CI/CD pipelines on push and PR events. Search code across repositories. View commit history, diffs, and blame. Manage access control (org → team → repo → branch permissions).

Non-functional: Clone of a large repo (Linux kernel, 1GB) completes in under 30 seconds. PR creation and review activity propagate to all reviewers within 2 seconds. 99.99% uptime — a Git hosting outage blocks all development. Scale: 100M repositories, 50M developers, 10M push events/day.

Git Object Model

Git stores four object types as content-addressed objects, keyed by the SHA-1 (or SHA-256) hash of their content:

  • Blob: file content. Two files with the same content share one blob (deduplication by default).
  • Tree: directory listing mapping filenames to blob/tree SHA hashes.
  • Commit: author, timestamp, message, pointer to root tree, pointer to parent commit(s).
  • Tag: named pointer to a commit with optional GPG signature.

The content-addressable model means Git naturally deduplicates: if you add the same file twice, only one blob is stored. Packfiles (delta compression): Git groups objects into packfiles and stores deltas between similar blobs, achieving 10-50x compression. A packfile for the Linux kernel (~80K files, 1M commits) is ~2GB on disk vs ~50GB uncompressed.
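The content-addressing scheme above is simple enough to show directly. This sketch computes a Git blob object ID the way Git itself does: hash a `blob <size>\0` header concatenated with the raw content.

```python
import hashlib

def git_blob_sha(content: bytes) -> str:
    """Compute a Git blob object ID: SHA-1 over a 'blob <size>\\0' header plus the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Identical content always hashes to the same object ID, so it is stored once —
# this is the deduplication the text describes.
a = git_blob_sha(b"hello world\n")
b = git_blob_sha(b"hello world\n")
assert a == b

# Matches `git hash-object` for the same bytes:
print(git_blob_sha(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Because the key is a pure function of the content, any mutation of a file changes its blob SHA, which changes the enclosing tree SHA, which changes the commit SHA — the tamper-evidence property discussed in the FAQ below follows directly.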

Storage Architecture

Repository data is stored in a distributed object store, not a traditional database:

  • Small repos (<1GB): store packfiles directly in S3 or GCS. Packfile path: s3://repos/{org}/{repo}/{pack_sha}.pack
  • Large repos (monorepos): Microsoft (VFS for Git, formerly GVFS) and Meta (Sapling, originally Mercurial-based) maintain custom client/server tooling with virtual filesystem layers that fetch only the files actually accessed, enabling 300GB monorepos to clone in seconds
  • Object cache: Redis or Memcached caches hot objects (recent commits, branch tips). Reduces S3 reads for popular repos

Git refs (branch names, HEAD) are stored in a key-value store (PostgreSQL or etcd): ref_name → commit_sha. Atomic ref updates are critical — two concurrent pushes to the same branch must not both succeed (optimistic locking with compare-and-swap).
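The compare-and-swap ref update can be sketched in a few lines. This is an in-memory stand-in for what production systems express as a conditional `UPDATE ... WHERE sha = old` in PostgreSQL or a transaction in etcd; the class and method names are illustrative.

```python
import threading
from typing import Optional

class RefStore:
    """Toy ref store with atomic compare-and-swap updates."""
    def __init__(self):
        self._refs: dict = {}
        self._lock = threading.Lock()

    def update_ref(self, ref: str, expected_old: Optional[str], new_sha: str) -> bool:
        """Set `ref` to `new_sha` only if it currently points at `expected_old`."""
        with self._lock:
            current = self._refs.get(ref)
            if current != expected_old:
                return False  # another push won the race; client must fetch and retry
            self._refs[ref] = new_sha
            return True

store = RefStore()
assert store.update_ref("refs/heads/main", None, "aaa111")          # create branch
assert store.update_ref("refs/heads/main", "aaa111", "bbb222")      # fast-forward
assert not store.update_ref("refs/heads/main", "aaa111", "ccc333")  # stale push rejected
```

The losing pusher sees the rejection as Git's familiar "fetch first" error: exactly one of two concurrent pushes to the same branch succeeds.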

Pull Request Workflow

A pull request (PR) is a database record: PR_id, author, head_ref, base_ref, title, body, status (open/merged/closed), review_requests, comments. Storage: relational database (PostgreSQL) for PR metadata and review state; object store for the diff.

Diff computation: compute the merge base between head and base branches, then generate a unified diff. For large PRs (10K line diffs), diff is computed lazily on first view and cached. Syntax highlighting is applied server-side with a tree-sitter parser per language.
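The merge-base step can be illustrated on a toy commit graph. This sketch walks parent pointers to find the nearest common ancestor of the head and base branches (real Git uses generation numbers and handles multi-parent merges more carefully).

```python
from collections import deque
from typing import Optional

def merge_base(parents: dict, head: str, base: str) -> Optional[str]:
    """Nearest common ancestor of two commits.
    `parents` maps commit SHA -> list of parent SHAs (a toy commit graph)."""
    def ancestors(start: str) -> set:
        seen, stack = set(), [start]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(parents.get(c, []))
        return seen

    base_ancestors = ancestors(base)
    # BFS outward from head, so the first common ancestor found is the closest
    queue, seen = deque([head]), set()
    while queue:
        c = queue.popleft()
        if c in base_ancestors:
            return c
        if c not in seen:
            seen.add(c)
            queue.extend(parents.get(c, []))
    return None

#   A -- B -- C       (base: main)
#         \
#          D -- E     (head: feature)
graph = {"B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}
print(merge_base(graph, head="E", base="C"))  # B
```

The unified diff is then computed between the merge base's tree and the head tree, so the PR shows only the author's changes, not unrelated commits that landed on the base branch.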

Real-time updates: when a reviewer adds a comment, an event is published to a Kafka topic. A WebSocket gateway (Pusher, ActionCable, or custom) fans out the event to all PR participants who have the page open. The PR page subscribes to a channel identified by PR_id.
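The fan-out half of this pipeline is a small amount of logic. Below is a minimal sketch (class and channel names are illustrative) of the gateway's job: keep one channel per PR and push each consumed event to every open connection subscribed to it.

```python
class PRChannelHub:
    """Sketch of the WebSocket gateway's fan-out: one channel per PR,
    each event delivered to every subscribed connection."""
    def __init__(self):
        self.channels = {}  # channel name -> list of send callbacks (open sockets)

    def subscribe(self, channel: str, send) -> None:
        self.channels.setdefault(channel, []).append(send)

    def publish(self, channel: str, event: dict) -> None:
        # In production this is driven by a Kafka consumer; here we call it directly.
        for send in self.channels.get(channel, []):
            send(event)

# Two reviewers have the PR page open; both receive the comment event.
inbox_a, inbox_b = [], []
hub = PRChannelHub()
hub.subscribe("pr:42", inbox_a.append)
hub.subscribe("pr:42", inbox_b.append)
hub.publish("pr:42", {"type": "comment", "author": "alice", "body": "LGTM"})
assert inbox_a == inbox_b == [{"type": "comment", "author": "alice", "body": "LGTM"}]
```

Keying channels by PR_id keeps fan-out bounded: an event only touches the handful of participants with that page open, never the whole connection pool.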

CI/CD Pipeline Dispatch

Every push event triggers pipeline evaluation:

  1. Git server publishes a push event to Kafka: {repo_id, commit_sha, ref, pusher}
  2. Pipeline dispatcher consumes the event, loads the pipeline config (.github/workflows/*.yml or .gitlab-ci.yml), evaluates which pipelines match the event (branch filters, path filters)
  3. Dispatcher enqueues jobs to a task queue (Redis/SQS) partitioned by organization (fair queuing — one noisy org cannot starve others)
  4. Runner agents (GitHub Actions runners, GitLab runners) pick jobs from the queue, clone the repo at the commit SHA, execute job steps inside a container (Docker or Firecracker microVM for isolation), and report status back via API
  5. Status updates are published via WebSocket to the PR page and repository dashboard
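Step 2 — evaluating which pipelines match the event — can be sketched with glob matching. This is a simplified model of Actions-style `branches`/`paths` filters; note that Python's `fnmatch` lets `*` cross `/` boundaries, unlike real workflow globs, which is fine for this illustration.

```python
import fnmatch

def matching_workflows(event: dict, workflows: list) -> list:
    """Return names of workflows whose branch and path filters accept a push event.
    Workflows with no filter default to matching everything."""
    matched = []
    for wf in workflows:
        branches = wf.get("branches", ["*"])
        paths = wf.get("paths", ["*"])
        branch_ok = any(fnmatch.fnmatch(event["branch"], p) for p in branches)
        paths_ok = any(fnmatch.fnmatch(f, p)
                       for f in event["changed_files"] for p in paths)
        if branch_ok and paths_ok:
            matched.append(wf["name"])
    return matched

event = {"branch": "release/1.2", "changed_files": ["src/auth/login.py", "README.md"]}
workflows = [
    {"name": "ci", "branches": ["main", "release/*"]},
    {"name": "docs", "paths": ["docs/*"]},
    {"name": "deploy", "branches": ["main"]},
]
print(matching_workflows(event, workflows))  # ['ci']
```

The dispatcher runs this evaluation per push, then expands each matched workflow into jobs before enqueuing — so filter evaluation must be cheap, which is why configs are cached alongside the repo's branch tips.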

Runner autoscaling: runners are EC2 spot instances or Kubernetes pods. The dispatcher monitors queue depth and triggers autoscaling when queue depth exceeds threshold. Scale-to-zero after 10 minutes of inactivity to reduce cost.
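The scaling decision itself is a simple policy function. This sketch (a hypothetical policy, not any platform's actual algorithm) converts queue depth into a target runner count and scales to zero after the stated idle timeout.

```python
import math

def desired_runners(queue_depth: int, idle_minutes: float,
                    jobs_per_runner: int = 4, max_runners: int = 50) -> int:
    """Target runner count from queue depth; scale to zero after 10 idle minutes."""
    if queue_depth == 0:
        # Keep one warm runner briefly so a new push doesn't pay cold-start latency.
        return 0 if idle_minutes >= 10 else 1
    return min(math.ceil(queue_depth / jobs_per_runner), max_runners)

assert desired_runners(0, idle_minutes=12) == 0    # scale-to-zero
assert desired_runners(0, idle_minutes=2) == 1     # still within idle window
assert desired_runners(37, idle_minutes=0) == 10   # ceil(37 / 4)
assert desired_runners(1000, idle_minutes=0) == 50 # capped at max fleet size
```

In practice the controller also damps oscillation (scale up aggressively, scale down slowly) since spot-instance launch latency is tens of seconds.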

Code Search

GitHub Code Search indexes ~200TB of code across 300M repositories. Architecture:

  • Indexing pipeline: on push, enqueue the diff (changed files) for indexing. The indexer tokenizes code using language-aware tokenizers (split on CamelCase, underscores, special characters), builds inverted index shards, and merges into the search cluster
  • Search backend: custom Elasticsearch-like index optimized for code (trigram index for substring search, language filter, repo/org filter). Trigram index: index every 3-character substring — posting-list intersection narrows any substring or regex query to a small candidate set, which is then verified exactly
  • Query parsing: support symbols (language:python function:authenticate), regex (/pattern/), file path filters (path:src/auth), owner filters (org:github)

Freshness: new pushes appear in search within 60 seconds. This is achieved by processing the incremental diff (only changed files) rather than re-indexing the full repo.
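The trigram technique from the list above fits in a short sketch: index every 3-character window of each document, intersect posting lists for a query's trigrams, then verify candidates exactly.

```python
from collections import defaultdict

def trigrams(s: str) -> set:
    """All 3-character substrings of s."""
    return {s[i:i+3] for i in range(len(s) - 2)}

class TrigramIndex:
    """Toy trigram index: a doc can contain the query only if it contains
    every trigram of the query, so the intersection prunes the search space."""
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> doc ids containing it
        self.docs = {}

    def add(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text
        for t in trigrams(text):
            self.postings[t].add(doc_id)

    def search(self, query: str) -> list:
        qgrams = trigrams(query)
        if not qgrams:                       # query shorter than 3 chars
            candidates = set(self.docs)      # cannot prune; scan everything
        else:
            candidates = set.intersection(
                *(self.postings.get(t, set()) for t in qgrams))
        # Trigram containment is necessary but not sufficient: verify exactly.
        return sorted(d for d in candidates if query in self.docs[d])

idx = TrigramIndex()
idx.add("auth.py", "def authenticate(user):")
idx.add("main.py", "def main():")
print(idx.search("authenticate"))  # ['auth.py']
```

Regex search works the same way: extract the literal trigrams a regex must contain, intersect their posting lists, and run the full regex only on the surviving candidates.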

Access Control

Permissions form a hierarchy: Organization → Team → Repository → Branch. At each push, the Git server checks: does the pusher have write access to this repo? Can they push to this branch (branch protection rules)? Do required status checks pass? This check runs on the hot path of every push — cached in Redis with a 5-minute TTL to avoid database reads per operation.
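The hot-path check with a TTL cache looks roughly like this. The dict below is an in-memory stand-in for Redis, and `load_from_db` is a hypothetical placeholder for the real org → team → repo → branch hierarchy walk.

```python
import time

class PermissionCache:
    """Push-path permission check with a 5-minute TTL cache in front of the database."""
    def __init__(self, load_from_db, ttl_seconds: float = 300):
        self.load_from_db = load_from_db   # callable(user, repo) -> bool
        self.ttl = ttl_seconds
        self.cache = {}                    # (user, repo) -> (allowed, expires_at)

    def can_push(self, user: str, repo: str) -> bool:
        now = time.monotonic()
        entry = self.cache.get((user, repo))
        if entry and entry[1] > now:
            return entry[0]                # cache hit: no database read on this push
        allowed = self.load_from_db(user, repo)
        self.cache[(user, repo)] = (allowed, now + self.ttl)
        return allowed

db_calls = []
def fake_db(user, repo):
    db_calls.append((user, repo))
    return user == "alice"

perms = PermissionCache(fake_db)
assert perms.can_push("alice", "acme/api")
assert perms.can_push("alice", "acme/api")  # second push served from cache
assert len(db_calls) == 1
```

The TTL is the trade-off: a revoked permission can linger for up to 5 minutes, so revocations that must take effect immediately also delete the cache entry explicitly.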

Monorepo at Scale

Google has a single monorepo (Piper) with 80K engineers, 2B lines of code, and 45K commits/day. Key techniques:

  • Virtual filesystem (Microsoft's GVFS; Google's CitC FUSE client): sparse checkout — the working tree only materializes files actually opened. Clone takes seconds instead of hours
  • Build graph: Bazel/Blaze computes which build targets are affected by a change using a file→target dependency graph. Only affected tests run in CI
  • Code ownership: CODEOWNERS files (GitHub) or OWNERS files (Google) define who must approve changes to each directory
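The affected-targets computation from the list above is reverse reachability over the dependency graph. A sketch under a toy graph (Bazel-style labels here are illustrative):

```python
def affected_targets(deps: dict, changed_files: set) -> set:
    """Targets transitively depending on any changed file.
    `deps` maps each target to the set of files and targets it depends on."""
    # Invert the edges: dependency -> targets that consume it
    rdeps = {}
    for target, inputs in deps.items():
        for dep in inputs:
            rdeps.setdefault(dep, set()).add(target)
    # Walk consumers outward from the changed files
    affected, stack = set(), list(changed_files)
    while stack:
        node = stack.pop()
        for consumer in rdeps.get(node, ()):
            if consumer not in affected:
                affected.add(consumer)
                stack.append(consumer)
    return affected

deps = {
    "//auth:lib":    {"auth/login.py"},
    "//auth:test":   {"//auth:lib"},
    "//web:server":  {"//auth:lib", "web/app.py"},
    "//docs:site":   {"docs/index.md"},
}
# Editing auth/login.py affects the auth lib, its test, and the web server;
# //docs:site is untouched, so its tests are skipped in CI.
assert affected_targets(deps, {"auth/login.py"}) == {
    "//auth:lib", "//auth:test", "//web:server"}
```

This is why CI cost on a monorepo scales with the size of the change, not the size of the repository: only the affected subgraph's tests run.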

Frequently Asked Questions

How does Git store data efficiently using content-addressable storage?

Git stores every piece of data as a content-addressed object: the filename (key) is the SHA-1 or SHA-256 hash of the content. Objects come in four types: blobs (file contents), trees (directory listings), commits (snapshot metadata with parent pointers), and tags. Because the key is derived from content, two identical files automatically share one blob object — deduplication is built in. A commit SHA is deterministic given its content: the author, timestamp, message, and the root tree SHA. This means any change to any file in any subdirectory produces a completely different commit SHA, making tampering detectable. Packfiles apply delta compression on top: Git groups similar blobs and stores only the differences (delta encoding), achieving 10-50x compression. The Linux kernel repository has ~1M commits and ~80K files but the packfile is only ~2GB.

How does a pull request review workflow work at scale?

A pull request is a database record linking two Git refs (head branch and base branch) with metadata: author, reviewers, status, comments. The diff is computed by finding the merge base commit (common ancestor), then comparing the head and base trees. For large diffs, computation is deferred — the server generates and caches the diff on first view request. Real-time collaboration works via a pub/sub system: when any participant (reviewer or author) makes a change (comments, suggestions, approvals), an event is published to a message broker (Kafka). A WebSocket gateway maintains persistent connections with all online PR participants and pushes the event to all connected clients. At GitHub's scale (hundreds of millions of PRs), the PR metadata lives in sharded PostgreSQL; diffs and large text blobs in object storage (S3); hot PR data (recent activity, open PRs) cached in Redis.

How does CI/CD pipeline dispatch work on a platform like GitHub Actions?

CI/CD pipeline dispatch is event-driven: every Git event (push, pull request, tag, scheduled time) is published to an internal event bus. The pipeline dispatcher consumes events and evaluates which workflow files (.github/workflows/*.yml) match the event based on trigger conditions (branch patterns, path filters, event types). Matching workflows are decomposed into jobs, and each job is enqueued to a distributed task queue partitioned by organization — ensuring that one organization with thousands of concurrent builds cannot starve another. Runner agents (virtual machines or containers) poll their assigned queues, dequeue jobs, spin up isolated execution environments (Docker containers or Firecracker microVMs for security isolation), clone the repository at the exact commit SHA, execute job steps sequentially, and report status and logs back to the platform API. Autoscaling monitors queue depth: when depth exceeds a threshold, new runner instances are launched (EC2 spot, AKS node pools). Runners scale to zero after idle timeout to minimize cost.

