Designing a Git hosting platform like GitHub tests your ability to handle version control storage at scale, real-time collaboration (PRs, issues, code review), and distributed CI/CD pipelines. This question appears at GitHub, GitLab, Atlassian, and Microsoft.
Requirements Clarification
- Scale: 100M users, 500M repositories, 1B git operations/day (clone, push, fetch)
- Features: Git hosting, pull requests, code review, issues, Actions (CI/CD), code search
- Repository sizes: Median 1MB; P95 100MB; some monorepos 50GB+
- Availability: 99.99% for git operations (critical for developers); 99.9% for web UI
- Latency: git clone/fetch <5s for median repo; push <2s
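A quick back-of-envelope pass over these numbers helps size the git fleet; the 3x peak factor is an assumption, not a stated requirement:

```python
# Rough capacity estimation from the stated requirements.
# The 3x peak-to-average factor is an assumption for illustration.
OPS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3

avg_ops_per_sec = OPS_PER_DAY / SECONDS_PER_DAY    # ~11,600 git ops/s average
peak_ops_per_sec = avg_ops_per_sec * PEAK_FACTOR   # ~35,000 git ops/s at peak

# Storage: 500M repos at the 1 MB median is only ~500 TB before replication,
# but the long tail (P95 = 100 MB, monorepos 50 GB+) dominates real capacity.
median_storage_tb = 500_000_000 * 1 / 1_000_000    # MB -> TB
```

Even the average rate rules out a single git server; the design below assumes requests fan out across a replicated storage tier.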
Repository Storage Architecture
"""
Git Object Model:
blob → file contents (deduplicated by content hash)
tree → directory listing (blob + sub-tree references)
commit → snapshot (tree ref + parent commits + metadata)
tag → named pointer to commit
Pack files: git compresses objects into .pack files using delta compression.
Advantage: 95%+ storage savings (most file versions differ slightly)
Challenge: reading a file may require replaying many deltas
Storage strategy for GitHub-scale:
1. Store git repos on distributed file system (GFS, Ceph, custom)
2. Replicate to 3+ nodes; route reads to nearest replica
3. Separate "hot" repos (active) from "cold" (archived) with tiered storage
4. Deduplicate pack objects across forks (fork networks share objects)
"""
from dataclasses import dataclass
from typing import Optional, List
import hashlib

@dataclass
class GitObject:
    obj_type: str  # blob | tree | commit | tag
    data: bytes
    oid: str = ""  # content hash (SHA-1 historically; SHA-256 in newer repos)

    def __post_init__(self):
        # Git hashes "<type> <size>\0" followed by the raw object data
        header = f"{self.obj_type} {len(self.data)}\0".encode()
        self.oid = hashlib.sha256(header + self.data).hexdigest()

class ObjectStore:
    """Simplified content-addressable object storage."""

    def __init__(self, backend):
        self.backend = backend  # e.g., S3, Ceph, local disk

    def write(self, obj: GitObject) -> str:
        key = f"objects/{obj.oid[:2]}/{obj.oid[2:]}"
        if not self.backend.exists(key):  # content-addressed: write once, dedupe for free
            self.backend.put(key, obj.data)
        return obj.oid

    def read(self, oid: str) -> Optional[bytes]:
        key = f"objects/{oid[:2]}/{oid[2:]}"
        return self.backend.get(key)

    def exists(self, oid: str) -> bool:
        return self.backend.exists(f"objects/{oid[:2]}/{oid[2:]}")
Git Operations at Scale
"""
Smart HTTP protocol for git:
git fetch/clone → GET /info/refs?service=git-upload-pack
→ POST /git-upload-pack (negotiate + pack transfer)
git push → GET /info/refs?service=git-receive-pack
→ POST /git-receive-pack (pack upload + update refs)
Scaling challenges:
1. Clone of large repos: stream pack data from storage; don't buffer in memory
2. Concurrent pushes: use distributed lock (Zookeeper/Redis) per ref to prevent conflicts
3. Fork networks: forks share pack objects; only store deltas from parent
4. Shallow clones: git clone --depth=1 → send only recent commits (GitHub default for Actions)
"""
import contextlib
from typing import List

import redis

class RefLockManager:
    """Distributed locking for git push operations: prevents concurrent ref conflicts."""

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    @contextlib.contextmanager
    def lock_refs(self, repo_id: str, refs: List[str], timeout: int = 30):
        """Acquire per-ref locks atomically. Release on exit."""
        # Sort keys so concurrent pushes acquire locks in the same order (avoids deadlock)
        lock_keys = sorted(f"reflock:{repo_id}:{ref}" for ref in refs)
        acquired = []
        try:
            for key in lock_keys:
                # NX = set only if absent; EX = expire so a crashed pusher can't wedge the ref
                if not self.redis.set(key, "1", nx=True, ex=timeout):
                    raise RuntimeError(f"Ref conflict: {key} is locked")
                acquired.append(key)
            yield
        finally:
            if acquired:
                self.redis.delete(*acquired)

class PushHandler:
    def handle_push(self, repo_id: str, push_data: dict, lock_mgr: RefLockManager):
        ref_updates = push_data["refs"]  # [{ref, old_oid, new_oid}, ...]
        refs = [u["ref"] for u in ref_updates]
        with lock_mgr.lock_refs(repo_id, refs):
            # 1. Validate objects exist in store
            # 2. Verify fast-forward (or force push permission)
            # 3. Run pre-receive hooks
            # 4. Update ref store atomically
            # 5. Trigger post-receive hooks (CI, notifications, webhooks)
            self._atomic_update_refs(repo_id, ref_updates)
            self._trigger_webhooks(repo_id, ref_updates)

    def _atomic_update_refs(self, repo_id: str, updates: list):
        pass  # Write to distributed ref database

    def _trigger_webhooks(self, repo_id: str, updates: list):
        pass  # Publish to event queue -> CI/CD, notifications
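Step 2 of the push flow (verify fast-forward) reduces to an ancestry check: the new tip must have the old tip as an ancestor, otherwise the push rewrites history and needs force-push permission. A minimal sketch, where `parents` is a stand-in for reading commit objects from the store:

```python
from collections import deque

def is_fast_forward(parents: dict, old_oid: str, new_oid: str) -> bool:
    """True if old_oid is reachable from new_oid via parent links.

    `parents` maps commit oid -> list of parent oids (hypothetical stand-in
    for looking up commit objects in the object store).
    """
    if old_oid == new_oid:
        return True
    queue, seen = deque([new_oid]), set()
    while queue:
        oid = queue.popleft()
        if oid == old_oid:
            return True
        if oid in seen:
            continue
        seen.add(oid)
        queue.extend(parents.get(oid, []))  # walk back through history
    return False

history = {"c3": ["c2"], "c2": ["c1"], "c1": []}
```

A production server bounds this walk (generation numbers, commit-graph files) rather than traversing unbounded history on every push.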
Pull Request and Code Review System
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class PRStatus(Enum):
    OPEN = "open"
    MERGED = "merged"
    CLOSED = "closed"
    DRAFT = "draft"

class ReviewState(Enum):
    APPROVED = "approved"
    CHANGES_REQUESTED = "changes_requested"
    COMMENTED = "commented"

@dataclass
class PullRequest:
    id: int
    repo_id: int
    author_id: int
    title: str
    base_branch: str  # e.g., "main"
    head_branch: str  # e.g., "feature/auth"
    base_sha: str  # Merge base commit
    head_sha: str  # Latest commit on head branch
    status: PRStatus = PRStatus.OPEN
    reviewers: List[int] = field(default_factory=list)
    labels: List[str] = field(default_factory=list)
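The `ReviewState` enum feeds the merge gate. A minimal branch-protection check, assuming a simple rule of N approvals and no outstanding changes-requested review (real branch protection is far more configurable), might look like:

```python
from enum import Enum

class ReviewState(Enum):  # as defined in the PR model above
    APPROVED = "approved"
    CHANGES_REQUESTED = "changes_requested"
    COMMENTED = "commented"

def is_mergeable(reviews: dict, required_approvals: int = 2) -> bool:
    """reviews maps reviewer_id -> ReviewState (latest review per reviewer wins)."""
    states = list(reviews.values())
    # Any outstanding changes-requested review blocks the merge outright
    if any(s == ReviewState.CHANGES_REQUESTED for s in states):
        return False
    approvals = sum(1 for s in states if s == ReviewState.APPROVED)
    return approvals >= required_approvals
```

Keeping only the latest review per reviewer matters: an earlier changes-requested review is superseded when that reviewer later approves.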
class DiffService:
    """Compute and cache diffs for PR display."""

    def __init__(self, git_backend, cache):
        self.git = git_backend
        self.cache = cache

    def get_diff(self, repo_id: int, base_sha: str, head_sha: str) -> dict:
        cache_key = f"diff:{repo_id}:{base_sha}:{head_sha}"
        cached = self.cache.get(cache_key)
        if cached:
            return cached
        # Compute diff server-side
        diff = self.git.diff(repo_id, base_sha, head_sha)
        # Cache computed diff (content-addressed: never changes for the same SHAs)
        self.cache.set(cache_key, diff)
        return diff
class MergeService:
    def merge_pr(self, pr: PullRequest, merge_strategy: str = "merge_commit") -> dict:
        """
        Merge strategies:
          merge_commit: creates an explicit merge commit (preserves history)
          squash: squashes all commits into a single commit (clean linear history)
          rebase: replays commits on top of base (linear, no merge commit)
        """
        strategies = {
            "merge_commit": self._merge_commit,
            "squash": self._squash_merge,
            "rebase": self._rebase_merge,
        }
        fn = strategies.get(merge_strategy, self._merge_commit)
        return fn(pr)

    def _merge_commit(self, pr: PullRequest) -> dict:
        # git merge --no-ff head_sha
        return {"strategy": "merge_commit", "pr_id": pr.id}

    def _squash_merge(self, pr: PullRequest) -> dict:
        # git merge --squash head_sha; git commit
        return {"strategy": "squash", "pr_id": pr.id}

    def _rebase_merge(self, pr: PullRequest) -> dict:
        # git rebase base_sha head_sha; fast-forward base
        return {"strategy": "rebase", "pr_id": pr.id}
Code Search at Scale
"""
GitHub code search challenges:
- 500M repositories, trillions of lines of code
- Must index content of ALL files (not just metadata)
- Search must understand code syntax (not just plain text)
- Updates: new pushes must appear in search within minutes
Architecture:
1. Push event → extract changed files → Kafka topic
2. Indexing workers: tokenize code, extract symbols, write to search engine
3. Zoekt (open-source trigram-based code search engine, used by Sourcegraph and GitLab) or Elasticsearch for serving
4. Shard by repository; replicate for availability
5. Language-aware tokenization (identifiers, strings, comments parsed separately)
"""
SEARCH_INDEX_MAPPING = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "code_tokenizer": {
                    "type": "pattern",
                    "pattern": r"[^a-zA-Z0-9_]",  # Split on non-identifier chars
                }
            },
            "analyzer": {
                "code_analyzer": {
                    "type": "custom",
                    "tokenizer": "code_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "repo_id": {"type": "keyword"},
            "repo_name": {"type": "keyword"},
            "file_path": {"type": "keyword"},
            "language": {"type": "keyword"},
            "content": {"type": "text", "analyzer": "code_analyzer"},
            "symbols": {"type": "keyword"},  # Function/class names for symbol search
            "indexed_at": {"type": "date"},
        }
    },
}
def search_code(es, query: str, language=None, repo=None, limit=20) -> dict:
    filters = []
    if language:
        filters.append({"term": {"language": language}})
    if repo:
        filters.append({"term": {"repo_name": repo}})
    return es.search(index="code", body={
        "query": {
            "bool": {
                "must": [{"match": {"content": {"query": query, "operator": "and"}}}],
                "filter": filters,
            }
        },
        "highlight": {"fields": {"content": {"number_of_fragments": 3}}},
        "size": limit,
    })
Architecture Overview
| Component | Technology | Reason |
|---|---|---|
| Git object store | Custom + Ceph/S3 | Content-addressed; deduplicated pack files |
| Ref database | MySQL + distributed lock | Atomic ref updates; strong consistency required |
| Code search | Zoekt / Elasticsearch | Code-aware tokenization; symbol search |
| Events/webhooks | Kafka | Decouple push events from CI, notifications |
| CI/CD orchestration | Custom (Actions) | Job queue → ephemeral runner VMs |
| Web API | Ruby on Rails / Go | Rails for rapid dev; Go services for hot paths |
| CDN | Fastly | Raw file serving, asset delivery |
Handling Monorepos
- Problem: Monorepos (50GB+) make clone impractical and diff computation expensive
- Sparse checkout: `git sparse-checkout` fetches only the required directories
- Partial clone: `git clone --filter=blob:none` skips blobs until they are accessed
- Virtual filesystem (VFSforGit): Microsoft's solution, a virtual filesystem that fetches objects on demand
- CODEOWNERS: assign reviewers automatically per directory — scale code review
- Merge queues: serialize PRs targeting main to prevent merge conflicts under high PR volume
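The partial-clone and sparse-checkout techniques above combine into one workflow for working inside a monorepo; `<repo-url>` and the directory names below are placeholders:

```shell
# Clone commit and tree metadata up front, defer blob downloads, then
# materialize only the directories this developer actually works in.
git clone --filter=blob:none --no-checkout <repo-url> mono
cd mono
git sparse-checkout init --cone
git sparse-checkout set services/payments libs/common   # placeholder paths
git checkout main
```

Blobs outside the chosen directories are fetched lazily on first access, so the initial clone of a 50 GB monorepo transfers only a small fraction of the data.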