Designing a Git hosting platform like GitHub tests your ability to handle version control storage at scale, real-time collaboration (PRs, issues, code review), and distributed CI/CD pipelines. This question appears at GitHub, GitLab, Atlassian, and Microsoft.
Requirements Clarification
- Scale: 100M users, 500M repositories, 1B git operations/day (clone, push, fetch)
- Features: Git hosting, pull requests, code review, issues, Actions (CI/CD), code search
- Repository sizes: Median 1MB; P95 100MB; some monorepos 50GB+
- Availability: 99.99% for git operations (critical for developers); 99.9% for web UI
- Latency: git clone/fetch <5s for median repo; push <2s
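A quick back-of-envelope pass over these numbers helps size the git fleet; the 3x peak factor is an assumption, not a stated requirement:

```python
# Rough capacity estimation from the stated requirements.
# The 3x peak-to-average factor is an assumption for illustration.
OPS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3

avg_ops_per_sec = OPS_PER_DAY / SECONDS_PER_DAY    # ~11,600 git ops/s average
peak_ops_per_sec = avg_ops_per_sec * PEAK_FACTOR   # ~35,000 git ops/s at peak

# Storage: 500M repos at the 1 MB median is only ~500 TB before replication,
# but the long tail (P95 = 100 MB, monorepos 50 GB+) dominates real capacity.
median_storage_tb = 500_000_000 * 1 / 1_000_000    # MB -> TB
```

Even the average rate rules out a single git server; the design below assumes requests fan out across a replicated storage tier.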
Repository Storage Architecture
"""
Git Object Model:
blob → file contents (deduplicated by content hash)
tree → directory listing (blob + sub-tree references)
commit → snapshot (tree ref + parent commits + metadata)
tag → named pointer to commit
Pack files: git compresses objects into .pack files using delta compression.
Advantage: 95%+ storage savings (most file versions differ slightly)
Challenge: reading a file may require replaying many deltas
Storage strategy for GitHub-scale:
1. Store git repos on distributed file system (GFS, Ceph, custom)
2. Replicate to 3+ nodes; route reads to nearest replica
3. Separate "hot" repos (active) from "cold" (archived) with tiered storage
4. Deduplicate pack objects across forks (fork networks share objects)
"""
from dataclasses import dataclass
from typing import Optional, List
import hashlib

@dataclass
class GitObject:
    obj_type: str  # blob | tree | commit | tag
    data: bytes
    oid: str = ""  # content hash (SHA-1 historically; SHA-256 in newer repos)

    def __post_init__(self):
        # Git hashes "<type> <size>\0" followed by the raw object data
        header = f"{self.obj_type} {len(self.data)}\0".encode()
        self.oid = hashlib.sha256(header + self.data).hexdigest()

class ObjectStore:
    """Simplified content-addressable object storage."""

    def __init__(self, backend):
        self.backend = backend  # e.g., S3, Ceph, local disk

    def write(self, obj: GitObject) -> str:
        key = f"objects/{obj.oid[:2]}/{obj.oid[2:]}"
        if not self.backend.exists(key):  # content-addressed: write once, dedupe for free
            self.backend.put(key, obj.data)
        return obj.oid

    def read(self, oid: str) -> Optional[bytes]:
        key = f"objects/{oid[:2]}/{oid[2:]}"
        return self.backend.get(key)

    def exists(self, oid: str) -> bool:
        return self.backend.exists(f"objects/{oid[:2]}/{oid[2:]}")
Git Operations at Scale
"""
Smart HTTP protocol for git:
git fetch/clone → GET /info/refs?service=git-upload-pack
→ POST /git-upload-pack (negotiate + pack transfer)
git push → GET /info/refs?service=git-receive-pack
→ POST /git-receive-pack (pack upload + update refs)
Scaling challenges:
1. Clone of large repos: stream pack data from storage; don't buffer in memory
2. Concurrent pushes: use distributed lock (Zookeeper/Redis) per ref to prevent conflicts
3. Fork networks: forks share pack objects; only store deltas from parent
4. Shallow clones: git clone --depth=1 → send only recent commits (GitHub default for Actions)
"""
import contextlib
from typing import List

import redis

class RefLockManager:
    """Distributed locking for git push operations: prevents concurrent ref conflicts."""

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    @contextlib.contextmanager
    def lock_refs(self, repo_id: str, refs: List[str], timeout: int = 30):
        """Acquire per-ref locks atomically. Release on exit."""
        # Sort keys so concurrent pushes acquire locks in the same order (avoids deadlock)
        lock_keys = sorted(f"reflock:{repo_id}:{ref}" for ref in refs)
        acquired = []
        try:
            for key in lock_keys:
                # NX = set only if absent; EX = expire so a crashed pusher can't wedge the ref
                if not self.redis.set(key, "1", nx=True, ex=timeout):
                    raise RuntimeError(f"Ref conflict: {key} is locked")
                acquired.append(key)
            yield
        finally:
            if acquired:
                self.redis.delete(*acquired)

class PushHandler:
    def handle_push(self, repo_id: str, push_data: dict, lock_mgr: RefLockManager):
        ref_updates = push_data["refs"]  # [{ref, old_oid, new_oid}, ...]
        refs = [u["ref"] for u in ref_updates]
        with lock_mgr.lock_refs(repo_id, refs):
            # 1. Validate objects exist in store
            # 2. Verify fast-forward (or force push permission)
            # 3. Run pre-receive hooks
            # 4. Update ref store atomically
            # 5. Trigger post-receive hooks (CI, notifications, webhooks)
            self._atomic_update_refs(repo_id, ref_updates)
            self._trigger_webhooks(repo_id, ref_updates)

    def _atomic_update_refs(self, repo_id: str, updates: list):
        pass  # Write to distributed ref database

    def _trigger_webhooks(self, repo_id: str, updates: list):
        pass  # Publish to event queue -> CI/CD, notifications
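Step 2 of the push flow (verify fast-forward) reduces to an ancestry check: the new tip must have the old tip as an ancestor, otherwise the push rewrites history and needs force-push permission. A minimal sketch, where `parents` is a stand-in for reading commit objects from the store:

```python
from collections import deque

def is_fast_forward(parents: dict, old_oid: str, new_oid: str) -> bool:
    """True if old_oid is reachable from new_oid via parent links.

    `parents` maps commit oid -> list of parent oids (hypothetical stand-in
    for looking up commit objects in the object store).
    """
    if old_oid == new_oid:
        return True
    queue, seen = deque([new_oid]), set()
    while queue:
        oid = queue.popleft()
        if oid == old_oid:
            return True
        if oid in seen:
            continue
        seen.add(oid)
        queue.extend(parents.get(oid, []))  # walk back through history
    return False

history = {"c3": ["c2"], "c2": ["c1"], "c1": []}
```

A production server bounds this walk (generation numbers, commit-graph files) rather than traversing unbounded history on every push.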
Pull Request and Code Review System
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class PRStatus(Enum):
    OPEN = "open"
    MERGED = "merged"
    CLOSED = "closed"
    DRAFT = "draft"

class ReviewState(Enum):
    APPROVED = "approved"
    CHANGES_REQUESTED = "changes_requested"
    COMMENTED = "commented"

@dataclass
class PullRequest:
    id: int
    repo_id: int
    author_id: int
    title: str
    base_branch: str  # e.g., "main"
    head_branch: str  # e.g., "feature/auth"
    base_sha: str  # Merge base commit
    head_sha: str  # Latest commit on head branch
    status: PRStatus = PRStatus.OPEN
    reviewers: List[int] = field(default_factory=list)
    labels: List[str] = field(default_factory=list)
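The `ReviewState` enum feeds the merge gate. A minimal branch-protection check, assuming a simple rule of N approvals and no outstanding changes-requested review (real branch protection is far more configurable), might look like:

```python
from enum import Enum

class ReviewState(Enum):  # as defined in the PR model above
    APPROVED = "approved"
    CHANGES_REQUESTED = "changes_requested"
    COMMENTED = "commented"

def is_mergeable(reviews: dict, required_approvals: int = 2) -> bool:
    """reviews maps reviewer_id -> ReviewState (latest review per reviewer wins)."""
    states = list(reviews.values())
    # Any outstanding changes-requested review blocks the merge outright
    if any(s == ReviewState.CHANGES_REQUESTED for s in states):
        return False
    approvals = sum(1 for s in states if s == ReviewState.APPROVED)
    return approvals >= required_approvals
```

Keeping only the latest review per reviewer matters: an earlier changes-requested review is superseded when that reviewer later approves.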
class DiffService:
    """Compute and cache diffs for PR display."""

    def __init__(self, git_backend, cache):
        self.git = git_backend
        self.cache = cache

    def get_diff(self, repo_id: int, base_sha: str, head_sha: str) -> dict:
        cache_key = f"diff:{repo_id}:{base_sha}:{head_sha}"
        cached = self.cache.get(cache_key)
        if cached:
            return cached
        # Compute diff server-side
        diff = self.git.diff(repo_id, base_sha, head_sha)
        # Cache computed diff (content-addressed: never changes for the same SHAs)
        self.cache.set(cache_key, diff)
        return diff
class MergeService:
    def merge_pr(self, pr: PullRequest, merge_strategy: str = "merge_commit") -> dict:
        """
        Merge strategies:
          merge_commit: creates an explicit merge commit (preserves history)
          squash: squashes all commits into a single commit (clean linear history)
          rebase: replays commits on top of base (linear, no merge commit)
        """
        strategies = {
            "merge_commit": self._merge_commit,
            "squash": self._squash_merge,
            "rebase": self._rebase_merge,
        }
        fn = strategies.get(merge_strategy, self._merge_commit)
        return fn(pr)

    def _merge_commit(self, pr: PullRequest) -> dict:
        # git merge --no-ff head_sha
        return {"strategy": "merge_commit", "pr_id": pr.id}

    def _squash_merge(self, pr: PullRequest) -> dict:
        # git merge --squash head_sha; git commit
        return {"strategy": "squash", "pr_id": pr.id}

    def _rebase_merge(self, pr: PullRequest) -> dict:
        # git rebase base_sha head_sha; fast-forward base
        return {"strategy": "rebase", "pr_id": pr.id}
Code Search at Scale
"""
GitHub code search challenges:
- 500M repositories, trillions of lines of code
- Must index content of ALL files (not just metadata)
- Search must understand code syntax (not just plain text)
- Updates: new pushes must appear in search within minutes
Architecture:
1. Push event → extract changed files → Kafka topic
2. Indexing workers: tokenize code, extract symbols, write to search engine
3. Zoekt (open-source trigram-based code search engine, used by Sourcegraph and GitLab) or Elasticsearch for serving
4. Shard by repository; replicate for availability
5. Language-aware tokenization (identifiers, strings, comments parsed separately)
"""
SEARCH_INDEX_MAPPING = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "code_tokenizer": {
                    "type": "pattern",
                    "pattern": r"[^a-zA-Z0-9_]",  # Split on non-identifier chars
                }
            },
            "analyzer": {
                "code_analyzer": {
                    "type": "custom",
                    "tokenizer": "code_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "repo_id": {"type": "keyword"},
            "repo_name": {"type": "keyword"},
            "file_path": {"type": "keyword"},
            "language": {"type": "keyword"},
            "content": {"type": "text", "analyzer": "code_analyzer"},
            "symbols": {"type": "keyword"},  # Function/class names for symbol search
            "indexed_at": {"type": "date"},
        }
    },
}
def search_code(es, query: str, language=None, repo=None, limit=20) -> dict:
    filters = []
    if language:
        filters.append({"term": {"language": language}})
    if repo:
        filters.append({"term": {"repo_name": repo}})
    return es.search(index="code", body={
        "query": {
            "bool": {
                "must": [{"match": {"content": {"query": query, "operator": "and"}}}],
                "filter": filters,
            }
        },
        "highlight": {"fields": {"content": {"number_of_fragments": 3}}},
        "size": limit,
    })
Architecture Overview
| Component | Technology | Reason |
|---|---|---|
| Git object store | Custom + Ceph/S3 | Content-addressed; deduplicated pack files |
| Ref database | MySQL + distributed lock | Atomic ref updates; strong consistency required |
| Code search | Zoekt / Elasticsearch | Code-aware tokenization; symbol search |
| Events/webhooks | Kafka | Decouple push events from CI, notifications |
| CI/CD orchestration | Custom (Actions) | Job queue → ephemeral runner VMs |
| Web API | Ruby on Rails / Go | Rails for rapid dev; Go services for hot paths |
| CDN | Fastly | Raw file serving, asset delivery |
Handling Monorepos
- Problem: Monorepos (50GB+) make clone impractical and diff computation expensive
- Sparse checkout: `git sparse-checkout` fetches only the required directories
- Partial clone: `git clone --filter=blob:none` skips blobs until they are accessed
- Virtual filesystem (VFSforGit): Microsoft's solution, a virtual filesystem that fetches objects on demand
- CODEOWNERS: assign reviewers automatically per directory — scale code review
- Merge queues: serialize PRs targeting main to prevent merge conflicts under high PR volume
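The partial-clone and sparse-checkout techniques above combine into one workflow for working inside a monorepo; `<repo-url>` and the directory names below are placeholders:

```shell
# Clone commit and tree metadata up front, defer blob downloads, then
# materialize only the directories this developer actually works in.
git clone --filter=blob:none --no-checkout <repo-url> mono
cd mono
git sparse-checkout init --cone
git sparse-checkout set services/payments libs/common   # placeholder paths
git checkout main
```

Blobs outside the chosen directories are fetched lazily on first access, so the initial clone of a 50 GB monorepo transfers only a small fraction of the data.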