System Design Interview: Design a Code Review and Pull Request Platform

What Is a Code Review Platform?

A code review platform allows developers to propose code changes (pull requests), receive line-by-line feedback from peers, run automated checks, and merge into the main branch. Examples: GitHub Pull Requests, GitLab Merge Requests, Gerrit. Core challenges: diff computation at scale, comment threading, CI integration, merge conflict detection, and real-time collaboration on reviews.

System Requirements

    Functional

    • Create pull requests from a feature branch to a base branch
    • Display unified and split diffs of changed files
    • Comment on specific lines; reply to comments (threads)
    • Request reviews from specific users
    • CI status checks (tests, linting) required before merge
    • Merge the PR (merge commit, squash, rebase)

    Non-Functional

    • 100M repositories; 10M PRs created per day
    • Diff computation for large PRs (10K line changes) in <2s
    • Review comments searchable and persistent

    Core Data Model

    repositories: id, owner_id, name, default_branch, visibility
    pull_requests: id, repo_id, author_id, title, body, head_sha, base_sha,
                   base_branch, status, created_at, merged_at
    pr_reviews: id, pr_id, reviewer_id, state(approved/changes_requested/commented)
    review_comments: id, pr_id, reviewer_id, commit_sha, file_path, line_number,
                     diff_hunk, body, thread_id, created_at
    ci_checks: id, pr_id, check_name, status, url, started_at, completed_at
    

    Diff Computation

    Computing a diff between two commits (head_sha vs base_sha) involves:

    1. Retrieve the file trees for both commits from Git object store
    2. Find changed files: files present in one tree but not the other, or with different blob hashes
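    3. For each changed file: run the Myers diff algorithm to produce an edit script, rendered as a unified diff. It runs in O(ND) time, where N is the combined line count of the two versions and D is the size of the shortest edit script.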
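
    The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a production implementation: trees are modelled as flat dicts from path to blob hash (a real Git tree is nested), and the names changed_files and myers_edit_distance are hypothetical. The second function is the greedy forward pass of the Myers algorithm and returns only D, the length of the shortest edit script, not the rendered diff.

```python
from typing import Dict, List, Sequence


def changed_files(base_tree: Dict[str, str], head_tree: Dict[str, str]) -> List[str]:
    """Paths that were added, deleted, or whose blob hash differs."""
    paths = set(base_tree) | set(head_tree)
    # .get() returns None for a missing path, so adds and deletes compare unequal.
    return sorted(p for p in paths if base_tree.get(p) != head_tree.get(p))


def myers_edit_distance(a: Sequence, b: Sequence) -> int:
    """Length D of the shortest insert/delete script turning a into b."""
    n, m = len(a), len(b)
    # v[k] holds the furthest x reached so far on diagonal k (where k = x - y).
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]          # step down: insert from b
            else:
                x = v[k - 1] + 1      # step right: delete from a
            y = x - k
            # Follow the "snake": consume equal elements for free.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m
```

    Because the outer loop stops at the first d that reaches the end of both inputs, the work grows with the size of the change rather than the size of the file, which is why small edits to large files stay cheap.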

    For large PRs with many files, compute per-file diffs in parallel across workers. Cache each computed diff keyed by (base_sha, head_sha, file_path): commits are immutable, so the diff for a given key never changes and the entry never needs invalidation. Serve from cache on subsequent page loads.
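
    A minimal sketch of the caching and fan-out described above. Several things here are assumptions for illustration: the DiffCache class and its load_lines callback are hypothetical, an in-memory dict stands in for a shared cache, threads stand in for separate diff workers, and Python's difflib is used for convenience (its matcher is not the Myers algorithm).

```python
import difflib
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

DiffKey = Tuple[str, str, str]  # (base_sha, head_sha, file_path)


class DiffCache:
    def __init__(self, load_lines: Callable[[str, str], List[str]]):
        # load_lines(sha, path) returns the file's lines at that commit (hypothetical loader).
        self._load_lines = load_lines
        self._cache: Dict[DiffKey, str] = {}
        self._lock = threading.Lock()
        self.computed = 0  # number of cache misses, for observation only

    def get(self, base_sha: str, head_sha: str, path: str) -> str:
        key = (base_sha, head_sha, path)
        with self._lock:
            if key in self._cache:
                return self._cache[key]
        old = self._load_lines(base_sha, path)
        new = self._load_lines(head_sha, path)
        diff = "\n".join(difflib.unified_diff(
            old, new, fromfile=f"a/{path}", tofile=f"b/{path}", lineterm=""))
        with self._lock:
            # Entries are never invalidated: the same key always yields the same diff.
            if key not in self._cache:
                self._cache[key] = diff
                self.computed += 1
            return self._cache[key]

    def get_many(self, base_sha: str, head_sha: str, paths: List[str]) -> Dict[str, str]:
        # Fan out one task per file; results come back in input order.
        with ThreadPoolExecutor(max_workers=8) as pool:
            diffs = pool.map(lambda p: self.get(base_sha, head_sha, p), paths)
            return dict(zip(paths, diffs))
```

    The first viewer of a PR pays for the computation; a second call to get_many with the same SHAs returns entirely from the cache.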

    Storing Comments with Diff Context

    A review comment is anchored to a specific line in a specific file at a specific commit. The challenge: if the author pushes a new commit, lines shift, so a stored line number alone goes stale. Store the diff_hunk (the surrounding context lines) with each comment. When rendering the comment on the new diff, search for the diff_hunk in the new version of the file to find the comment's current position. If the hunk cannot be found (the code was changed or deleted), mark the comment as outdated.
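
    The re-anchoring step might look like the sketch below. It assumes exact matching of the stored context lines and breaks ties by choosing the match closest to the original position; the function name relocate_comment and its parameters are hypothetical, and a real system might tolerate small differences in the hunk.

```python
from typing import List, Optional


def relocate_comment(hunk: List[str], offset: int, old_line: int,
                     new_file: List[str]) -> Optional[int]:
    """Return the comment's new 1-based line number, or None if it is outdated.

    hunk     -- context lines stored with the comment (the diff_hunk)
    offset   -- index within hunk of the commented line
    old_line -- 1-based line number when the comment was written
    """
    size = len(hunk)
    # Every start index in the new file where the whole hunk appears verbatim.
    matches = [
        start for start in range(len(new_file) - size + 1)
        if new_file[start:start + size] == hunk
    ]
    if not matches:
        return None  # the surrounding code was changed or deleted
    old_start = old_line - 1 - offset
    # If the hunk appears more than once, prefer the match nearest the old position.
    best = min(matches, key=lambda start: abs(start - old_start))
    return best + offset + 1
```

    For a comment on line 42 with ten new lines pushed above it, the function returns 52; if the commented lines were removed, it returns None and the comment is shown as outdated.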

    CI Check Integration

    When a PR is created or updated (new commit pushed): publish a “pr_updated” event to Kafka. CI runners consume this event and start the configured checks (build, test, lint). Each check reports status via a webhook: POST /repos/{repo}/statuses/{sha}. The platform updates ci_checks table and re-evaluates merge eligibility (all required checks must be passing). Required checks are configured per branch protection rules.
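
    Merge eligibility can be re-evaluated with a few lines of logic each time a status arrives. The sketch below is illustrative only: the CheckStore class, the status strings ("pending", "success", "failure"), and keying statuses by commit SHA are assumptions, with an in-memory dict standing in for the ci_checks table.

```python
from typing import Dict, Iterable, Tuple

VALID_STATUSES = {"pending", "success", "failure"}  # assumed status vocabulary


class CheckStore:
    def __init__(self) -> None:
        # (commit sha, check name) -> latest reported status
        self._status: Dict[Tuple[str, str], str] = {}

    def report(self, sha: str, check_name: str, status: str) -> None:
        """Record a status reported by a CI runner for one commit."""
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {status}")
        self._status[(sha, check_name)] = status

    def is_mergeable(self, head_sha: str, required: Iterable[str]) -> bool:
        """True only if every required check succeeded on the current head."""
        return all(self._status.get((head_sha, name)) == "success" for name in required)
```

    Keying by SHA means a newly pushed commit starts with no passing checks, so results from an older head cannot make the PR mergeable.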

    Merge Operations

    • Merge commit: creates a merge commit preserving full history. git merge --no-ff.
    • Squash merge: combines all PR commits into one. Cleaner history for feature branches with many “WIP” commits.
    • Rebase merge: replays PR commits on top of base branch. Linear history, but SHA changes (commits are re-created).

    On merge: acquire a distributed lock on the target branch of the repository, so that two merges cannot update the same branch pointer concurrently. Check for conflicts with a three-way merge of the base branch head and the PR head against their merge base. If clean: perform the merge, update the branch pointer, and optionally delete the feature branch. Release the lock.
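
    The lock-then-merge sequence can be sketched as below. This is a simplified model under several assumptions: trees are flat dicts from path to blob hash, an in-process threading.Lock stands in for the distributed lock, and a file changed differently on both sides is reported as a conflict (real Git would first attempt a line-level merge inside that file). The function names are hypothetical.

```python
import threading
from collections import defaultdict
from typing import Dict, List, Tuple

Tree = Dict[str, str]  # path -> blob hash

# One lock per (repo_id, branch); an in-process stand-in for a distributed lock.
_locks = defaultdict(threading.Lock)


def three_way_merge(base: Tree, ours: Tree, theirs: Tree) -> Tuple[Tree, List[str]]:
    """File-level three-way merge; returns (merged tree, conflicting paths)."""
    merged: Tree = {}
    conflicts: List[str] = []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            pick = o            # both sides agree
        elif o == b:
            pick = t            # only theirs changed
        elif t == b:
            pick = o            # only ours changed
        else:
            conflicts.append(path)
            continue
        if pick is not None:    # None means the file was deleted
            merged[path] = pick
    return merged, conflicts


def merge_pr(branches: Dict[str, Tree], repo_id: str, base_branch: str,
             merge_base: Tree, pr_head: Tree) -> List[str]:
    """Merge under the branch lock; returns conflicting paths (empty on success)."""
    with _locks[(repo_id, base_branch)]:
        merged, conflicts = three_way_merge(merge_base, branches[base_branch], pr_head)
        if not conflicts:
            branches[base_branch] = merged  # move the branch pointer
        return conflicts
```

    Holding the lock across both the conflict check and the pointer update is the point of the design: it prevents a second merge from landing between the two steps.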

    Merge Queue

    When many PRs target the same branch simultaneously, each one must be tested against the others’ changes to prevent breakage. A merge queue: approved PRs enter the queue; the system batches them, runs CI on the combined batch, and merges atomically if green. This avoids the “works individually, breaks together” problem. GitHub’s merge queue uses this model.
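
    One way to process a batch is sketched below. It is not a description of any particular product's implementation: the process_batch function and the run_ci callback are hypothetical, and the sketch assumes that when a combined run fails, the batch is split in half and each half is retested on top of whatever has already been accepted.

```python
from typing import Callable, List, Tuple


def process_batch(batch: List[str],
                  run_ci: Callable[[List[str]], bool]) -> Tuple[List[str], List[str]]:
    """Return (merged, rejected) PR ids for one batch.

    run_ci(prs) is assumed to build the base branch plus all given PRs
    and report whether the full suite passes.
    """
    merged: List[str] = []
    rejected: List[str] = []

    def attempt(candidates: List[str]) -> None:
        if not candidates:
            return
        # Always test candidates together with everything accepted so far.
        if run_ci(merged + candidates):
            merged.extend(candidates)
        elif len(candidates) == 1:
            rejected.extend(candidates)  # isolated the failing PR
        else:
            mid = len(candidates) // 2
            attempt(candidates[:mid])
            attempt(candidates[mid:])

    attempt(list(batch))
    return merged, rejected
```

    In the common case the whole batch passes with a single CI run; the extra runs are only spent when something in the batch fails.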

    Notifications and Review Requests

    When a reviewer is requested: notify via email and in-app notification. When a review is submitted: notify the PR author. Track unread review comments per user. Use CODEOWNERS file to automatically request reviews from the team owning changed files.
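
    Resolving owners for a set of changed files can be sketched as follows. The pattern matching is deliberately simplified: it uses Python's fnmatchcase rather than the gitignore-style syntax that real CODEOWNERS files use, and it assumes the last matching rule wins. The function names and the example owners are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Set, Tuple

Rule = Tuple[str, List[str]]  # (path pattern, owners), in file order


def owners_for(path: str, rules: List[Rule]) -> List[str]:
    """Owners of one path; later rules override earlier ones."""
    owners: List[str] = []
    for pattern, rule_owners in rules:
        if fnmatchcase(path, pattern):
            owners = rule_owners
    return owners


def reviewers_to_request(changed: Iterable[str], rules: List[Rule], author: str) -> Set[str]:
    """Union of owners across all changed files, excluding the PR author."""
    requested: Set[str] = set()
    for path in changed:
        requested.update(owners_for(path, rules))
    requested.discard(author)  # authors do not review their own PR
    return requested
```

    With rules for "*", "docs/*", and "*.py" in that order, a change touching docs/guide.md and src/app.py requests the docs owners and the Python owners, but not the catch-all owners.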

    Interview Tips

    • Diff caching by (base_sha, head_sha, file_path) is the key performance optimization.
    • Comment anchoring to diff_hunk handles the “lines shift after new commits” problem.
    • Merge lock prevents concurrent merges to the same branch.
    • CI as a webhook integration keeps the PR platform decoupled from CI systems.

    Frequently Asked Questions

    How do you compute and cache diffs efficiently for a code review platform?

    Diff computation uses the Myers diff algorithm, which runs in O(ND) time, where N is the combined line count of the two versions and D is the size of the shortest edit script. For a PR with 100 changed files, compute the per-file diffs in parallel across workers. Each diff is deterministic: given the same (base_sha, head_sha, file_path), the result is always identical. Cache aggressively with key = "diff:{base_sha}:{head_sha}:{file_path}", stored in Redis or a blob store like S3, with no expiry (a SHA always identifies the same content). On the first load of a PR, compute and cache the diffs; on subsequent loads, serve from cache, so the hit rate approaches 100% after the first viewer. For very large diffs (10K+ lines), split the diff into chunks and stream it to the client progressively. Compute a diff summary (additions and deletions per file) separately from the full diff: the summary populates the file tree sidebar and loads quickly, and the full diff for a large file can be computed only when the user expands it.

    How do review comments stay anchored to the right lines when new commits are pushed?

    When a reviewer leaves a comment on line 42 of file.py and the author pushes a new commit that adds 10 lines before line 42, the comment should appear on the new line 52, not line 42. Implementation: store each comment with its diff_hunk (3-5 lines of context surrounding the commented line). When rendering the PR at the new commit, recompute the diff between the originally commented commit and the new head, and use the diff_hunk as a fingerprint to find the comment's new position via fuzzy matching. If the hunk is found, display the comment at its new line. If the hunk is not found (the code was modified or deleted), mark the comment as "outdated" and display it in a collapsed section. GitHub handles outdated comments in a similar way. The diff_hunk must be stored at comment creation time, because it cannot be recomputed later if the original commit is garbage collected. Storing line_number alone is insufficient; it becomes stale after any surrounding change.

    How does a merge queue prevent the "works individually but breaks together" problem?

    Without a merge queue, PR A passes CI, PR B passes CI, and both merge to main, but together they break main because they contain semantically conflicting changes that do not produce a textual merge conflict (A's tests pass with A's code, B's tests pass with B's code, and neither covers the interaction). A merge queue solves this by testing PRs in combination before merging. When a PR is approved and ready to merge, it enters the queue. The queue processor: (1) takes the next batch of PRs from the queue, (2) creates a temporary "draft merge" commit that applies all PRs in the batch on top of current main, (3) runs the full CI suite on this combined commit, (4) if green, merges all PRs in the batch atomically; if red, identifies the failing PR by bisecting the batch, removes it from the queue, and re-runs CI on the remaining PRs. GitHub's merge queue uses this model. Trade-off: testing combinations consumes more CI capacity, but main stays green. This is essential for large teams with high PR velocity.
