Question 1

How do you compute and cache diffs efficiently for a code review platform?

Accepted Answer

Diff computation typically uses the Myers diff algorithm, which runs in O(ND) time, where N is the combined length of the two file versions in lines and D is the size of the shortest edit script. For a PR with 100 changed files, compute the per-file diffs in parallel across workers. Each diff is deterministic: given the same (base_sha, head_sha, file_path), the result is always identical, so cache aggressively. Use key = "diff:{base_sha}:{head_sha}:{file_path}" and store the result in Redis or a blob store like S3. No TTL is needed for correctness, because a commit SHA always identifies the same content; evict only to reclaim space. The first viewer of a PR pays the computation cost and populates the cache; subsequent loads are served from cache, so the hit rate approaches 100% after the first viewer. For very large diffs (10K+ lines), split the diff into chunks and stream them to the client progressively. Compute a diff summary (number of additions and deletions per file) separately from the full diff: the summary is small, feeds the file tree sidebar, and can be shown before any full diff is rendered. To keep the first load cheap, compute the full diff lazily, only for files the user actually expands.
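The caching scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: a plain dict stands in for Redis or S3, the `load_file` callback is a hypothetical way to fetch a file's lines at a given commit, and Python's `difflib` stands in for a Myers implementation (it uses a different matching algorithm, but the caching logic is the same).

```python
import difflib
from typing import Callable, Dict, List, Tuple

# In-memory stand-in for Redis or a blob store (assumption for this sketch).
_diff_cache: Dict[str, List[str]] = {}

# Hypothetical loader: (commit_sha, file_path) -> file content as a list of lines.
FileLoader = Callable[[str, str], List[str]]


def diff_cache_key(base_sha: str, head_sha: str, file_path: str) -> str:
    """Build the cache key from the three inputs that determine the diff."""
    return f"diff:{base_sha}:{head_sha}:{file_path}"


def get_diff(base_sha: str, head_sha: str, file_path: str,
             load_file: FileLoader) -> List[str]:
    """Return the unified diff for one file, computing it only on a cache miss."""
    key = diff_cache_key(base_sha, head_sha, file_path)
    cached = _diff_cache.get(key)
    if cached is not None:
        return cached
    old_lines = load_file(base_sha, file_path)
    new_lines = load_file(head_sha, file_path)
    diff = list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile=f"a/{file_path}", tofile=f"b/{file_path}",
        lineterm="",
    ))
    # No TTL: the key is derived from immutable SHAs, so the entry never goes stale.
    _diff_cache[key] = diff
    return diff


def summarize(diff: List[str]) -> Tuple[int, int]:
    """Count (additions, deletions), skipping the two file header lines."""
    body = diff[2:]
    additions = sum(1 for line in body if line.startswith("+"))
    deletions = sum(1 for line in body if line.startswith("-"))
    return additions, deletions
```

In this sketch the summary is derived from the full diff for brevity. A real system would likely store the per-file counts under their own key so the sidebar can render without loading any full diff.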

Question 2

How do review comments stay anchored to the right lines when new commits are pushed?

Accepted Answer

When a reviewer leaves a comment on line 42 of file.py, and the author pushes a new commit that adds 10 lines before line 42, the comment should appear on the new line 52, not line 42. Implementation: store each comment with its diff_hunk (3-5 lines of context surrounding the commented line). When rendering the PR with the new commit, re-compute the diff between the commit the comment was made on and the new head, then use the diff_hunk as a fingerprint to find the comment's new position via fuzzy matching. If the hunk is found, display the comment at its new line. If the hunk is not found (the code was modified or deleted), mark the comment as "outdated" and display it in a collapsed section. GitHub handles this in a similar way: its review comment API exposes a diff_hunk field, and comments on lines that have since changed are shown as outdated. The diff_hunk must be stored at comment creation time, because it cannot be recomputed later if the original commit is garbage collected. Storing line_number alone is insufficient; it becomes stale as soon as lines are added or removed above it.
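The re-anchoring step above could look roughly like the following. It is a sketch, not GitHub's actual algorithm: the `Comment` fields, the sliding-window search, and the 0.8 similarity threshold are all assumptions chosen for illustration, and the fuzzy match uses `difflib.SequenceMatcher` from the Python standard library.

```python
import difflib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Comment:
    body: str
    original_line: int   # 1-based line number when the comment was created
    hunk: List[str]      # context lines captured at creation time
    hunk_offset: int     # index of the commented line inside `hunk`


def capture_hunk(lines: List[str], line_no: int,
                 context: int = 2) -> Tuple[List[str], int]:
    """Capture the commented line plus `context` lines on each side."""
    idx = line_no - 1
    start = max(0, idx - context)
    end = min(len(lines), idx + context + 1)
    return lines[start:end], idx - start


def relocate(comment: Comment, new_lines: List[str],
             threshold: float = 0.8) -> Optional[int]:
    """Return the comment's new 1-based line, or None if it is outdated."""
    size = len(comment.hunk)
    best_ratio, best_start = 0.0, None
    # Slide a window over the new file and keep the most similar position.
    for start in range(len(new_lines) - size + 1):
        window = new_lines[start:start + size]
        ratio = difflib.SequenceMatcher(None, comment.hunk, window).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    if best_start is None or best_ratio < threshold:
        return None
    # Context may drift a little, but the commented line itself must survive.
    target = comment.hunk[comment.hunk_offset]
    if new_lines[best_start + comment.hunk_offset] != target:
        return None
    return best_start + comment.hunk_offset + 1
```

When the same hunk appears more than once in the file, this sketch picks the first best match. A fuller version could break ties by preferring the position closest to where the recomputed diff says the original line moved.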

Question 3

How does a merge queue prevent the "works individually but breaks together" problem?

Accepted Answer

Without a merge queue: PR A passes CI, PR B passes CI, both merge to main, and together they break main because they contain semantically conflicting changes that do not produce a textual merge conflict (test A passes with A's code, test B passes with B's code, and neither test covers the interaction). A merge queue solves this by testing PRs in combination before merging. When a PR is approved and ready to merge, it enters the queue. The queue processor: (1) takes the next batch of PRs from the queue, (2) creates a temporary "draft merge" commit that applies all PRs in the batch on top of current main, (3) runs the full CI suite on this combined commit, (4) if green, merges all PRs in the batch atomically; if red, identifies the failing PR by bisecting the batch, removes it from the queue, and re-runs CI on the remaining PRs. GitHub's merge queue works along these lines. Trade-off: the queue consumes more CI capacity (it tests combinations in addition to individual PRs), but every commit that lands on main has already passed CI in the exact combination it merges in, which keeps main green. Essential for large teams with high PR velocity.
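The batch-and-bisect loop in steps (1) to (4) can be sketched as follows. The `ci_passes` callback is a hypothetical stand-in for "build a draft merge of these PRs on top of main and run CI", and the sketch assumes that current main is green, that CI is deterministic, and that once a prefix of the batch fails, every longer prefix fails too.

```python
from typing import Callable, List, Tuple

# Hypothetical CI hook: True if main plus these PRs (in order) passes CI.
CiCheck = Callable[[List[str]], bool]


def find_first_failing(batch: List[str], ci_passes: CiCheck) -> int:
    """Binary-search for the PR whose addition first turns the batch red.

    Invariant: batch[:lo] passes CI and batch[:hi] fails CI.
    """
    lo, hi = 0, len(batch)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ci_passes(batch[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1


def process_batch(batch: List[str],
                  ci_passes: CiCheck) -> Tuple[List[str], List[str]]:
    """Return (merged, rejected) for one batch taken from the queue."""
    remaining = list(batch)
    rejected: List[str] = []
    # Re-test after every removal until the combined commit is green.
    while remaining and not ci_passes(remaining):
        bad = find_first_failing(remaining, ci_passes)
        rejected.append(remaining.pop(bad))
    return remaining, rejected


if __name__ == "__main__":
    # PRs "B" and "D" each pass alone but break when combined.
    def ci(prs: List[str]) -> bool:
        return not ("B" in prs and "D" in prs)

    print(process_batch(["A", "B", "C", "D"], ci))
```

Bisecting costs roughly log2(n) extra CI runs per bad PR instead of n, which is why batching stays affordable even when a failure occasionally ejects a PR.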

System Design Interview: Design a Code Review and Pull Request Platform

What Is a Code Review Platform?

System Requirements

Functional

Non-Functional

Core Data Model

Diff Computation

Storing Comments with Diff Context

CI Check Integration

Merge Operations

Merge Queue

Notifications and Review Requests

Interview Tips