Core Entities
PullRequest:
  pr_id, repo_id, author_id, title, description
  source_branch, target_branch
  status: OPEN | MERGED | CLOSED | DRAFT
  created_at, updated_at, merged_at

ReviewComment:
  comment_id, pr_id, reviewer_id
  file_path, line_number, commit_sha
  body, comment_type: LINE | FILE | GENERAL
  status: OPEN | RESOLVED
  parent_comment_id (for threads)
  created_at

Review:
  review_id, pr_id, reviewer_id
  status: APPROVED | CHANGES_REQUESTED | COMMENTED
  submitted_at

CICheck:
  check_id, pr_id, name, status: PENDING | RUNNING | PASSED | FAILED
  url, details_url, created_at, completed_at
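The schema above can be sketched as Python dataclasses; field names come from the entity list, while the types and defaults are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class PRStatus(Enum):
    OPEN = "OPEN"
    MERGED = "MERGED"
    CLOSED = "CLOSED"
    DRAFT = "DRAFT"


@dataclass
class PullRequest:
    pr_id: str
    repo_id: str
    author_id: str
    title: str
    description: str
    source_branch: str
    target_branch: str
    status: PRStatus = PRStatus.OPEN
    created_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None
    merged_at: Optional[datetime] = None  # set only once status becomes MERGED


@dataclass
class ReviewComment:
    comment_id: str
    pr_id: str
    reviewer_id: str
    file_path: str
    line_number: int
    commit_sha: str  # the commit the comment is anchored to
    body: str
    comment_type: str = "LINE"   # LINE | FILE | GENERAL
    status: str = "OPEN"         # OPEN | RESOLVED
    parent_comment_id: Optional[str] = None  # for threads
    created_at: Optional[datetime] = None
```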
Diff Computation
Diffs are computed server-side using the Myers diff algorithm (the same one git uses), and the output is stored in unified diff format.
Design decisions:
- Lazy computation: diffs are computed only when the PR is first viewed, not on push. A background job can pre-warm for active PRs.
- Cache key: (base_commit_sha, head_commit_sha, file_path). Store in Redis with a TTL, or in object storage (S3) for persistence.
- Large file handling: skip binary files and truncate diffs beyond a configurable line limit (e.g., 10,000 lines). Show a warning and link to the raw diff.
def get_diff(base_commit, head_commit, file_path):
    cache_key = f"diff:{base_commit}:{head_commit}:{file_path}"
    cached = redis.get(cache_key)
    if cached:
        return cached
    diff = compute_myers_diff(
        get_file_at(base_commit, file_path),
        get_file_at(head_commit, file_path),
    )
    # redis-py signature is setex(name, time, value)
    redis.setex(cache_key, 3600, diff)
    return diff
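For a concrete look at the output format, Python's standard-library difflib can emit unified diffs (difflib uses its own matching algorithm rather than Myers, but the output shape is identical):

```python
import difflib

old = ["def add(a, b):\n", "    return a + b\n"]
new = ["def add(a, b):\n", "    # guard against None\n", "    return a + b\n"]

# Produces headers, @@ hunk markers, and +/-/context lines
diff = "".join(difflib.unified_diff(old, new,
                                    fromfile="base/add.py",
                                    tofile="head/add.py"))
print(diff)
```

The `@@ -1,2 +1,3 @@` hunk header in the output encodes (old_start, old_count) and (new_start, new_count), which is exactly the information the comment-mapping logic below walks over.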
Inline Comment Threading
Comments are anchored to (file_path, line_number, commit_sha). The challenge: when new commits are pushed, previously commented lines may shift or change.
Line position mapping: when a new commit arrives, compute a diff between the comment’s anchor commit and the new HEAD. Walk the diff hunks to map old line numbers to new positions. If the line was deleted or changed significantly, mark the comment as OUTDATED.
def map_comment_position(comment, new_commit):
    diff = get_diff(comment.commit_sha, new_commit, comment.file_path)
    shift = 0  # net lines added/removed by hunks above the comment
    for hunk in parse_hunks(diff):
        if comment.line_number < hunk.old_start:
            break  # hunks are ordered; later hunks can't affect this line
        # A hunk covers old lines [old_start, old_start + old_count)
        if comment.line_number < hunk.old_start + hunk.old_count:
            offset = comment.line_number - hunk.old_start
            if offset < hunk.new_count:
                return hunk.new_start + offset  # line still exists
            return None  # line was removed - mark OUTDATED
        shift += hunk.new_count - hunk.old_count
    # Unchanged region: still shifted by lines added/removed above it
    return comment.line_number + shift
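The parse_hunks helper used above is assumed; a minimal version can pull the hunk coordinates out of the @@ headers of a unified diff with a regex:

```python
import re
from dataclasses import dataclass

# Matches "@@ -old_start,old_count +new_start,new_count @@"; the count is
# omitted in the header when it equals 1.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass
class Hunk:
    old_start: int
    old_count: int
    new_start: int
    new_count: int


def parse_hunks(diff_text):
    hunks = []
    for line in diff_text.splitlines():
        m = HUNK_HEADER.match(line)
        if m:
            hunks.append(Hunk(
                old_start=int(m.group(1)),
                old_count=int(m.group(2) or 1),  # default when count is omitted
                new_start=int(m.group(3)),
                new_count=int(m.group(4) or 1),
            ))
    return hunks
```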
GitHub calls these “outdated” comments and grays them out in the UI while keeping them accessible.
Review State Machine and Merge Eligibility
A PR can only be merged when all conditions are met:
- All required reviewers have submitted APPROVED reviews
- No reviewer has an active CHANGES_REQUESTED review
- All CI checks are in PASSED state
- No review comments still in OPEN (unresolved) state
def is_merge_eligible(pr_id):
    # Check required approvals (from branch protection rules)
    required = get_required_reviewers(pr_id)
    approved = {
        r['reviewer_id'] for r in
        db.query("SELECT reviewer_id FROM reviews WHERE pr_id=? AND status='APPROVED'", pr_id)
    }
    if not required.issubset(approved):
        return False, "Missing required approvals"
    # Check no CHANGES_REQUESTED. (In practice, only each reviewer's latest
    # review should count; a later APPROVED supersedes their earlier request.)
    blocking = db.query(
        "SELECT COUNT(*) FROM reviews WHERE pr_id=? AND status='CHANGES_REQUESTED'", pr_id
    )
    if blocking[0][0] > 0:
        return False, "Changes requested"
    # All CI checks passed (PENDING and RUNNING also block the merge)
    failing_checks = db.query(
        "SELECT COUNT(*) FROM ci_checks WHERE pr_id=? AND status != 'PASSED'", pr_id
    )
    if failing_checks[0][0] > 0:
        return False, "CI checks not all passed"
    # No open comments
    open_comments = db.query(
        "SELECT COUNT(*) FROM review_comments WHERE pr_id=? AND status='OPEN'", pr_id
    )
    if open_comments[0][0] > 0:
        return False, "Unresolved comments"
    return True, "Eligible"
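The PR status field itself also forms a small state machine. A sketch of the legal transitions, assuming MERGED is terminal and CLOSED PRs can be reopened (the exact transition set is a design choice, not specified above):

```python
# Allowed PR status transitions (assumed)
PR_TRANSITIONS = {
    "DRAFT":  {"OPEN", "CLOSED"},
    "OPEN":   {"DRAFT", "MERGED", "CLOSED"},
    "CLOSED": {"OPEN"},   # reopen
    "MERGED": set(),      # terminal: merged PRs cannot change state
}


def can_transition(current, target):
    return target in PR_TRANSITIONS.get(current, set())
```

The merge endpoint would combine both checks: can_transition(pr.status, "MERGED") and is_merge_eligible(pr_id).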
CI Integration via Webhooks
Flow for CI integration:
- Outbound: when a PR is created or a new commit is pushed, our system sends a POST to the CI provider (GitHub Actions, Jenkins, CircleCI) with {pr_id, repo, head_commit_sha, clone_url}.
- Inbound: CI posts status updates back to our webhook endpoint as checks start, progress, and complete.
- Storage: each update upserts a CICheck record keyed by (pr_id, check_name).
- Real-time UI: the PR page subscribes via SSE or WebSocket. When a CICheck row updates, push the new status to all subscribers for that PR.
POST /webhooks/ci
{
  "pr_id": "pr_456",
  "check_name": "unit-tests",
  "status": "PASSED",
  "details_url": "https://ci.example.com/builds/789",
  "completed_at": "2025-01-15T10:30:00Z"
}
# Handler:
def handle_ci_webhook(payload):
    upsert_ci_check(
        pr_id=payload['pr_id'],
        name=payload['check_name'],
        status=payload['status'],
        url=payload['details_url'],
        # completed_at is absent while the check is still PENDING/RUNNING
        completed_at=payload.get('completed_at'),
    )
    publish_event(f"pr:{payload['pr_id']}", {
        "type": "ci_update",
        "check": payload['check_name'],
        "status": payload['status'],
    })
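The publish_event call above is assumed; a minimal in-memory pub/sub, standing in for Redis pub/sub or a WebSocket/SSE connection registry, might look like:

```python
from collections import defaultdict

# channel -> list of subscriber callbacks; in production each callback
# would write to an open SSE or WebSocket connection for that PR page
_subscribers = defaultdict(list)


def subscribe(channel, callback):
    _subscribers[channel].append(callback)


def publish_event(channel, event):
    for callback in _subscribers[channel]:
        callback(event)
```

A PR page connection would call subscribe(f"pr:{pr_id}", send_to_client) on open, so every CICheck update fans out to all viewers of that PR.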
Scale Considerations
- Diff storage: for monorepos with thousands of files per PR, store per-file diffs in S3, not in the DB. DB stores only metadata and references.
- Notification fan-out: use a message queue (Kafka/SQS) for CI events to decouple the webhook receiver from downstream processing.
- Read replicas: PR list views and diff rendering are read-heavy. Use read replicas and aggressive caching.
- Access control: enforce repo-level permissions on every API call; comment reads, diff fetches, and merge actions all require authorization checks.