Question 1

How do you store and query nested (threaded) comments efficiently?

Accepted Answer

There are three common data models for hierarchical comments.

(1) Adjacency list (a parent_comment_id column): the simplest model. Fetching a full thread requires a recursive query (WITH RECURSIVE) that runs one iteration per level, or O(depth) separate queries if recursion is done in the application. Good for shallow trees (2-3 levels).

(2) Path enumeration (LTREE in PostgreSQL): each comment stores its full path from the root, e.g. "root_id.child_id.grandchild_id". Fetch a subtree with path ~ 'root_id.*' in a single indexed query. Inserting requires knowing the parent's path, and the path string grows with depth.

(3) Closure table: a separate table with one (ancestor_id, descendant_id, depth) row for every ancestor-descendant pair, including each comment paired with itself at depth 0. Fetching all descendants: SELECT descendant_id FROM closure WHERE ancestor_id = X. Inserting a new comment at depth D (root = depth 0) inserts D+1 rows. Best for arbitrary depth with frequent subtree queries.

For deeply threaded comment systems (Reddit-style), path enumeration or a closure table is usually preferred. Limit the maximum nesting depth (e.g., 8 levels) to prevent degenerate trees and simplify rendering.
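As a rough illustration of the closure table model, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the function names (create_closure_schema, add_comment, descendants), and the choice of SQLite are assumptions made for the example; a production system would more likely run the same SQL on PostgreSQL.

```python
import sqlite3


def create_closure_schema(conn):
    """Create a hypothetical Comment table plus its closure table."""
    conn.executescript("""
        CREATE TABLE Comment (
            comment_id INTEGER PRIMARY KEY,
            body       TEXT NOT NULL
        );
        CREATE TABLE closure (
            ancestor_id   INTEGER NOT NULL,
            descendant_id INTEGER NOT NULL,
            depth         INTEGER NOT NULL,
            PRIMARY KEY (ancestor_id, descendant_id)
        );
    """)


def add_comment(conn, body, parent_id=None):
    """Insert a comment and its closure rows; return the new comment_id."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute("INSERT INTO Comment (body) VALUES (?)", (body,))
        new_id = cur.lastrowid
        # Every comment is its own ancestor at depth 0.
        conn.execute(
            "INSERT INTO closure (ancestor_id, descendant_id, depth) VALUES (?, ?, 0)",
            (new_id, new_id),
        )
        if parent_id is not None:
            # Each ancestor of the parent (the parent included) becomes an
            # ancestor of the new comment, one level further away.
            conn.execute(
                """INSERT INTO closure (ancestor_id, descendant_id, depth)
                   SELECT ancestor_id, ?, depth + 1
                   FROM closure
                   WHERE descendant_id = ?""",
                (new_id, parent_id),
            )
    return new_id


def descendants(conn, comment_id):
    """Return the ids of all descendants, nearest level first."""
    rows = conn.execute(
        """SELECT descendant_id FROM closure
           WHERE ancestor_id = ? AND depth > 0
           ORDER BY depth, descendant_id""",
        (comment_id,),
    )
    return [row[0] for row in rows]
```

The insert copies the parent's ancestor rows, which is where the D+1 rows per comment come from: D ancestors plus the self row.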

Question 2

How does Reddit-style sorting (hot, top, new, controversial) work for comments?

Accepted Answer

New: ORDER BY created_at DESC. Simple, but buries popular older comments.

Top: ORDER BY (upvotes - downvotes) DESC. Problem: a comment with 100 upvotes and 0 downvotes ranks the same as one with 200 upvotes and 100 downvotes.

Wilson score lower bound: a statistically sound ranking for items with positive and negative votes. Formula: (p + z²/(2n) - z*sqrt(p*(1-p)/n + z²/(4n²))) / (1 + z²/n), where p = upvotes/(upvotes+downvotes), z = 1.96 (95% CI), n = total votes. Treat a comment with n = 0 as score 0. Low-vote comments rank conservatively.

Hot (time decay): score = (upvotes - downvotes) / (age_hours + 2)^1.5. Recent comments with votes rank higher than old comments with the same votes.

Controversial: score = total_votes * min(ups, downs) / max(ups, downs), with score 0 when there are no votes. A high total vote count AND a close up/down ratio signals controversy.

Precompute scores and store them in the score column. Scores that depend only on votes (top, Wilson, controversial) can be recomputed on every vote; the hot score also depends on age, so refresh it periodically or compute it at read time rather than only when a vote arrives. Build a composite index on (content_item_id, score DESC, comment_id DESC) for efficient pagination.
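The formulas above translate fairly directly into code. The following is a sketch rather than a reference implementation: the function names are made up for illustration, and returning 0 for comments with no votes is an assumption the text does not spell out.

```python
import math


def top_score(ups, downs):
    """Net votes, as used by the 'top' sort."""
    return ups - downs


def wilson_lower_bound(ups, downs, z=1.96):
    """Lower bound of the Wilson score interval (z = 1.96 for 95% CI)."""
    n = ups + downs
    if n == 0:
        return 0.0  # assumption: unvoted comments get the lowest score
    p = ups / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


def hot_score(ups, downs, age_hours):
    """Time-decayed net votes: older comments need more votes to rank."""
    return (ups - downs) / (age_hours + 2) ** 1.5


def controversial_score(ups, downs):
    """High when total votes are high and the up/down split is close."""
    if max(ups, downs) == 0:
        return 0.0  # assumption: avoid dividing by zero when there are no votes
    return (ups + downs) * min(ups, downs) / max(ups, downs)
```

For the example in the text, top_score gives 100 for both the 100/0 comment and the 200/100 comment, while wilson_lower_bound ranks the 100/0 comment higher.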

Question 3

How do you handle vote race conditions in a comment system?

Accepted Answer

The race: two users vote on the same comment simultaneously. Without atomicity, both reads see upvotes=5, both increment to 6, and one write is lost. Solutions:

(1) UPSERT for the vote record: INSERT INTO CommentVote (comment_id, user_id, vote) VALUES ... ON CONFLICT (comment_id, user_id) DO UPDATE SET vote = EXCLUDED.vote. The unique constraint on (comment_id, user_id) prevents duplicate votes atomically.

(2) Atomic counter update: UPDATE Comment SET upvotes = upvotes + 1 WHERE comment_id = X. PostgreSQL applies the increment under a row lock, so there is no read-modify-write race. Run it in the same transaction as the upsert, and only when the upsert actually inserted or changed the vote; otherwise a repeated vote inflates the count.

(3) Changing a vote (up → down): UPDATE Comment SET upvotes = upvotes - 1, downvotes = downvotes + 1 WHERE comment_id = X AND (SELECT vote FROM CommentVote WHERE ...) = 'UP'. Run this before updating the vote record, in the same transaction, so the subquery still sees the old vote.

(4) Redis: HINCRBY comment:{id} upvotes 1 is atomic on its own; use a Lua script when the check (has this user already voted?) and the increment must happen together.

Use the DB approach (options 1+2) for correctness; add Redis caching of vote counts for read performance.
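One way to combine the vote upsert with the counter update is sketched below, using Python's sqlite3 module so the example is self-contained. The schema and the cast_vote helper are hypothetical. SQLite serializes writers, so this shows the transaction shape rather than PostgreSQL's locking behaviour; on PostgreSQL the initial read would likely need SELECT ... FOR UPDATE or an equivalent guard.

```python
import sqlite3


def create_vote_schema(conn):
    """Hypothetical tables mirroring the Comment / CommentVote names above."""
    conn.executescript("""
        CREATE TABLE Comment (
            comment_id INTEGER PRIMARY KEY,
            upvotes    INTEGER NOT NULL DEFAULT 0,
            downvotes  INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE CommentVote (
            comment_id INTEGER NOT NULL,
            user_id    INTEGER NOT NULL,
            vote       TEXT NOT NULL CHECK (vote IN ('UP', 'DOWN')),
            PRIMARY KEY (comment_id, user_id)
        );
    """)


def cast_vote(conn, comment_id, user_id, vote):
    """Record a vote and adjust the counters in one transaction."""
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            "SELECT vote FROM CommentVote WHERE comment_id = ? AND user_id = ?",
            (comment_id, user_id),
        ).fetchone()
        previous = row[0] if row else None
        if previous == vote:
            return  # repeated vote: counters must not change

        # The primary key makes a second row for the same user impossible.
        conn.execute(
            """INSERT INTO CommentVote (comment_id, user_id, vote)
               VALUES (?, ?, ?)
               ON CONFLICT (comment_id, user_id)
               DO UPDATE SET vote = excluded.vote""",
            (comment_id, user_id, vote),
        )

        # Deltas are -1, 0 or +1 depending on the old and new vote.
        up_delta = int(vote == "UP") - int(previous == "UP")
        down_delta = int(vote == "DOWN") - int(previous == "DOWN")
        # Increment in SQL so the database, not the application, does the math.
        conn.execute(
            """UPDATE Comment
               SET upvotes = upvotes + ?, downvotes = downvotes + ?
               WHERE comment_id = ?""",
            (up_delta, down_delta, comment_id),
        )
```

Computing the deltas from the previous vote covers first votes, repeated votes, and vote changes with a single code path.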

Question 4

How does cursor-based pagination work for comment loading?

Accepted Answer

Offset pagination (LIMIT 20 OFFSET 40) is unstable for comments: new comments or votes can shift the ordering, causing duplicates or missed items between pages. Cursor pagination: the cursor encodes the position in the sorted order. For comments sorted by (score DESC, comment_id DESC), cursor = (last_score, last_comment_id). Next page query: WHERE content_item_id = ? AND (score, comment_id) < (cursor_score, cursor_comment_id) ORDER BY score DESC, comment_id DESC LIMIT 20. This is keyset pagination: the index seeks directly to the cursor position, so rows from earlier pages are not re-read. The composite cursor (score, comment_id) handles ties in score: if two comments have score=0.8, comment_id acts as the tiebreaker and keeps the ordering stable. A comment whose score changes between requests can still move across the cursor boundary, but new comments no longer shift every later page. Encode the cursor as base64 so the client treats it as an opaque token. Index required: CREATE INDEX ON Comment (content_item_id, score DESC, comment_id DESC).
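A minimal sketch of the cursor round trip, again with sqlite3 for self-containment. The JSON-inside-base64 cursor format and the function names are assumptions for illustration. The row comparison is written out as score < ? OR (score = ? AND comment_id < ?), which is equivalent to the tuple form and works on databases without row-value support.

```python
import base64
import json
import sqlite3


def encode_cursor(score, comment_id):
    """Pack the last row's sort key into an opaque token for the client."""
    raw = json.dumps([score, comment_id]).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor):
    """Inverse of encode_cursor; returns (score, comment_id)."""
    score, comment_id = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return score, comment_id


def fetch_page(conn, content_item_id, cursor=None, limit=20):
    """Return (comment_ids, next_cursor) ordered by score DESC, comment_id DESC."""
    where = "content_item_id = ?"
    params = [content_item_id]
    if cursor is not None:
        score, comment_id = decode_cursor(cursor)
        # Expanded form of (score, comment_id) < (cursor_score, cursor_comment_id).
        where += " AND (score < ? OR (score = ? AND comment_id < ?))"
        params += [score, score, comment_id]
    rows = conn.execute(
        f"""SELECT comment_id, score FROM Comment
            WHERE {where}
            ORDER BY score DESC, comment_id DESC
            LIMIT ?""",
        params + [limit],
    ).fetchall()
    # A short page means the end was reached; a full page may be followed
    # by an empty one, which the client can treat as the end as well.
    next_cursor = None
    if len(rows) == limit:
        last_id, last_score = rows[-1]
        next_cursor = encode_cursor(last_score, last_id)
    return [row[0] for row in rows], next_cursor
```

The client passes next_cursor back unchanged on the following request and never needs to know what it contains.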

Question 5

How does auto-moderation work for a comment system at scale?

Accepted Answer

Auto-moderation prevents spam and harmful content without requiring manual review of every comment. Use a multi-layer approach:

(1) Rule-based pre-filters (synchronous, cheap enough to run inline): blocked word list, URL pattern matching (to detect spam links), rate limiting (max K comments per hour per user), and duplicate detection (the same comment posted multiple times in the last N minutes, found by hashing the normalized body). Reject immediately if any rule triggers.

(2) ML toxicity classifier (asynchronous): after the comment is accepted (status=PENDING_REVIEW), send it to a text classification model (Perspective API or a fine-tuned BERT). If toxicity score > 0.9: auto-hide (status=HIDDEN). If 0.7-0.9: flag for human review (status=FLAGGED). If < 0.7: publish (status=ACTIVE).

(3) Community reporting: if a comment receives N distinct user reports, auto-hide it and escalate to human moderators.

Human moderators review FLAGGED and HIDDEN comments and can override the ML decision. Track the false positive rate: too many false positives erode user trust.
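The first two layers could look roughly like the sketch below. Everything named here is hypothetical: the blocked words, the link limit, the default rate limit, and the function names are placeholders, and the toxicity score is assumed to arrive from whichever classifier is used. Only the 0.9 and 0.7 thresholds come from the text, with scores of exactly 0.7 and 0.9 treated as FLAGGED.

```python
import hashlib
import re

BLOCKED_WORDS = {"spamword", "badword"}  # placeholder list for illustration
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def normalized_hash(body):
    """Hash the body after lowercasing and collapsing whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def prefilter(body, recent_hashes, comments_last_hour, max_per_hour=10, max_links=2):
    """Synchronous rules; return a rejection reason, or None to accept.

    recent_hashes holds normalized_hash values of the user's recent comments;
    comments_last_hour is the user's comment count in the past hour.
    """
    if comments_last_hour >= max_per_hour:
        return "rate_limited"
    words = set(re.findall(r"[a-z0-9']+", body.lower()))
    if words & BLOCKED_WORDS:
        return "blocked_word"
    if len(URL_PATTERN.findall(body)) > max_links:
        return "too_many_links"
    if normalized_hash(body) in recent_hashes:
        return "duplicate"
    return None


def status_for_toxicity(score):
    """Map a classifier score in [0, 1] to a comment status."""
    if score > 0.9:
        return "HIDDEN"
    if score >= 0.7:
        return "FLAGGED"
    return "ACTIVE"
```

Keeping the thresholds in one function makes them easy to tune as the false positive rate is measured.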

Comment System Low-Level Design

Requirements

Data Model

Nested Comments: Closure Table vs Path Enumeration

Comment Sorting

Vote Handling and Race Conditions

Auto-Moderation

Key Design Decisions