What Is a Content Moderation System?
A content moderation system detects and removes harmful content (hate speech, spam, CSAM, misinformation, violence) on user-generated content platforms. Examples: Facebook's moderation pipeline (100B+ posts), YouTube's Content ID system, Twitter's spam filters. Core challenges: scale (millions of posts per hour), latency (pre-publish blocking vs. post-publish removal), accuracy (minimizing false positives that silence legitimate users), and adversarial content designed to evade detection.
System Requirements
Functional
- Classify text/images/video as: safe, borderline, violating
- Auto-remove high-confidence violations immediately
- Queue borderline content for human review
- Appeals: users can contest removal decisions
- Hash-based detection for known violating content (CSAM hashes)
Non-Functional
- 1M posts/hour, <500ms for pre-publish text classification
- Human reviewers handle 100K items/day
- False positive rate <0.1% (do not remove legitimate content)
Multi-Layer Moderation Pipeline
Content submission
│
▼
[Layer 1] Hash matching (PhotoDNA, MD5)
→ exact match: block immediately (O(1))
│
▼
[Layer 2] ML classifier (text: BERT-based, image: CNN)
→ score: high confidence bad → block
→ score: medium confidence → human review queue
→ score: low confidence → allow
│
▼
[Layer 3] Human review workers (for borderline content)
│
▼
[Layer 4] Appeals (for removed content)
Hash-Based Detection
For known illegal content (CSAM), use perceptual hashing (PhotoDNA). Unlike cryptographic hashes (MD5), perceptual hashes are similar for visually similar images — resizing, cropping, or color-adjusting a photo produces a nearly identical perceptual hash. Store known violating hashes in a Bloom filter for O(1) lookup at submission time. The Bloom filter can hold 1B hashes in ~1.2 GB with a 1% false positive rate. Any Bloom filter hit triggers exact hash verification before blocking.
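The lookup flow above can be sketched as follows. This is a minimal illustration, not production code: the `BloomFilter` here is a toy implementation built on SHA-256, and `known_bad` (a plain set standing in for the exact-hash database) plus its single entry are hypothetical.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions derived from SHA-256 over a bit array."""
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive k independent positions by salting the hash with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means definitely absent; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Hypothetical known-bad hash set, standing in for the exact-hash database.
known_bad = {b"hash_of_known_violating_image"}
bloom = BloomFilter()
for h in known_bad:
    bloom.add(h)

def check_submission(content_hash: bytes) -> bool:
    """Return True if the content should be blocked."""
    if not bloom.might_contain(content_hash):
        return False                    # definite miss: skip the database lookup
    return content_hash in known_bad    # verify exact match on any Bloom hit
```

The exact-match verification step is what makes the Bloom filter's ~1% false positive rate acceptable: a filter hit only triggers a database lookup, never a block on its own.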
ML Classification
Text: a fine-tuned BERT or RoBERTa model running on GPU inference servers. Pre-publish path: synchronous call with a 200ms timeout; on timeout, fail open and allow (accept false negatives rather than blocking legitimate content). Post-publish: re-run classification asynchronously with a more expensive model. Image: ResNet/EfficientNet CNN. Video: sample frames at 1 fps, classify each frame, and aggregate the frame scores. The classifier returns a confidence score in [0, 1] plus violation categories.
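The pre-publish path with its fail-open timeout can be sketched as below. `classify_text` is a hypothetical stand-in for the real model call, and the 0.8/0.3 thresholds are illustrative values matching the pipeline diagram.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def classify_text(text: str) -> float:
    """Stand-in for a BERT inference call; returns a violation confidence in [0, 1]."""
    return 0.9 if "buy now!!!" in text.lower() else 0.05

def moderate_pre_publish(text: str, timeout_s: float = 0.2) -> str:
    """Synchronous pre-publish decision: 'block', 'review', or 'allow'."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(classify_text, text)
        try:
            score = future.result(timeout=timeout_s)
        except TimeoutError:
            # Fail open: accept a false negative rather than block a legit post.
            # (Post-publish async classification catches it later.)
            return "allow"
    if score > 0.8:
        return "block"
    if score >= 0.3:
        return "review"
    return "allow"
```

In a real deployment the inference call would be an RPC with server-side deadlines rather than a thread pool, but the decision structure (timeout → allow, then threshold routing) is the same.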
Human Review Queue
Borderline content (confidence 0.3-0.8) goes to the review queue. Prioritization order: (1) content visibility (viral content reviewed first), (2) severity of the potential violation, (3) submission time. Each item is shown to a reviewer with context: user history, report count, and the relevant policy reference. Reviewer actions: remove, allow, or mark for policy update. Quality control: 5% of reviewed items are re-reviewed by a senior reviewer to measure inter-rater agreement; reviewers with low agreement undergo retraining.
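A minimal sketch of the weighted priority queue, using an in-process heap in place of the Redis Sorted Set mentioned later. The severity categories and weight constants are illustrative assumptions, not production values.

```python
import heapq

# Hypothetical severity tiers; real systems map these from policy categories.
SEVERITY = {"csam": 3, "violence": 3, "hate": 2, "spam": 1}

def priority_score(severity: str, views_per_hour: float, report_count: int) -> float:
    # Illustrative weights; in practice these are tuned against review SLAs.
    return 1000 * SEVERITY.get(severity, 0) + 0.1 * views_per_hour + 5 * report_count

queue = []  # min-heap of (-score, item_id)

def enqueue(item_id: str, severity: str, views_per_hour: float, reports: int) -> None:
    # heapq is a min-heap, so push the negated score to pop highest priority first.
    heapq.heappush(queue, (-priority_score(severity, views_per_hour, reports), item_id))

def next_item() -> str:
    """Pop the highest-priority item for the next available reviewer."""
    return heapq.heappop(queue)[1]
```

At scale this becomes a shared Redis Sorted Set (`ZADD` with the score, `ZPOPMAX` to pull work), but the scoring logic is identical.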
Signals for Classification
- Content features: text toxicity, image nudity score, audio transcription
- User signals: account age, prior violations, follower/following ratio
- Graph signals: how many of this user’s posts were reported, by whom
- Velocity signals: posting 100 identical messages in an hour = spam
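The velocity signal in the last bullet can be implemented as a sliding-window counter per (user, message) pair. A minimal sketch, with the threshold and window as assumed parameters:

```python
from collections import defaultdict, deque
from typing import Optional
import time

class VelocityDetector:
    """Flags a user when identical messages exceed a rate threshold in a window."""
    def __init__(self, max_identical: int = 100, window_s: float = 3600.0):
        self.max_identical = max_identical
        self.window_s = window_s
        self.events = defaultdict(deque)  # (user_id, message) -> timestamps

    def record(self, user_id: str, message: str,
               now: Optional[float] = None) -> bool:
        """Record a post; return True if it trips the spam threshold."""
        now = time.monotonic() if now is None else now
        q = self.events[(user_id, message)]
        q.append(now)
        # Evict timestamps that have aged out of the window.
        while q and now - q[0] > self.window_s:
            q.popleft()
        return len(q) >= self.max_identical
```

In production the counters would live in Redis with TTLs (and messages would be keyed by a normalized hash, so trivial variations still match), but the sliding-window logic is the same.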
Appeals and Feedback Loop
Users appeal removed content via a form; appeals go to a senior review queue. If a removal is overturned, the content is restored and the model prediction is logged as a false positive. These false positive examples are added to the training set with the corrected label. A continuous retraining pipeline ingests reviewer decisions weekly and updates the classifier, closing the feedback loop: the model improves from human decisions.
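The overturn-to-training-example step can be sketched as below. The `TrainingExample` schema, the in-memory `training_set` list (standing in for the labeled-data store), and the `"appeal_overturned"` provenance tag are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    content_id: str
    text: str
    label: str    # corrected label after human review
    source: str   # provenance, e.g. "appeal_overturned"

training_set = []  # stands in for the labeled-data store

def resolve_appeal(content_id: str, text: str,
                   model_label: str, overturned: bool) -> None:
    """On an overturned appeal, log a corrected example for retraining."""
    if not overturned:
        return  # removal upheld: no correction to learn from
    # The model said `model_label` but humans disagreed: record the item
    # as a false positive with the corrected "safe" label for the weekly batch.
    training_set.append(
        TrainingExample(content_id, text, "safe", "appeal_overturned"))
```

Tracking provenance matters: appeal-sourced examples are biased toward the model's false positives, so retraining typically mixes them with regularly sampled reviewer decisions.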
Scaling Human Review
At 1M posts/hour with a 10% borderline rate, that is 100K items/hour. At 200 items/reviewer/hour, need 500 concurrent reviewers. Geographic distribution: native speakers for non-English content. Outsource to moderation vendors (Accenture, Teleperformance) for scale. Protect reviewer mental health: mandatory breaks, psychological support, session content diversity limits.
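The headcount arithmetic above generalizes to a one-line capacity formula; a sketch for checking staffing against changing borderline rates:

```python
import math

def reviewers_needed(posts_per_hour: int, borderline_rate: float,
                     items_per_reviewer_hour: int) -> int:
    """Concurrent reviewers required to keep pace with the borderline stream."""
    borderline_per_hour = posts_per_hour * borderline_rate
    return math.ceil(borderline_per_hour / items_per_reviewer_hour)

# 1M posts/hour, 10% borderline, 200 items/reviewer/hour -> 500 reviewers
```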
Interview Tips
- Multi-layer pipeline is the key insight: cheap checks first (hash), expensive last (ML).
- Perceptual hashing + Bloom filter for known-bad content = O(1) rejection.
- Pre-publish vs. post-publish trade-off: latency vs. accuracy.
- Human review and the feedback loop complete the system — don’t design without them.
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How do you build a multi-layer content moderation pipeline and why layer it?",
"acceptedAnswer": { "@type": "Answer", "text": "A multi-layer pipeline applies cheap checks first and expensive checks last, short-circuiting as soon as a decision is made. Layer 1 — hash matching (O(1)): check the content hash against a Bloom filter of known-violating hashes. PhotoDNA for images, MD5 for exact text matches. Any hit triggers immediate block with zero ML cost. Layer 2 — rule-based filters (O(ms)): regex patterns for spam URLs, keyword blocklists, velocity checks (user posting 100 times/minute). Cheap and fast, catches obvious violations. Layer 3 — ML classifier (O(100ms)): BERT for text toxicity, CNN for image nudity. GPU inference. Returns a confidence score and violation categories. Layer 4 — human review: only borderline confidence scores (0.3-0.8) go here. High-confidence violations (> 0.8) are auto-removed; low-confidence (< 0.3) are auto-allowed. This cascade means 95%+ of content is resolved by layers 1-2, and ML only runs on the remaining 5%, dramatically reducing compute cost and latency." }
},
{
"@type": "Question",
"name": "How does perceptual hashing enable detection of modified copies of violating images?",
"acceptedAnswer": { "@type": "Answer", "text": "Cryptographic hashes (MD5, SHA-256) change completely if even one pixel changes — resizing or adding a watermark produces a totally different hash. Perceptual hashes (PhotoDNA, pHash, dHash) hash the image's visual content. They are robust to minor modifications: resizing, cropping, color adjustments, and adding small watermarks produce similar perceptual hashes. Two perceptually similar images produce hashes with low Hamming distance (number of differing bits). Algorithm: dHash converts image to 8×9 grayscale, compares adjacent pixels to produce a 64-bit hash. Two images with Hamming distance <= 10 are considered similar. For CSAM detection: NCMEC maintains a hash database of known CSAM images. Platforms run PhotoDNA on all uploaded images and compare against this database. False positive risk: the low Hamming distance threshold may match legitimate images — human review handles any Bloom filter hits before taking action." }
},
{
"@type": "Question",
"name": "How do you prioritize the human review queue in a content moderation system?",
"acceptedAnswer": { "@type": "Answer", "text": "With 100K+ items/day in the review queue, reviewers cannot process everything immediately. Prioritization factors: (1) Content velocity — viral content reaching 1M views should be reviewed before content with 10 views; calculate views-per-hour and prioritize high-velocity items. (2) Violation severity — potential CSAM or credible violence threats are P0, hate speech is P1, spam is P2. (3) User report count — 100 users reporting the same content signals higher urgency than zero reports. (4) Account risk signals — content from accounts with prior violations ranks higher. Implementation: weighted priority score = (severity_weight * severity) + (velocity_weight * views_per_hour) + (report_weight * report_count) + (account_risk_weight * account_risk). Store in a priority queue (Redis Sorted Set keyed by score). Reviewers always pull from the top of the queue. SLA: P0 reviewed within 1 hour, P1 within 24 hours, P2 within 72 hours. Measure queue depth by priority tier and alert when P0 SLA is at risk." }
}
]
}