System Design: Design Content Moderation System — ML Classification, Human Review, Appeals, Policy Enforcement

Content moderation protects billions of users from harmful content on social platforms, marketplaces, and communication apps. Designing a moderation system tests your understanding of ML classification at scale, human-in-the-loop workflows, policy versioning, appeals processes, and the unique challenge that moderation decisions have real-world consequences for both users and platforms. The question is increasingly common in interviews at companies that host user-generated content.

Moderation Pipeline

Multi-stage pipeline from content creation to enforcement:

1. Pre-publication filters (< 50ms) — run before content is visible to others: hash matching (PhotoDNA for known CSAM, perceptual hashing for previously removed content), keyword blocklists (exact match and regex for known harmful terms), and spam classifiers (account age, posting rate, content patterns). These are fast and catch obvious violations.
2. ML classifiers (near-real-time, < 5 seconds) — run after publication: image classification (nudity, violence, hate symbols), text classification (hate speech, harassment, self-harm, misinformation), and video classification (frame-by-frame analysis + audio transcription + scene understanding). Each classifier outputs a score (0-1) per policy category.
3. Decision engine — combines classifier scores with context (user history, content reach, platform policies) to decide: auto-remove (high-confidence violation), send to human review (medium confidence), or allow (low confidence). Thresholds are tuned per category: lower for CSAM (remove first, review later), higher for political speech (review before removing).
4. Human review — trained moderators review flagged content against detailed policy guidelines. Decisions: remove, reduce distribution, add a warning label, or approve.
5. User notification — if content is removed, notify the creator with the policy that was violated, the specific content, and how to appeal.
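The decision-engine step above can be sketched as a threshold lookup per category, adjusted by user history. This is a minimal illustration; the category names, threshold values, and the `strikes` discount are hypothetical stand-ins, not real production values.

```python
# Hypothetical per-category thresholds; real values are tuned from
# precision/recall measurements per policy category.
THRESHOLDS = {
    "csam":      {"auto_remove": 0.40, "human_review": 0.10},  # remove first, review later
    "hate":      {"auto_remove": 0.95, "human_review": 0.70},
    "political": {"auto_remove": 1.01, "human_review": 0.80},  # never auto-remove
}

def decide(category: str, score: float, strikes: int = 0) -> str:
    """Map a classifier score to an enforcement action.

    Repeat offenders (past violations) get a small threshold discount,
    mirroring the 'lower thresholds for repeat offenders' rule.
    """
    t = THRESHOLDS[category]
    discount = min(0.05 * strikes, 0.15)  # cap the history adjustment
    if score >= t["auto_remove"] - discount:
        return "auto_remove"
    if score >= t["human_review"] - discount:
        return "human_review"
    return "allow"

print(decide("csam", 0.45))             # auto_remove
print(decide("political", 0.97))        # human_review: review before removing
print(decide("hate", 0.92, strikes=2))  # auto_remove after history discount
```

Note how political speech gets an unreachable auto-remove threshold: high-stakes categories route to humans rather than automation, exactly the asymmetry the pipeline description calls for.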

ML Classification at Scale

Facebook processes 3+ billion posts per day, and every post must be classified across 10+ policy categories. Architecture:

1. Feature extraction — text: embeddings from a multilingual model (XLM-RoBERTa covers 100+ languages). Images: visual features from a CNN/ViT. Video: sampled frames plus an audio transcript. These embeddings are reused across multiple classifiers.
2. Ensemble of classifiers — one per policy category: hate speech, nudity, violence, spam, misinformation, self-harm, terrorism, intellectual property, etc. Each is trained on labeled data specific to that category.
3. Multi-modal fusion — combine text, image, and context features. A meme with innocuous text and an innocuous image may be hateful only when the two are combined; multi-modal models (like CLIP) capture these text-image relationships.
4. Context features — account age (new accounts are riskier), past violations (repeat offenders get lower thresholds), content virality (fast-spreading content gets priority review), and community norms (what is acceptable in a meme group vs a professional forum).

Training data: human-labeled examples of violations and non-violations. Active learning: prioritize labeling examples where the model is uncertain (near the decision boundary) to improve performance efficiently. Multilingual challenges: hate speech in 100+ languages requires language-specific training data and cultural context; a word that is offensive in one culture may be benign in another.
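The shared-embedding pattern above can be sketched as follows: embeddings are computed once per post, then every per-category head scores them. This is a toy sketch, assuming random weight vectors in place of trained models, and naive averaging in place of learned multi-modal fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

CATEGORIES = ["hate_speech", "nudity", "violence", "spam", "self_harm"]

# Random stand-ins for trained per-category classifier heads; a real
# system would load learned weights for each policy category.
HEADS = {c: rng.normal(size=(768,)) for c in CATEGORIES}

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def classify(text_emb: np.ndarray, image_emb: np.ndarray) -> dict:
    """Score one post against every policy category.

    Fusion here is a simple average of modality embeddings; production
    systems use learned multi-modal fusion (e.g. CLIP-style models) so
    that a benign caption plus a benign image can still score high together.
    """
    fused = (text_emb + image_emb) / 2.0  # naive late fusion
    return {c: float(sigmoid((w @ fused) / 16.0)) for c, w in HEADS.items()}

scores = classify(rng.normal(size=768), rng.normal(size=768))
assert set(scores) == set(CATEGORIES)
assert all(0.0 <= s <= 1.0 for s in scores.values())
```

The design point is cost: extracting embeddings once and fanning out to lightweight heads is far cheaper than running 10+ full models per post at billions of posts per day.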

Human Review Operations

ML cannot make all moderation decisions — edge cases, context-dependent content, and evolving policies require human judgment. Review workforce: thousands of trained moderators (internal or outsourced to firms like Accenture and Teleperformance). Each moderator is trained on specific policy areas, and regular calibration sessions ensure consistency.

Review queue management:

1. Priority queue — content is prioritized by severity (CSAM > terrorism > harassment), virality (fast-spreading content is reviewed first), and confidence (medium-confidence ML decisions are reviewed before low-confidence ones).
2. Routing — route content to moderators with the right expertise: language match, policy area specialization, and cultural context.
3. Quality assurance — a percentage of decisions is reviewed by senior moderators. Inter-rater reliability is tracked, and moderators with low agreement rates receive additional training.

Moderator well-being: content moderators are exposed to disturbing content daily. Mitigations: mandatory breaks, access to counseling, content blurring for the most extreme categories (reviewers can choose to view the blurred version), and rotation between severe and mild content queues. This is both an ethical obligation and a business necessity: high turnover in moderation teams is expensive.

Decision SLAs: CSAM and terrorism content reviewed within 1 hour; hate speech and harassment within 24 hours; copyright claims within 48 hours.
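The priority queue described above can be sketched as a heap keyed on (severity, virality). The severity ranks and category names below are illustrative assumptions, not actual platform values.

```python
import heapq
import itertools

# Hypothetical severity ranks; lower rank = reviewed first.
SEVERITY = {"csam": 0, "terrorism": 1, "harassment": 2, "copyright": 3}

class ReviewQueue:
    """Min-heap ordered by (severity, -virality): the most severe,
    fastest-spreading content is popped first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, content_id: str, category: str, virality: float):
        key = (SEVERITY[category], -virality, next(self._counter))
        heapq.heappush(self._heap, (key, content_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

q = ReviewQueue()
q.push("post-1", "harassment", virality=0.9)
q.push("post-2", "csam", virality=0.1)
q.push("post-3", "terrorism", virality=0.8)
print(q.pop())  # post-2: severity dominates virality
print(q.pop())  # post-3
```

A real queue would also encode the per-category SLA (1h / 24h / 48h) as a deadline so that aging items escalate rather than starve behind a constant stream of higher-severity content.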

Appeals and Policy Versioning

Appeals: users can appeal moderation decisions. Each appeal is reviewed by a different moderator (not the original decision-maker) with higher seniority. If the appeal succeeds, the content is restored and the original decision becomes a training signal: the model made a mistake, so the example is added to the training set as a negative.

Appeal metrics: appeal rate (what percentage of removed content is appealed?) and overturn rate (what percentage of appeals succeed?). A high overturn rate indicates that the ML threshold is too aggressive, the policy is unclear, or moderator training is insufficient. Track overturn rate per category and per moderator.

Policy versioning: moderation policies change frequently (new regulations, evolving norms, emerging threats). Each policy has a version with an effective date, specific rules, examples, and edge-case guidance. When a policy changes: retrain classifiers on the new labels, update moderator guidelines, and re-evaluate recently moderated content under the new policy (some previously removed content may now be allowed, and vice versa).

Transparency reports: publish regular reports showing total content removed per category, auto-removed vs human-reviewed counts, appeal statistics, and false positive/negative estimates. This builds user trust and is increasingly required by regulation (e.g., the EU Digital Services Act).
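The appeal metrics above are simple ratios, but keeping them per category is what makes them actionable. A minimal sketch, with hypothetical category names and counts:

```python
from collections import defaultdict

class AppealStats:
    """Track appeal and overturn rates per policy category.

    A high overturn rate in one category suggests the classifier
    threshold there is too aggressive or the policy text is unclear.
    """

    def __init__(self):
        self.removals = defaultdict(int)
        self.appeals = defaultdict(int)
        self.overturns = defaultdict(int)

    def record_removal(self, category: str):
        self.removals[category] += 1

    def record_appeal(self, category: str, overturned: bool):
        self.appeals[category] += 1
        if overturned:
            self.overturns[category] += 1

    def appeal_rate(self, category: str) -> float:
        return self.appeals[category] / max(self.removals[category], 1)

    def overturn_rate(self, category: str) -> float:
        return self.overturns[category] / max(self.appeals[category], 1)

s = AppealStats()
for _ in range(100):
    s.record_removal("hate_speech")
for i in range(20):
    s.record_appeal("hate_speech", overturned=(i < 8))
print(s.appeal_rate("hate_speech"))    # 0.2
print(s.overturn_rate("hate_speech"))  # 0.4
```

The same per-moderator breakdown (keying on moderator ID instead of category) identifies reviewers who need recalibration training.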

Proactive vs Reactive Moderation

Proactive moderation: the platform detects and removes violations before users report them. ML classifiers scan all content at creation time; 95%+ of violations on major platforms are detected proactively rather than through user reports. This is the primary defense.

Reactive moderation: users report content they find problematic. A report triggers priority review by a moderator, a context check (did other users also report this?), and potential escalation (a post with 100 reports is likely a genuine violation).

User report quality: many reports reflect disagreement rather than actual policy violations. The system must filter:

1. Coordinated reporting (campaigns to remove content by mass-reporting) — detect coordinated behavior and deprioritize.
2. Retaliatory reporting (reporting someone because they reported you) — track report patterns between users.
3. False reports (reporting content that clearly does not violate policy) — deprioritize repeat false reporters.

Proactive detection for emerging threats:

1. Trend monitoring — detect new harmful trends (challenges, coded language, new hate symbols) by monitoring report spikes, tracking new hashtags, and running NLP analysis on new linguistic patterns.
2. Adversarial robustness — bad actors evade detection by misspelling hate terms, using coded language, embedding text in images, and adding noise to bypass image classifiers. The moderation system must continuously adapt: retrain on adversarial examples, deploy new classifiers for emerging patterns, and update hash databases.
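One signal for the coordinated-reporting case above is a burst of reports on the same content in a short window. The sketch below uses an illustrative sliding-window heuristic; the window size and burst threshold are assumptions, and a real system would combine this with reporter-graph analysis.

```python
from collections import defaultdict

def flag_coordinated_reports(reports, window_s=300, burst_threshold=50):
    """Flag content whose reports look like a coordinated campaign.

    reports: iterable of (content_id, reporter_id, unix_timestamp).
    Heuristic (illustrative): any 5-minute window containing
    burst_threshold or more reports on one item is treated as a
    possible mass-reporting campaign and deprioritized, not escalated.
    """
    by_content = defaultdict(list)
    for content_id, _reporter, ts in reports:
        by_content[content_id].append(ts)

    flagged = set()
    for content_id, times in by_content.items():
        times.sort()
        lo = 0
        for hi in range(len(times)):          # slide a window over sorted times
            while times[hi] - times[lo] > window_s:
                lo += 1
            if hi - lo + 1 >= burst_threshold:
                flagged.add(content_id)
                break
    return flagged

# 60 reports on one post within ~1 minute: looks coordinated.
burst = [("post-2", f"u{i}", 2000 + i) for i in range(60)]
print(flag_coordinated_reports(burst))  # {'post-2'}
```

The retaliatory-reporting case needs different state (a per-pair report history between users), but the deprioritization decision plugs into the same queue.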
