What score thresholds are used in ML content moderation?

A common three-band scheme: a score above 0.95 triggers an auto-block (high-confidence violation), a score from 0.30 to 0.95 inclusive routes to the human review queue, and a score below 0.30 is auto-allowed. Thresholds are tuned per classifier and content type to balance precision and recall.

How is the human review queue prioritized in a content moderation system?

Items in the review band are ordered by ML score descending so reviewers see the most likely violations first. This minimizes harm during backlog conditions. The SLA for human review is typically 24 hours.

How does an appeal workflow work in content moderation?

When a user appeals a block, an AppealCase is created linked to the original ModerationJob. A secondary reviewer is assigned and may trigger an ML re-score with the latest model version. The reviewer can uphold the block or reverse it to allow. The user is notified of the outcome.

How do reviewer decisions improve ML content moderation models?

Reviewer allow/block decisions on ML-flagged content are collected as labeled training examples. A weekly retraining pipeline incorporates these corrections. Model performance is evaluated on a held-out validation set before each deployment to confirm improvement and catch regressions.

Low Level Design: ML Content Moderation Service

Overview

An ML-powered content moderation service classifies user-submitted content (text, images, video) at scale, routes borderline cases to human reviewers, and supports an appeal workflow with a feedback loop to improve models over time.

Moderation Pipeline

Content submitted
  --> ML Classifier (score 0.0 - 1.0)
  --> Threshold decision: allow / human-review / block
  --> [if review] Human reviewer: allow / block / escalate
  --> Final action applied + audit log written

Data Model

ModerationJob Table

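-- PostgreSQL syntax; enum-like columns are TEXT with CHECK constraints.
CREATE TABLE ModerationJob (
  id            UUID PRIMARY KEY,
  content_id    TEXT NOT NULL,
  content_type  TEXT CHECK (content_type IN ('text','image','video')),
  ml_scores     JSONB,        -- per-classifier scores
  ml_action     TEXT CHECK (ml_action IN ('allow','review','block')),
  human_action  TEXT CHECK (human_action IN ('allow','block','escalate')),  -- NULL until a reviewer acts
  final_action  TEXT CHECK (final_action IN ('allow','block')),             -- NULL while the job awaits review
  reviewer_id   INT,          -- NULL for fully automated decisions
  reviewed_at   TIMESTAMP,
  appeal_status TEXT DEFAULT 'none' CHECK (appeal_status IN ('none','pending','resolved')),
  created_at    TIMESTAMP DEFAULT NOW()
);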

ML Classifiers

Text Classifiers

Toxicity detector
Spam classifier
Hate speech detector
Adult content classifier

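Each is an independent binary classifier returning a score in [0, 1]. With shared thresholds, the highest score across classifiers drives the threshold decision; if thresholds are tuned per classifier, each score is checked against its own bands and the most restrictive action applies.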
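As a minimal sketch of the "highest score wins" rule, the helper below picks the classifier whose score would drive the decision. The function name `driving_score` and the example score values are hypothetical and only illustrate the idea; they are not part of the design itself.

```python
from typing import Dict, Tuple


def driving_score(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the (classifier name, score) pair with the highest score."""
    if not scores:
        raise ValueError("at least one classifier score is required")
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical per-classifier scores for one text item.
example_scores = {"toxicity": 0.12, "spam": 0.81, "hate_speech": 0.05, "adult": 0.02}
print(driving_score(example_scores))  # ('spam', 0.81)
```

Returning the classifier name alongside the score keeps the decision explainable: the review UI and the audit log can show which category triggered the routing.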

Image Classifiers

NSFW classifier (adult content)
Violence detector
Logo / trademark detector

Score Thresholds

score > 0.95           -- auto-block (high confidence violation)
0.30 <= score <= 0.95  -- human review queue
score < 0.30           -- auto-allow (high confidence clean)
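The three bands can be sketched as a small routing function. The names `Action`, `BLOCK_ABOVE`, `ALLOW_BELOW`, and `ml_action_for` are assumptions made for this example; the sketch treats scores of exactly 0.30 and 0.95 as belonging to the review band, which is one reasonable reading of the bands above.

```python
from enum import Enum


class Action(str, Enum):
    ALLOW = "allow"
    REVIEW = "review"
    BLOCK = "block"


# Default bands from the text; in practice these would be tuned per classifier.
BLOCK_ABOVE = 0.95
ALLOW_BELOW = 0.30


def ml_action_for(score: float,
                  block_above: float = BLOCK_ABOVE,
                  allow_below: float = ALLOW_BELOW) -> Action:
    """Map a classifier score in [0, 1] to allow / review / block."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score > block_above:
        return Action.BLOCK
    if score < allow_below:
        return Action.ALLOW
    return Action.REVIEW  # boundary values land in the review band


for s in (0.10, 0.30, 0.62, 0.95, 0.99):
    print(s, ml_action_for(s).value)
```

Passing the thresholds as parameters leaves room for per-classifier or per-content-type tuning without changing the routing logic.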

Human Review Queue

Items in the review band are inserted into the review queue ordered by score DESC so reviewers see the most likely violations first. Each item shows the content, the per-classifier ML scores, and the suggested action.

Reviewer options: allow, block, escalate (to senior reviewer or legal).

SLA: human review completed within 24 hours.
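A score-ordered queue with an SLA deadline can be sketched with a heap. This is an in-memory illustration only; the `ReviewQueue` class, its method names, and the tie-breaking by insertion order are assumptions, and a real service would presumably back the queue with a database or a queueing system.

```python
import heapq
import itertools
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

REVIEW_SLA = timedelta(hours=24)  # SLA stated in the text


class ReviewQueue:
    """Pops the highest-scoring item first; ties go to the earliest insert."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str, datetime]] = []
        self._seq = itertools.count()  # tie-breaker so job ids are never compared

    def push(self, job_id: str, score: float, enqueued_at: datetime) -> None:
        # heapq is a min-heap, so the score is negated to get descending order.
        due_at = enqueued_at + REVIEW_SLA
        heapq.heappush(self._heap, (-score, next(self._seq), job_id, due_at))

    def pop(self) -> Tuple[str, float, datetime]:
        neg_score, _, job_id, due_at = heapq.heappop(self._heap)
        return job_id, -neg_score, due_at

    def __len__(self) -> int:
        return len(self._heap)


now = datetime.now(timezone.utc)
queue_ = ReviewQueue()
queue_.push("job-a", 0.40, now)
queue_.push("job-b", 0.90, now)
queue_.push("job-c", 0.65, now)
while queue_:
    print(queue_.pop()[:2])  # job-b, then job-c, then job-a
```

Strict score ordering can leave low-scoring items waiting indefinitely under sustained backlog, so a production queue would likely also need to promote items that approach their `due_at` deadline.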

Appeal Workflow

User appeals blocked content
  --> AppealCase created (links to ModerationJob)
  --> Secondary reviewer assigned
  --> Optional ML re-score with latest model
  --> Decision: uphold block / reverse to allow
  --> User notified of outcome
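The appeal steps above can be sketched as a small state transition. Only the `AppealCase` name comes from the text; its fields, the `resolve_appeal` function, the `notify` callback, and the rule that the secondary reviewer must differ from the original reviewer are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AppealCase:
    job_id: str                       # links back to the ModerationJob
    status: str = "pending"           # mirrors appeal_status: pending -> resolved
    reviewer_id: Optional[int] = None
    rescore: Optional[float] = None   # optional ML re-score with the latest model
    outcome: Optional[str] = None     # "upheld" or "reversed"


def resolve_appeal(case: AppealCase,
                   reviewer_id: int,
                   original_reviewer_id: Optional[int],
                   uphold: bool,
                   notify: Callable[[str, str], None],
                   rescore: Optional[float] = None) -> str:
    """Record the secondary reviewer's decision and return the final action."""
    if case.status != "pending":
        raise ValueError("appeal is already resolved")
    if reviewer_id == original_reviewer_id:
        # Assumption: "secondary" means a different person than the first reviewer.
        raise ValueError("secondary reviewer must differ from the original reviewer")
    case.reviewer_id = reviewer_id
    case.rescore = rescore
    case.outcome = "upheld" if uphold else "reversed"
    case.status = "resolved"
    notify(case.job_id, case.outcome)  # user is told the outcome
    return "block" if uphold else "allow"


case = AppealCase(job_id="job-123")
final = resolve_appeal(case, reviewer_id=7, original_reviewer_id=3, uphold=False,
                       notify=lambda job_id, outcome: print(job_id, outcome),
                       rescore=0.41)
print(final)  # allow
```

The re-score is stored as supporting evidence only; in this sketch the human decision, not the new score, determines the outcome.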

Feedback Loop

Reviewer decisions (allow/block overrides) are collected as labeled training examples. Weekly retraining runs incorporate reviewer corrections. Model performance is tracked via precision/recall on a held-out validation set before each deployment.
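To make the evaluation step concrete, here is a sketch that turns reviewer decisions into binary labels and compares a candidate model against the current one on a held-out set. The function names, the label convention (block = 1, allow = 0), the example predictions, and the gate "deploy only if neither precision nor recall drops" are all assumptions; the text does not specify the acceptance rule.

```python
from typing import Sequence, Tuple


def to_label(human_action: str) -> int:
    """Map a reviewer decision to a training label: block -> 1, allow -> 0."""
    labels = {"block": 1, "allow": 0}
    if human_action not in labels:
        raise ValueError(f"no training label for action {human_action!r}")
    return labels[human_action]


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for the positive (violation) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def should_deploy(candidate: Tuple[float, float], baseline: Tuple[float, float]) -> bool:
    """Assumed gate: ship only if neither precision nor recall regresses."""
    return candidate[0] >= baseline[0] and candidate[1] >= baseline[1]


# Hypothetical held-out set built from reviewer decisions.
decisions = ["block", "block", "block", "allow", "allow", "allow", "block", "allow"]
y_true = [to_label(d) for d in decisions]
baseline_pred = [1, 0, 1, 1, 0, 0, 0, 0]
candidate_pred = [1, 1, 1, 1, 0, 0, 0, 0]

baseline = precision_recall(y_true, baseline_pred)
candidate = precision_recall(y_true, candidate_pred)
print(baseline, candidate, should_deploy(candidate, baseline))
```

Escalated items carry no allow/block label in this sketch, so `to_label` rejects them rather than guessing.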

Audit Trail

Every decision — ML or human — is logged with:

Actor (model version or reviewer_id)
Action taken
Rationale / scores at decision time
Timestamp

Audit records are immutable and retained per legal/compliance requirements.
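One way to illustrate the immutability requirement in application code is an append-only log of frozen records. The `AuditRecord` and `AuditLog` classes, the `job_id` field, and the `"model:v3"` actor string are hypothetical; real immutability would also have to be enforced at the storage layer, which this sketch does not cover.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Dict, List, Mapping, Tuple


@dataclass(frozen=True)
class AuditRecord:
    job_id: str
    actor: str                   # model version or reviewer id
    action: str                  # action taken
    scores: Mapping[str, float]  # scores at decision time (read-only view)
    timestamp: datetime


class AuditLog:
    """Append-only: records can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._records: List[AuditRecord] = []

    def append(self, job_id: str, actor: str, action: str,
               scores: Dict[str, float]) -> AuditRecord:
        record = AuditRecord(
            job_id=job_id,
            actor=actor,
            action=action,
            scores=MappingProxyType(dict(scores)),  # copy, then freeze
            timestamp=datetime.now(timezone.utc),
        )
        self._records.append(record)
        return record

    def records(self) -> Tuple[AuditRecord, ...]:
        return tuple(self._records)  # callers get an immutable snapshot


log = AuditLog()
log.append("job-123", "model:v3", "review", {"toxicity": 0.62})
log.append("job-123", "reviewer:42", "block", {"toxicity": 0.62})
print([(r.actor, r.action) for r in log.records()])
```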

Scale and Performance

Auto-classification throughput: 1000+ items/second via batched GPU inference
Human review SLA: 24 hours for borderline content
Async processing: content submitted to SQS → ML worker classifies → result written to ModerationJob → downstream action triggered
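The async path can be sketched with an in-process queue standing in for SQS and a dictionary standing in for the ModerationJob table. Everything here is illustrative: `drain_batch`, `run_worker`, and `fake_classify_batch` are hypothetical names, the batch size is arbitrary, and the stub classifier returns made-up scores in place of batched GPU inference.

```python
import queue
from typing import Callable, Dict, List

Scores = Dict[str, float]


def drain_batch(q: "queue.Queue[str]", max_batch: int) -> List[str]:
    """Pull up to max_batch content ids without blocking."""
    batch: List[str] = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


def run_worker(q: "queue.Queue[str]",
               classify_batch: Callable[[List[str]], List[Scores]],
               jobs: Dict[str, Scores],
               max_batch: int = 32) -> int:
    """Classify queued items in batches and write scores to the job store."""
    processed = 0
    while True:
        batch = drain_batch(q, max_batch)
        if not batch:
            return processed
        # One model call per batch is what makes GPU inference efficient.
        for content_id, scores in zip(batch, classify_batch(batch)):
            jobs[content_id] = scores
        processed += len(batch)


def fake_classify_batch(content_ids: List[str]) -> List[Scores]:
    """Stub standing in for the real model; scores are not meaningful."""
    return [{"spam": min(len(cid) / 100, 1.0)} for cid in content_ids]


inbox: "queue.Queue[str]" = queue.Queue()
for cid in ("c1", "c2", "c3", "c4", "c5"):
    inbox.put(cid)
job_store: Dict[str, Scores] = {}
print(run_worker(inbox, fake_classify_batch, job_store, max_batch=2))  # 5
```

The threshold decision and the downstream action would follow the write to the job store; they are left out here to keep the sketch focused on batching.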

Key Design Decisions

Separate binary classifiers per category: independent thresholds, easier to tune and retrain per category
Score-ordered review queue: most likely violations reviewed first; limits harm during backlog
Feedback loop: reviewer corrections feed each weekly retraining run, improving model quality without a separate manual labeling effort
Appeal workflow: second-level review reduces false positive harm to legitimate users