Content Classifier Low-Level Design: Multi-Label Classification, Ensemble Models, and Human Review

Content Classifier System Design Overview

A content classifier assigns one or more category labels to user-generated content: text posts, images, videos, and documents. Production content classifiers must handle multi-label outputs (a post can be both spam and adult content), combine signals from multiple models via ensemble voting, route uncertain predictions to human reviewers, and feed reviewer decisions back into model retraining. Getting this pipeline right is critical for platform safety and content quality.

Requirements

Functional Requirements

  • Classify content across a configurable label taxonomy (spam, adult, violence, hate speech, misinformation, safe).
  • Support multi-label classification: a single item may receive multiple labels simultaneously.
  • Combine predictions from text, image, and metadata models using ensemble voting.
  • Route items with prediction confidence below a configurable threshold to a human review queue.
  • Accept reviewer verdicts and store them for model retraining and audit.

Non-Functional Requirements

  • Classification latency under 500ms for synchronous calls on text content.
  • Throughput of 10,000 classification requests per second at peak.
  • Human review queue SLA: items reviewed within 4 hours of escalation.
  • Ensemble model updates deployable without service restart.

Data Model

  • classification_requests: request_id, content_id, content_type, content_hash, submitted_at, status
  • model_predictions: request_id, model_id, label, score, model_version, predicted_at
  • ensemble_decisions: request_id, label, ensemble_score, confidence, action (allow, block, escalate), decided_at
  • review_queue_items: item_id, request_id, labels_under_review, assigned_reviewer_id, escalated_at, reviewed_at, verdict, reviewer_notes
  • training_examples: content_id, label, source (model or human), confidence, created_at

Core Algorithms

Multi-Label Classification

Each model in the ensemble is a binary classifier per label, outputting a score in [0, 1] for that label independently. The multi-label decision is made by applying a per-label threshold to each score independently, so a piece of content can receive any combination of labels. Per-label thresholds are tuned on a held-out validation set to hit a target precision of 0.95 for blocking actions, accepting lower recall to minimize false positives that would incorrectly penalize legitimate content.
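The per-label thresholding described above can be sketched as follows. This is a minimal illustration, not the production implementation; the `LABEL_THRESHOLDS` values here are made up for the example, whereas in practice each threshold is tuned on the held-out validation set to hit the 0.95 precision target.

```python
# Illustrative per-label thresholds (real values come from validation tuning).
LABEL_THRESHOLDS = {
    "spam": 0.85,
    "adult": 0.90,
    "violence": 0.90,
    "hate_speech": 0.88,
    "misinformation": 0.80,
}

def apply_label_thresholds(scores: dict[str, float]) -> list[str]:
    """Return every label whose independent score clears its own threshold.

    Each label is decided independently, so an item can receive any
    combination of labels (the defining property of multi-label output).
    """
    return [
        label
        for label, score in scores.items()
        if score >= LABEL_THRESHOLDS.get(label, 0.5)
    ]

# A single item can carry multiple labels simultaneously.
labels = apply_label_thresholds({"spam": 0.91, "adult": 0.95, "violence": 0.10})
# labels == ["spam", "adult"]
```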

Ensemble Voting

Three model types contribute predictions: a fine-tuned BERT-based text classifier, a ResNet-based image classifier (for posts with media), and a structured metadata classifier (account age, post frequency, link domains). Ensemble voting uses a weighted average of per-model scores for each label: ensemble_score = sum(w_i * score_i) / sum(w_i). Weights are maintained in a configuration store and updated quarterly based on each model's offline F1 score on the validation set. If a model is unavailable (for example, the image model on text-only content), its weight is redistributed proportionally among the remaining models.
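The weighted average and the proportional weight redistribution fall out of the same formula: dividing by the sum of the weights of the models that actually responded is equivalent to redistributing a missing model's weight proportionally. A sketch, with model names and weights invented for illustration:

```python
def ensemble_score(model_scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Weighted average over the models that produced a score for this label.

    Models absent from model_scores (e.g., the image model on text-only
    content) are dropped; normalizing by the sum of the remaining weights
    redistributes the missing weight proportionally.
    """
    available = {m: s for m, s in model_scores.items() if m in weights}
    total_weight = sum(weights[m] for m in available)
    if total_weight == 0:
        raise ValueError("no weighted models produced a score")
    return sum(weights[m] * s for m, s in available.items()) / total_weight

# Hypothetical weights, as would be read from the configuration store.
weights = {"text_bert": 0.5, "image_resnet": 0.3, "metadata_gbt": 0.2}

# Text-only content: the image model is missing, its weight is redistributed.
score = ensemble_score({"text_bert": 0.9, "metadata_gbt": 0.6}, weights)
# (0.5 * 0.9 + 0.2 * 0.6) / 0.7 ≈ 0.814
```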

Confidence-Gated Human Review

After ensemble scoring, each label is classified into one of three disposition zones by comparing ensemble_score to two thresholds: auto_block_threshold (default 0.90) and review_threshold (default 0.60). Scores above auto_block_threshold trigger immediate blocking. Scores between review_threshold and auto_block_threshold escalate to the human review queue. Scores below review_threshold are auto-approved. The dual-threshold design keeps the human review queue manageable: only genuinely ambiguous predictions require human judgment, while clear violations and clear approvals are handled automatically.
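The three disposition zones reduce to two comparisons per label. A minimal sketch using the default thresholds from the text (in production both values would come from configuration):

```python
AUTO_BLOCK_THRESHOLD = 0.90  # default; configurable per deployment
REVIEW_THRESHOLD = 0.60      # default; configurable per deployment

def disposition(ensemble_score: float) -> str:
    """Map an ensemble score for one label to an action.

    >= AUTO_BLOCK_THRESHOLD            -> block automatically
    [REVIEW_THRESHOLD, AUTO_BLOCK)     -> escalate to human review
    <  REVIEW_THRESHOLD                -> allow automatically
    """
    if ensemble_score >= AUTO_BLOCK_THRESHOLD:
        return "block"
    if ensemble_score >= REVIEW_THRESHOLD:
        return "escalate"
    return "allow"
```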

Feedback Loop

Human reviewer verdicts are written to the training_examples table as high-confidence labeled examples (confidence = 1.0). A nightly retraining pipeline (Apache Airflow DAG) selects the last 30 days of human-reviewed examples, samples an equal volume of auto-approved examples, and fine-tunes each model on the combined dataset. Models are evaluated offline against a fixed benchmark dataset; models that improve F1 by more than 0.5% are promoted to the canary slot and receive 10% of traffic for 24 hours before full promotion.
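The dataset-selection step of the nightly pipeline can be sketched as below. This is a simplification under stated assumptions: examples are plain dicts rather than rows pulled from training_examples, and the "equal volume" policy is implemented as a seeded random sample of auto-approved examples.

```python
import random

def build_retraining_set(human_reviewed: list[dict],
                         auto_approved: list[dict],
                         seed: int = 42) -> list[dict]:
    """Combine all human-reviewed examples with an equal-sized random
    sample of auto-approved examples, mirroring the nightly DAG's policy.
    """
    rng = random.Random(seed)  # seeded for reproducible pipeline runs
    n = min(len(human_reviewed), len(auto_approved))
    sampled = rng.sample(auto_approved, n)
    return human_reviewed + sampled
```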

Scalability

Classification requests are served synchronously for latency-sensitive callers (post submission flow) and asynchronously via Kafka for batch backfill jobs. The ensemble orchestrator calls each model microservice in parallel using async HTTP with a 300ms timeout per model. Models that time out are excluded from the ensemble for that request, and the remaining models vote with redistributed weights. This prevents a slow model from blocking the entire classification decision.
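The fan-out-with-timeout behavior can be sketched with asyncio. The model calls here are stand-ins (simulated with `asyncio.sleep`) rather than real HTTP requests; the point is that each call is bounded by `asyncio.wait_for`, and models that exceed the timeout are simply dropped from the result, leaving the remaining models to vote with redistributed weights.

```python
import asyncio

async def call_model(model_id: str, delay: float, score: float) -> float:
    """Stand-in for an async HTTP call to a model microservice."""
    await asyncio.sleep(delay)
    return score

async def fan_out(calls: dict[str, tuple[float, float]],
                  timeout: float = 0.3) -> dict[str, float]:
    """Call every model in parallel; exclude any that exceed the timeout."""
    async def guarded(model_id: str, delay: float, score: float):
        try:
            result = await asyncio.wait_for(
                call_model(model_id, delay, score), timeout)
            return model_id, result
        except asyncio.TimeoutError:
            return model_id, None  # excluded from the ensemble this request

    results = await asyncio.gather(
        *(guarded(m, d, s) for m, (d, s) in calls.items()))
    return {m: s for m, s in results if s is not None}

# The slow image model (0.5s > 0.3s timeout) is excluded from the vote.
scores = asyncio.run(fan_out({
    "text_bert": (0.01, 0.9),
    "image_resnet": (0.5, 0.7),
    "metadata_gbt": (0.02, 0.6),
}))
```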

The human review queue is backed by Postgres with a partial index on (status = pending, escalated_at ASC) to efficiently serve the oldest-first assignment query used by the reviewer dashboard.

API Design

POST /v1/classify

  • Body: content_id, content_type (text, image, video, mixed), text (optional), media_url (optional), metadata (JSON)
  • Response: request_id, decisions array (label, ensemble_score, action), reviewed_by_human (false for automated decisions)

POST /v1/review/{item_id}/verdict

  • Body: verdict (allow or block), labels (array of confirmed labels), reviewer_notes
  • Response: item_id, updated_action, training_example_id
  • Auth: reviewer role required

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does multi-label classification differ from single-label in a content classifier?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "In single-label classification the model picks one class per input; in multi-label classification each input can belong to multiple classes simultaneously (e.g., a post can be 'spam' and 'adult content' at the same time). The output layer uses independent sigmoid activations instead of a softmax, producing a probability per label. Thresholds are tuned independently for each label based on desired precision/recall trade-offs."
      }
    },
    {
      "@type": "Question",
      "name": "How does ensemble voting improve classification accuracy?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "An ensemble runs multiple models (e.g., a fine-tuned transformer, a gradient-boosted tree on hand-crafted features, and a rule-based classifier) and combines their outputs. Hard voting takes the majority label; soft voting averages predicted probabilities and picks the highest. Stacking trains a meta-learner on the outputs of base models. Ensembles reduce variance and catch cases where a single model is miscalibrated, improving robustness across content domains."
      }
    },
    {
      "@type": "Question",
      "name": "When and how should a content classifier route items to human review?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Items with model confidence in a configured ambiguous band (e.g., 0.40–0.75 for a sensitive label) are enqueued for human review rather than auto-actioned. A moderation queue prioritizes items by severity score and time-in-queue SLA. Human decisions are written back to a labeled dataset. A circuit-breaker monitors queue depth; if it exceeds capacity, the confidence threshold is tightened to reduce inflow until the backlog clears."
      }
    },
    {
      "@type": "Question",
      "name": "How does a feedback loop improve a content classifier over time?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Human review decisions and user appeals that overturn automated labels are collected as ground-truth corrections. These are merged into the training dataset and trigger a scheduled retraining pipeline (e.g., weekly). A shadow deployment evaluates the new model against the current one on a held-out slice of recent traffic before promotion. Continuous evaluation dashboards track label drift and model degradation, triggering retraining automatically when F1 drops below a threshold."
      }
    }
  ]
}

