What is an AI content safety service and what types of content does it filter?

An AI content safety service automatically screens user-generated or model-generated content for policy violations before it is stored or served. It typically filters categories such as hate speech, harassment, violence, self-harm, sexual content, spam, misinformation, and illegal activity. The service exposes an API that accepts text, images, or video and returns category scores and a recommended action.

How does a classifier ensemble work in an AI content safety service?

Multiple specialized models are trained for distinct harm categories and run in parallel against the input. Each model outputs a confidence score for its category. An aggregation layer collects all scores and applies weighted voting or a meta-classifier to produce a combined risk score per category. Using an ensemble improves recall by letting lightweight fast models catch obvious violations while more expensive models handle nuanced edge cases.
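
The aggregation step can be illustrated with a minimal sketch, assuming a weighted average per category. The model names, the weights, and the `aggregate` function are hypothetical and only show the shape of the computation.

```python
from collections import defaultdict

# Hypothetical per-model weights; a real system would tune these offline.
MODEL_WEIGHTS = {"fast_keyword": 1.0, "transformer_large": 3.0}


def aggregate(scores: list[tuple[str, str, float]]) -> dict[str, float]:
    """Combine (model, category, score) triples into one weighted score per category."""
    weighted_sum: dict[str, float] = defaultdict(float)
    weight_total: dict[str, float] = defaultdict(float)
    for model, category, score in scores:
        weight = MODEL_WEIGHTS.get(model, 1.0)  # unknown models get weight 1.0
        weighted_sum[category] += weight * score
        weight_total[category] += weight
    return {cat: weighted_sum[cat] / weight_total[cat] for cat in weighted_sum}
```

A meta-classifier would replace the fixed weights with a learned model over the same per-model scores.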

How does a policy rule engine determine the action for flagged content?

The rule engine evaluates classifier scores against configurable thresholds per category and per tenant. Rules specify ordered actions such as allow, warn, blur, redact, or block that trigger when a score exceeds a threshold. Rules can incorporate context signals like user trust level, content type, and surface placement to apply different policies, for example a stricter policy for content shown to minors.

How is human review integrated into an AI content safety service?

Content that falls into an uncertain confidence band between the allow and block thresholds is routed to a human review queue instead of being decided automatically. A review tool presents the content and classifier scores to moderators who label the correct action. Those labels feed back into retraining pipelines to shift the model's decision boundary and reduce the fraction of content requiring human review over time.
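
The routing described above might look like the following sketch. The threshold values and the `route` function are assumptions chosen for illustration; real thresholds would be set per category and per tenant.

```python
def route(score: float, allow_below: float = 0.3, block_at: float = 0.9) -> str:
    """Return 'allow', 'block', or 'review' for a single risk score."""
    if score < allow_below:
        return "allow"
    if score >= block_at:
        return "block"
    return "review"  # uncertain band between the two thresholds goes to humans
```

As retraining sharpens the classifier, fewer scores land between the two thresholds, which is what shrinks the review queue.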

Low Level Design: AI Content Safety Service


What Is an AI Content Safety Service?

An AI content safety service filters and classifies text (and other modalities) produced by or sent to AI systems to detect harmful, policy-violating, or legally sensitive content. It provides input screening (prompt injection, jailbreaks, unsafe user inputs) and output screening (hallucinations, toxicity, PII leakage, brand safety violations) as a reusable platform service consumed by multiple product teams.

This is a common low-level design question in AI platform and trust & safety engineering interviews. The design covers the classifier ensemble, policy rule engine, human review escalation, and safety metric dashboards.

Requirements

Functional

Input filtering: screen user-submitted text before it reaches an LLM.
Output filtering: screen LLM-generated text before it is shown to users.
Classifier ensemble: run multiple specialized classifiers in parallel.
Policy rule engine: configurable per-tenant rules that map classifier signals to actions.
Actions: allow, block, redact, replace with safe fallback, or escalate to human review.
Human review queue: route ambiguous or high-severity cases to a review interface.
Safety metric dashboards: real-time and historical safety signal reporting.

Non-Functional

Latency: synchronous screening < 100 ms p99 for lightweight classifiers; asynchronous path for heavy models.
Availability: 99.9% uptime; on classifier failure, degrade to fail-open (allow and log) or fail-closed (block) per tenant config.
Auditability: immutable log of every screening decision.
Multi-tenancy: per-tenant policy configuration and data isolation.

High-Level Architecture

Client (LLM Gateway / App)
  |
  v
[Safety Service API]
  |
  |-- [Input Screening Pipeline]
  |       |-- Rule-based pre-filter (regex, blocklist)
  |       |-- Classifier Ensemble (parallel)
  |       |      |-- Toxicity Classifier
  |       |      |-- PII Detector
  |       |      |-- Prompt Injection Detector
  |       |      |-- Jailbreak Detector
  |       |-- Policy Rule Engine
  |       |-- Action Executor (allow/block/redact/escalate)
  |
  |-- [Output Screening Pipeline]
  |       |-- Toxicity Classifier
  |       |-- PII Detector
  |       |-- Hallucination Detector
  |       |-- Brand Safety Classifier
  |       |-- Policy Rule Engine
  |       |-- Action Executor
  |
  |-- [Human Review Queue]
  |-- [Audit Logger]
  |-- [Metrics Pipeline]

Core Data Models

ScreeningRequest

{
  request_id: UUID,
  tenant_id: string,
  direction: enum('input', 'output'),
  content: string,
  content_type: enum('text', 'image', 'audio'),
  user_id: string,
  session_id: string,
  context: map,           // e.g. original user prompt for output screening
  policy_set_id: string,
  created_at: timestamp
}

ClassifierResult

{
  classifier_id: string,
  category: string,       // e.g. "toxicity", "pii", "jailbreak"
  score: float,           // 0.0 - 1.0
  label: string,          // e.g. "hate_speech", "phone_number"
  spans: [{start, end}],  // character offsets for detected segments
  latency_ms: int
}

ScreeningDecision

{
  request_id: UUID,
  action: enum('allow', 'block', 'redact', 'replace', 'escalate'),
  triggered_rules: [rule_id],
  classifier_results: [ClassifierResult],
  modified_content: string,   // populated for redact/replace actions
  review_queue_id: string,    // populated for escalate action
  decision_latency_ms: int,
  created_at: timestamp
}

PolicyRule

{
  rule_id: UUID,
  tenant_id: string,
  name: string,
  direction: enum('input', 'output', 'both'),
  condition: {
    classifier: string,
    category: string,
    operator: enum('gt', 'gte', 'eq'),
    threshold: float
  },
  action: enum('allow', 'block', 'redact', 'replace', 'escalate'),
  priority: int,
  enabled: bool
}

Classifier Ensemble

Classifiers are run in parallel to minimize total latency. Each classifier is a microservice with its own scaling policy:

Toxicity classifier: a fine-tuned transformer (e.g., Detoxify) or a hosted service (e.g., Perspective API) that scores hate speech, sexual content, violence, and self-harm. Returns a per-category score.
PII detector: NER-based model plus regex patterns (e.g., Microsoft Presidio, Amazon Comprehend) detecting names, emails, phone numbers, SSNs, and credit card numbers. Returns span offsets for redaction.
Prompt injection detector: classifier trained on injection attack patterns (instruction override, ignore-previous-instructions variants). Returns binary label + confidence.
Jailbreak detector: classifier for known jailbreak templates plus embedding similarity against a curated jailbreak library.
Hallucination detector (output only): NLI-based fact consistency check comparing output claims against retrieved source context. Requires source documents as input.
Brand safety classifier: topic classifier flagging content incompatible with brand guidelines (competitor mentions, off-topic domains).

Classifiers expose a common gRPC interface:

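syntax = "proto3";

// Service name is illustrative; ClassifierResult mirrors the data model above.
service SafetyClassifier {
  rpc Classify(ClassifyRequest) returns (ClassifyResponse);
}

message ClassifyRequest {
  string content = 1;
  repeated string categories = 2;  // optional filter
}

message ClassifyResponse {
  repeated ClassifierResult results = 1;
}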

The ensemble coordinator fans out to all relevant classifiers using async concurrent calls and collects results within a deadline (e.g., 80 ms). Classifiers that miss the deadline are skipped with a logged timeout; the policy engine treats missing scores as unknown (not safe).
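
A minimal sketch of the fan-out with a deadline, assuming each classifier is wrapped as an async callable that returns a single score. The `run_ensemble` name and the callable shape are hypothetical; a real coordinator would call the gRPC stubs and log each timeout.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Classifier = Callable[[str], Awaitable[float]]


async def run_ensemble(
    content: str, classifiers: dict[str, Classifier], deadline_s: float = 0.080
) -> dict[str, Optional[float]]:
    """Fan out to all classifiers; a score of None means unknown (timed out or failed)."""
    results: dict[str, Optional[float]] = {name: None for name in classifiers}
    if not classifiers:
        return results
    tasks = {asyncio.create_task(fn(content)): name for name, fn in classifiers.items()}
    done, pending = await asyncio.wait(tasks.keys(), timeout=deadline_s)
    for task in pending:
        task.cancel()  # straggler missed the deadline; its score stays None
    for task in done:
        if task.exception() is None:
            results[tasks[task]] = task.result()
    return results
```

Keeping `None` distinct from `0.0` lets the policy engine tell "no violation found" apart from "classifier did not answer".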

Policy Rule Engine

The rule engine evaluates PolicyRules in priority order against the collected ClassifierResults to determine the final action.

Load active rules for tenant_id (cached in Redis with TTL 60 s, invalidated on rule update).
Sort rules by priority (lower number = higher priority).
Evaluate conditions: compare classifier score against threshold.
First matching rule wins; its action is the output.
If no rule matches, apply the tenant default policy (allow or block).
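
The evaluation steps above can be sketched as follows. This is an illustration under stated assumptions: the `Rule` dataclass flattens the PolicyRule condition into top-level fields, rule loading and caching are omitted, and rules whose score is missing are simply skipped rather than treated specially.

```python
import operator
from dataclasses import dataclass
from typing import Optional

OPERATORS = {"gt": operator.gt, "gte": operator.ge, "eq": operator.eq}


@dataclass
class Rule:
    rule_id: str
    category: str
    op: str  # one of 'gt', 'gte', 'eq'
    threshold: float
    action: str
    priority: int
    enabled: bool = True


def decide(
    rules: list[Rule], scores: dict[str, float], default_action: str = "allow"
) -> tuple[str, Optional[str]]:
    """Return (action, rule_id) for the first matching rule, else the tenant default."""
    for rule in sorted(rules, key=lambda r: r.priority):  # lower number first
        if not rule.enabled or rule.category not in scores:
            continue
        if OPERATORS[rule.op](scores[rule.category], rule.threshold):
            return rule.action, rule.rule_id
    return default_action, None
```

Because the first match wins, a strict rule such as "block at 0.9" needs a lower priority number than a broader "escalate at 0.5" rule on the same category.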

Compound conditions (AND/OR across classifiers) are supported via a simple expression tree stored in the rule condition JSON. The engine evaluates the tree recursively.
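
One possible shape for such a tree, shown as a sketch: the `"and"` / `"or"` keys and the leaf layout are assumptions, since the exact JSON schema is not specified here.

```python
import operator

OPS = {"gt": operator.gt, "gte": operator.ge, "eq": operator.eq}


def evaluate(node: dict, scores: dict[str, float]) -> bool:
    """Recursively evaluate an AND/OR tree of threshold conditions."""
    if "and" in node:
        return all(evaluate(child, scores) for child in node["and"])
    if "or" in node:
        return any(evaluate(child, scores) for child in node["or"])
    # Leaf: a single threshold comparison; in this sketch a missing score never matches.
    score = scores.get(node["category"])
    return score is not None and OPS[node["operator"]](score, node["threshold"])
```

For example, a rule could require a high jailbreak score together with either a toxicity or a PII signal by nesting an `"or"` node inside an `"and"` node.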

Rule changes take effect within one cache TTL (60 s) without service restart.

Action Execution

Allow: return content unchanged.
Block: return a blocked response to the caller; do not forward to LLM or user.
Redact: replace detected PII spans with type-specific placeholders (e.g., [PHONE_NUMBER], [EMAIL]). Uses span offsets from PII detector.
Replace: substitute the entire content with a configured safe fallback message.
Escalate: write to human review queue; for input screening, hold the request (synchronous) or allow with flag (asynchronous) depending on tenant config.
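
The redact action can be sketched as below. The sketch assumes each span carries its own label alongside the offsets, that `end` is exclusive, and that spans do not overlap; the `redact` function itself is hypothetical.

```python
def redact(content: str, spans: list[dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder.

    Each span is assumed to look like {"start": int, "end": int, "label": str}.
    """
    # Apply replacements right to left so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label'].upper()}]"
        content = content[: span["start"]] + placeholder + content[span["end"] :]
    return content
```

Working from the end of the string backwards matters because each placeholder usually has a different length than the text it replaces.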

Human Review Queue

Escalated items are written to a review queue backed by a task management system (internal tool or commercial solution).

{
  review_id: UUID,
  request_id: UUID,
  tenant_id: string,
  content: string,
  classifier_results: [...],
  triggered_rules: [...],
  severity: enum('low', 'medium', 'high', 'critical'),
  status: enum('pending', 'in_review', 'resolved'),
  reviewer_id: string,
  resolution: enum('allow', 'block', 'escalate_legal'),
  resolution_notes: string,
  resolved_at: timestamp
}

Review UI presents the content, classifier signals, and rule triggers. Reviewer decisions feed back into a training data pipeline to improve classifier accuracy over time (active learning loop).

SLA enforcement: critical items are paged to the on-call reviewer within 5 minutes; high-severity items within 1 hour.
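
A small sketch of how a breach check for these targets could be computed. Only the two targets stated above are encoded; the function names are hypothetical, and lower severities deliberately have no deadline here because none is specified.

```python
from datetime import datetime, timedelta
from typing import Optional

# Targets taken from the text above; no target is stated for lower severities.
SLA = {"critical": timedelta(minutes=5), "high": timedelta(hours=1)}


def sla_deadline(severity: str, created_at: datetime) -> Optional[datetime]:
    """Return the time by which the item must be paged, or None if no SLA applies."""
    target = SLA.get(severity)
    return created_at + target if target is not None else None


def is_breached(severity: str, created_at: datetime, now: datetime) -> bool:
    """True when an SLA exists for this severity and it has already passed."""
    deadline = sla_deadline(severity, created_at)
    return deadline is not None and now > deadline
```

The same check can drive the SLA breach rate shown on the review queue dashboard.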

Synchronous vs. Asynchronous Screening

Synchronous (blocking): caller waits for screening decision before proceeding. Required for input screening where you must prevent unsafe prompts from reaching the LLM. Enforces strict latency budget.
Asynchronous (non-blocking): caller proceeds; screening runs in background. Used for output logging, compliance auditing, or low-risk post-hoc review where latency matters more than strict enforcement.
Hybrid: run fast rule-based pre-filter synchronously; run heavyweight ML classifiers asynchronously and take action if violation is detected after the fact (retroactive block or user notification).

Safety Metric Dashboards

Key metrics surfaced in the dashboard:

Volume: requests screened per minute, breakdown by direction (input/output) and tenant.
Violation rate: fraction of requests triggering each category (toxicity, PII, jailbreak) over time.
Action distribution: allow vs. block vs. redact vs. escalate rates.
Classifier latency: p50/p95/p99 per classifier; timeout rate.
Review queue health: queue depth, median time to resolution, SLA breach rate.
False positive tracking: rate of reviewer overrides (allow decisions on escalated items) used to tune thresholds.

Metrics are emitted to a time-series store (e.g., Prometheus with Grafana dashboards, or Datadog). A violation rate above a configured threshold triggers automated alerts. Tenant-facing safety reports are generated weekly from ClickHouse aggregations.

Failure Modes and Degraded Operation

Classifier service unavailable: apply fail-open (allow + log) or fail-closed (block) per tenant config. Emit an alert.
Policy engine config unavailable (Redis down): fall back to in-memory cached rules; emit an alert.
Review queue full: auto-promote critical items to PagerDuty; lower-severity items are logged and batch-reviewed.
Latency spike: fast-path rule-based filter always runs; ML classifiers are skipped beyond deadline rather than blocking the user.
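
The per-tenant fail-open or fail-closed choice can be sketched as a small guard that runs before rule evaluation. The `failure_mode` values and the `resolve_missing` function are assumed names, not part of a specific implementation.

```python
from typing import Optional


def resolve_missing(scores: dict[str, Optional[float]], failure_mode: str) -> Optional[str]:
    """Return a forced action when any classifier score is missing, else None."""
    missing = [name for name, score in scores.items() if score is None]
    if not missing:
        return None  # every classifier answered; normal rule evaluation applies
    if failure_mode == "fail_closed":
        return "block"
    if failure_mode == "fail_open":
        return "allow"  # a real service would also log the request and emit an alert
    raise ValueError(f"unknown failure_mode: {failure_mode!r}")
```

Fail-open favors availability and suits low-risk surfaces; fail-closed favors safety and suits tenants who would rather reject a request than serve unscreened content.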

Common Interview Follow-Ups

How do you keep classifiers up to date with new attack patterns? Weekly re-training pipeline using production escalation data; blue-green classifier deployment with shadow mode evaluation before cutover.
How do you handle multilingual content? Language detection first; route to language-specific classifiers or use multilingual models (mBERT, XLM-R).
How do you prevent bypasses using encoded or obfuscated text? Normalizer pre-processor (unicode normalization, leetspeak decoder, HTML entity decode) runs before classifiers.
How do you measure recall vs. precision tradeoffs? Offline evaluation on labeled datasets; adjust classifier thresholds using precision-recall curves; monitor false positive rate via reviewer override rate in production.
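
For the obfuscation follow-up, a normalizer pre-processor could look like this minimal sketch. The leetspeak table is a hypothetical, deliberately tiny mapping, and the `normalize` function is illustrative only.

```python
import html
import unicodedata

# Hypothetical, deliberately tiny leetspeak table; a real one would be larger.
LEET = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)


def normalize(text: str) -> str:
    """Canonicalize text before it is passed to the classifiers."""
    text = html.unescape(text)  # decode HTML entities such as &amp;
    text = unicodedata.normalize("NFKC", text)  # fold full-width and compatibility forms
    return text.lower().translate(LEET)  # lowercase, then map leetspeak characters
```

Because the leetspeak step rewrites digits and symbols, the normalized copy is best used only for classifiers such as toxicity or jailbreak detection; PII detection and redaction offsets should still be computed on the original text.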