What Is an AI Content Safety Service?
An AI content safety service filters and classifies text (and other modalities) produced by or sent to AI systems to detect harmful, policy-violating, or legally sensitive content. It provides input screening (prompt injection, jailbreaks, unsafe user inputs) and output screening (hallucinations, toxicity, PII leakage, brand safety violations) as a reusable platform service consumed by multiple product teams.
This is a common low-level design question in AI platform and trust & safety engineering interviews. The design covers the classifier ensemble, policy rule engine, human review escalation, and safety metric dashboards.
Requirements
Functional
- Input filtering: screen user-submitted text before it reaches an LLM.
- Output filtering: screen LLM-generated text before it is shown to users.
- Classifier ensemble: run multiple specialized classifiers in parallel.
- Policy rule engine: configurable per-tenant rules that map classifier signals to actions.
- Actions: allow, block, redact, replace with safe fallback, or escalate to human review.
- Human review queue: route ambiguous or high-severity cases to a review interface.
- Safety metric dashboards: real-time and historical safety signal reporting.
Non-Functional
- Latency: synchronous screening < 100 ms p99 for lightweight classifiers; asynchronous path for heavy models.
- Availability: 99.9% uptime; on classifier failure, run in degraded mode (allow and log, or block, per tenant config).
- Auditability: immutable log of every screening decision.
- Multi-tenancy: per-tenant policy configuration and data isolation.
High-Level Architecture
Client (LLM Gateway / App)
  |
  v
[Safety Service API]
  |
  |-- [Input Screening Pipeline]
  |     |-- Rule-based pre-filter (regex, blocklist)
  |     |-- Classifier Ensemble (parallel)
  |     |     |-- Toxicity Classifier
  |     |     |-- PII Detector
  |     |     |-- Prompt Injection Detector
  |     |     |-- Jailbreak Detector
  |     |-- Policy Rule Engine
  |     |-- Action Executor (allow/block/redact/escalate)
  |
  |-- [Output Screening Pipeline]
  |     |-- Toxicity Classifier
  |     |-- PII Detector
  |     |-- Hallucination Detector
  |     |-- Brand Safety Classifier
  |     |-- Policy Rule Engine
  |     |-- Action Executor
  |
  |-- [Human Review Queue]
  |-- [Audit Logger]
  |-- [Metrics Pipeline]
Core Data Models
ScreeningRequest
{
  request_id: UUID,
  tenant_id: string,
  direction: enum('input', 'output'),
  content: string,
  content_type: enum('text', 'image', 'audio'),
  user_id: string,
  session_id: string,
  context: map,            // e.g. original user prompt for output screening
  policy_set_id: string,
  created_at: timestamp
}
ClassifierResult
{
  classifier_id: string,
  category: string,        // e.g. "toxicity", "pii", "jailbreak"
  score: float,            // 0.0 - 1.0
  label: string,           // e.g. "hate_speech", "phone_number"
  spans: [{start, end}],   // character offsets for detected segments
  latency_ms: int
}
ScreeningDecision
{
  request_id: UUID,
  action: enum('allow', 'block', 'redact', 'replace', 'escalate'),
  triggered_rules: [rule_id],
  classifier_results: [ClassifierResult],
  modified_content: string,   // populated for redact/replace actions
  review_queue_id: string,    // populated for escalate action
  decision_latency_ms: int,
  created_at: timestamp
}
PolicyRule
{
  rule_id: UUID,
  tenant_id: string,
  name: string,
  direction: enum('input', 'output', 'both'),
  condition: {
    classifier: string,
    category: string,
    operator: enum('gt', 'gte', 'eq'),
    threshold: float
  },
  action: enum('allow', 'block', 'redact', 'replace', 'escalate'),
  priority: int,
  enabled: bool
}
Classifier Ensemble
Classifiers are run in parallel to minimize total latency. Each classifier is a microservice with its own scaling policy:
- Toxicity classifier: a fine-tuned transformer (e.g., Detoxify) or a hosted API (e.g., Perspective API) scoring hate speech, sexual content, violence, and self-harm. Returns a per-category score.
- PII detector: NER-based model plus regex patterns (e.g., Microsoft Presidio, Amazon Comprehend) detecting names, emails, phone numbers, SSNs, and credit card numbers. Returns span offsets for redaction.
- Prompt injection detector: classifier trained on injection attack patterns (instruction override, ignore-previous-instructions variants). Returns binary label + confidence.
- Jailbreak detector: classifier for known jailbreak templates plus embedding similarity against a curated jailbreak library.
- Hallucination detector (output only): NLI-based fact consistency check comparing output claims against retrieved source context. Requires source documents as input.
- Brand safety classifier: topic classifier flagging content incompatible with brand guidelines (competitor mentions, off-topic domains).
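As an illustration of the "returns span offsets" contract, here is a minimal sketch of the regex half of a PII detector. It is not the Presidio or Comprehend API: the `PATTERNS` table, the `Span` type, and `detect_pii` are hypothetical names for this example, and the patterns are deliberately simplistic.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # PII type, reused later as the redaction placeholder


# Hypothetical, intentionally simple patterns. Real detectors add checksum
# validation (e.g. Luhn for card numbers) and an NER model for names.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def detect_pii(text: str) -> list[Span]:
    """Return every pattern match as a labeled span, ordered by position."""
    spans = [
        Span(match.start(), match.end(), label)
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda span: span.start)
```

Returning offsets rather than the matched strings lets a downstream redaction step rewrite the original content without searching it again.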
Classifiers expose a common gRPC interface:
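// Service name is illustrative; an rpc must be declared inside a service block.
service Classifier {
  rpc Classify(ClassifyRequest) returns (ClassifyResponse);
}

message ClassifyRequest {
  string content = 1;
  repeated string categories = 2;  // optional filter
}

message ClassifyResponse {
  repeated ClassifierResult results = 1;  // mirrors the ClassifierResult data model
}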
The ensemble coordinator fans out to all relevant classifiers with concurrent async calls and collects results within a deadline (e.g., 80 ms, which leaves headroom inside the 100 ms p99 budget). Classifiers that miss the deadline are skipped and the timeout is logged; the policy engine treats their missing scores as unknown rather than as safe.
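The fan-out with a deadline might look like the sketch below. It uses plain `asyncio` coroutines in place of gRPC stubs; `run_ensemble`, the `Classifier` callable type, and the simplified `ClassifierResult` are assumptions made for illustration. A timed-out or failing classifier yields a score of `None`, which stands for "unknown".

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional

# Stand-in for a gRPC stub: takes content, returns a score in [0.0, 1.0].
Classifier = Callable[[str], Awaitable[float]]


@dataclass
class ClassifierResult:
    classifier_id: str
    score: Optional[float]  # None = unknown (timeout or error), never "safe"
    latency_ms: int


async def _run_one(name: str, fn: Classifier, content: str,
                   deadline_s: float) -> ClassifierResult:
    started = time.monotonic()
    try:
        score = await asyncio.wait_for(fn(content), timeout=deadline_s)
    except Exception:  # timeouts and classifier errors both become "unknown"
        score = None
    latency_ms = int((time.monotonic() - started) * 1000)
    return ClassifierResult(name, score, latency_ms)


async def run_ensemble(classifiers: dict[str, Classifier], content: str,
                       deadline_s: float = 0.08) -> dict[str, ClassifierResult]:
    """Call all classifiers concurrently; each gets the same deadline."""
    results = await asyncio.gather(
        *(_run_one(name, fn, content, deadline_s)
          for name, fn in classifiers.items())
    )
    return {result.classifier_id: result for result in results}
```

Because all calls start together, a per-call timeout equal to the deadline bounds total wall-clock time at roughly that deadline, and one slow classifier cannot delay the others.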
Policy Rule Engine
The rule engine evaluates PolicyRules in priority order against the collected ClassifierResults to determine the final action.
- Load active rules for tenant_id (cached in Redis with TTL 60 s, invalidated on rule update).
- Sort rules by priority (lower number = higher priority).
- Evaluate conditions: compare classifier score against threshold.
- First matching rule wins; its action is the output.
- If no rule matches, apply the tenant default policy (allow or block).
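The evaluation steps above can be sketched as a first-match loop. This is a simplified illustration, assuming rules are already loaded for the tenant and scores are keyed by `(classifier, category)`; the flattened `PolicyRule` dataclass and the `decide` function are hypothetical names. A missing score simply fails to match here, leaving "unknown" handling to the degraded-mode policy.

```python
import operator
from dataclasses import dataclass

OPERATORS = {"gt": operator.gt, "gte": operator.ge, "eq": operator.eq}


@dataclass
class PolicyRule:
    rule_id: str
    classifier: str
    category: str
    operator: str      # one of "gt", "gte", "eq"
    threshold: float
    action: str        # "allow", "block", "redact", "replace", "escalate"
    priority: int      # lower number = higher priority
    enabled: bool = True


def decide(rules: list[PolicyRule],
           scores: dict[tuple[str, str], float],
           default_action: str = "allow") -> tuple[str, list[str]]:
    """Return (action, triggered_rule_ids) using first-match-wins semantics."""
    active = sorted((r for r in rules if r.enabled), key=lambda r: r.priority)
    for rule in active:
        score = scores.get((rule.classifier, rule.category))
        if score is None:
            continue  # no score for this rule's classifier: cannot match
        if OPERATORS[rule.operator](score, rule.threshold):
            return rule.action, [rule.rule_id]
    return default_action, []  # tenant default policy
```

With a block rule at priority 10 (threshold 0.9) and an escalate rule at priority 20 (threshold 0.5) on the same category, a score of 0.95 blocks and a score of 0.6 escalates.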
Compound conditions (AND/OR across classifiers) are supported via a simple expression tree stored in the rule condition JSON. The engine evaluates the tree recursively.
Rule changes take effect without a service restart: immediately when cache invalidation succeeds, and at worst within one cache TTL (60 s).
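One possible shape for the compound-condition tree, with a recursive evaluator, is sketched below. The JSON layout is an assumption, since the text does not specify one: `"all"` denotes AND, `"any"` denotes OR, and leaves carry `category`, `operator`, and `threshold`.

```python
import operator
from typing import Any

OPERATORS = {"gt": operator.gt, "gte": operator.ge, "eq": operator.eq}


def evaluate(node: dict[str, Any], scores: dict[str, float]) -> bool:
    """Recursively evaluate a condition tree against per-category scores."""
    if "all" in node:   # AND: every child must hold
        return all(evaluate(child, scores) for child in node["all"])
    if "any" in node:   # OR: at least one child must hold
        return any(evaluate(child, scores) for child in node["any"])
    # Leaf: compare one category's score against a threshold.
    score = scores.get(node["category"])
    if score is None:
        return False    # unknown score cannot satisfy a leaf
    return OPERATORS[node["operator"]](score, node["threshold"])


# Hypothetical rule: jailbreak >= 0.8 OR (toxicity > 0.5 AND pii > 0.5)
condition = {
    "any": [
        {"category": "jailbreak", "operator": "gte", "threshold": 0.8},
        {"all": [
            {"category": "toxicity", "operator": "gt", "threshold": 0.5},
            {"category": "pii", "operator": "gt", "threshold": 0.5},
        ]},
    ]
}
```

A single-classifier condition is just a tree with one leaf, so the same evaluator covers both simple and compound rules.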
Action Execution
- Allow: return content unchanged.
- Block: return a blocked response to the caller; do not forward to LLM or user.
- Redact: replace detected PII spans with type-specific placeholders (e.g., [PHONE_NUMBER], [EMAIL]). Uses span offsets from PII detector.
- Replace: substitute the entire content with a configured safe fallback message.
- Escalate: write to human review queue; for input screening, hold the request (synchronous) or allow with flag (asynchronous) depending on tenant config.
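The redact action can be sketched as follows. It assumes each span is a dict carrying `start`, `end` (exclusive), and a `label` used as the placeholder, and that spans do not overlap; both are assumptions of this example rather than something the data model guarantees.

```python
def redact(content: str, spans: list[dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder.

    Spans are applied from right to left so that replacing one span
    does not shift the character offsets of spans that come before it.
    """
    redacted = content
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label']}]"
        redacted = redacted[:span["start"]] + placeholder + redacted[span["end"]:]
    return redacted
```

The result is what would populate `modified_content` on the ScreeningDecision. If detectors can emit overlapping spans, they would need to be merged before this step.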
Human Review Queue
Escalated items are written to a review queue backed by a task management system (internal tool or commercial solution).
{
  review_id: UUID,
  request_id: UUID,
  tenant_id: string,
  content: string,
  classifier_results: [...],
  triggered_rules: [...],
  severity: enum('low', 'medium', 'high', 'critical'),
  status: enum('pending', 'in_review', 'resolved'),
  reviewer_id: string,
  resolution: enum('allow', 'block', 'escalate_legal'),
  resolution_notes: string,
  resolved_at: timestamp
}
The review UI presents the content, classifier signals, and rule triggers. Reviewer decisions feed back into a training data pipeline to improve classifier accuracy over time (active learning loop).
SLA enforcement: critical items page the on-call reviewer within 5 minutes; high-severity items must reach a reviewer within 1 hour.
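A periodic SLA check could be as small as the sketch below. It assumes each review item records when it was enqueued (a `created_at` timestamp, which the schema above does not list) and treats only `pending` items as candidates for a breach; severities with no stated SLA never breach.

```python
from datetime import datetime, timedelta, timezone

# Limits taken from the text; "low" and "medium" have no stated SLA.
SLA_LIMITS = {
    "critical": timedelta(minutes=5),
    "high": timedelta(hours=1),
}


def sla_breached(severity: str, created_at: datetime, now: datetime,
                 status: str = "pending") -> bool:
    """True if a still-pending item has waited longer than its SLA."""
    limit = SLA_LIMITS.get(severity)
    if limit is None or status != "pending":
        return False
    return now - created_at > limit


# Example: a critical item enqueued six minutes ago is in breach.
enqueued = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
breached = sla_breached("critical", enqueued, enqueued + timedelta(minutes=6))
```

The share of items for which this returns true is one way to derive an SLA breach rate for the dashboards.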
Synchronous vs. Asynchronous Screening
- Synchronous (blocking): caller waits for screening decision before proceeding. Required for input screening where you must prevent unsafe prompts from reaching the LLM. Enforces strict latency budget.
- Asynchronous (non-blocking): caller proceeds; screening runs in background. Used for output logging, compliance auditing, or low-risk post-hoc review where latency matters more than strict enforcement.
- Hybrid: run fast rule-based pre-filter synchronously; run heavyweight ML classifiers asynchronously and take action if violation is detected after the fact (retroactive block or user notification).
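The hybrid mode can be sketched with `asyncio` as below. Everything here is a placeholder for illustration: `BLOCKLIST` is a single hypothetical rule, `heavy_classifier` fakes a slow model, and `retroactive_actions` stands in for whatever channel would perform the retroactive block or user notification.

```python
import asyncio
import re
from typing import Optional

# Hypothetical fast-path rule; a real pre-filter would load many patterns.
BLOCKLIST = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)

# Stand-in for a takedown or notification channel.
retroactive_actions: list[tuple[str, str]] = []


async def heavy_classifier(content: str) -> float:
    """Placeholder for a slow ML model; returns a fake score."""
    await asyncio.sleep(0.01)
    return 0.95 if "attack" in content else 0.05


async def _screen_in_background(request_id: str, content: str,
                                threshold: float = 0.9) -> None:
    if await heavy_classifier(content) >= threshold:
        retroactive_actions.append((request_id, "retroactive_block"))


async def screen_hybrid(request_id: str,
                        content: str) -> tuple[str, Optional[asyncio.Task]]:
    """Block synchronously on the fast path; otherwise allow and keep checking."""
    if BLOCKLIST.search(content):
        return "block", None
    task = asyncio.create_task(_screen_in_background(request_id, content))
    return "allow", task  # caller proceeds; the task finishes later
```

The caller gets an answer after only the regex check, so the latency budget is spent on the cheap path while the expensive model runs off the request path.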
Safety Metric Dashboards
Key metrics surfaced in the dashboard:
- Volume: requests screened per minute, breakdown by direction (input/output) and tenant.
- Violation rate: fraction of requests triggering each category (toxicity, PII, jailbreak) over time.
- Action distribution: allow vs. block vs. redact vs. escalate rates.
- Classifier latency: p50/p95/p99 per classifier; timeout rate.
- Review queue health: queue depth, median time to resolution, SLA breach rate.
- False positive tracking: rate of reviewer overrides (allow decisions on escalated items) used to tune thresholds.
Metrics are emitted to a time-series store with a dashboard layer (Prometheus with Grafana, or Datadog). A spike in violation rate triggers automated alerts. Tenant-facing safety reports are generated weekly from aggregations in an analytical store such as ClickHouse.
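Two of the dashboard metrics are simple enough to show directly. The sketch below computes latency percentiles with the nearest-rank method (one of several percentile definitions, chosen here for simplicity) and the reviewer override rate; the function names are hypothetical, and a metrics backend would normally compute both from raw samples.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def override_rate(resolutions: list[str]) -> float:
    """Share of resolved escalations where the reviewer chose 'allow'.

    Used as a proxy for the false positive rate of the escalation rules.
    """
    if not resolutions:
        return 0.0
    return sum(r == "allow" for r in resolutions) / len(resolutions)
```

A rising override rate for one category suggests its threshold is too aggressive, which is the tuning signal the false positive tracking metric is meant to provide.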
Failure Modes and Degraded Operation
- Classifier service unavailable: apply fail-open (allow + log) or fail-closed (block) per tenant config. Emit an alert.
- Policy engine config unavailable (Redis down): fall back to in-memory cached rules; emit an alert.
- Review queue full: auto-promote critical items to PagerDuty; lower-severity items are logged and batch-reviewed.
- Latency spike: the fast-path rule-based filter always runs; ML classifiers that exceed the deadline are skipped rather than blocking the user, and their scores are treated as unknown.
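The per-tenant fail-open versus fail-closed choice can be sketched as a small post-processing step. This assumes unknown scores are represented as `None` and that the tenant config exposes a `failure_mode` string; both names are illustrative. In fail-open mode the sketch keeps whatever the policy engine decided from the surviving scores, so a block from a healthy classifier is still honored.

```python
from typing import Optional


def resolve_with_failures(scores: dict[str, Optional[float]],
                          policy_action: str,
                          failure_mode: str) -> tuple[str, list[str]]:
    """Return (final_action, missing_classifiers).

    policy_action is the rule engine's decision based on available scores.
    The caller is expected to log and alert whenever the missing list is
    non-empty.
    """
    if failure_mode not in ("fail_open", "fail_closed"):
        raise ValueError(f"unknown failure_mode: {failure_mode}")
    missing = sorted(name for name, score in scores.items() if score is None)
    if not missing:
        return policy_action, []
    if failure_mode == "fail_closed":
        return "block", missing
    return policy_action, missing  # fail_open: proceed, but report the gap
```

Keeping this step separate from rule evaluation makes the degraded-mode behavior auditable on its own in the decision log.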
Common Interview Follow-Ups
- How do you keep classifiers up to date with new attack patterns? Weekly re-training pipeline using production escalation data; blue-green classifier deployment with shadow mode evaluation before cutover.
- How do you handle multilingual content? Language detection first; route to language-specific classifiers or use multilingual models (mBERT, XLM-R).
- How do you prevent bypasses using encoded or obfuscated text? A normalizer pre-processor (Unicode normalization, leetspeak decoding, HTML entity decoding) runs before the classifiers.
- How do you measure recall vs. precision tradeoffs? Offline evaluation on labeled datasets; adjust classifier thresholds using precision-recall curves; monitor false positive rate via reviewer override rate in production.
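For the obfuscation follow-up, a normalizer pre-processor might look like the sketch below, using only the standard library. The `LEET` mapping is a small hypothetical table, not an exhaustive decoder. Because it rewrites digits and symbols, the normalized text should feed only the abuse classifiers; PII detection and redaction would still run on the original content.

```python
import html
import unicodedata

# Hypothetical, partial leetspeak table for illustration.
LEET = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize(text: str) -> str:
    """Undo common obfuscations before text reaches the classifiers."""
    text = html.unescape(text)                  # "&#105;" -> "i"
    text = unicodedata.normalize("NFKC", text)  # fullwidth and other compatibility forms
    # Drop invisible format characters such as zero-width spaces (category Cf).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return text.casefold().translate(LEET)
```

For example, `normalize("&#105;gn0r3 prev1ous")` yields `"ignore previous"`, which a blocklist or injection classifier can then match.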