ML System Design: Build a Spam Classifier

“Design a spam classifier” is one of the most common ML system design questions at Google, Meta, and Microsoft. Unlike pure algorithm questions, this tests your ability to scope a complete ML system: data collection, feature engineering, model selection, serving architecture, feedback loops, and adversarial robustness.

Step 1: Clarify Requirements

Before jumping to models, ask:

  • What is spam? Email spam, SMS, social media posts, comments, reviews, ads? The definition shapes features and labels entirely.
  • What’s the precision/recall trade-off? False positives (legitimate email marked spam) are worse than false negatives for most users. What’s acceptable?
  • What’s the latency requirement? Email classification can tolerate 500ms; SMS must be real-time (<50ms).
  • Volume? Global email traffic is on the order of 300 billion messages/day, and a major provider sees a large fraction of that. This shapes serving infrastructure.
  • Languages and domains? English-only vs multilingual determines tokenization and embedding choices.

Reasonable assumptions: email spam classifier, 100M emails/day, 50ms P99 latency, English+Spanish, false positive rate must be <0.1%.

Step 2: Data Collection and Labeling

Sources of labeled data:

  • User feedback: “Mark as spam” / “Not spam” buttons — high quality but sparse, biased toward visible spam
  • Honeypot accounts: email addresses published online to attract spam; all received mail is labeled spam
  • Manual review queue: internal team labels borderline cases
  • Third-party datasets: SpamAssassin, Enron corpus (use with care — 2000s spam patterns)

Label quality issues:

  • User disagreement: marketing email labeled spam by some, legitimate by others
  • Temporal staleness: spam patterns evolve; labels from 6 months ago may be misleading
  • Selection bias: users rarely mark missed spam; you only see what they report
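
When users disagree on a label, one common approach is to aggregate their reports into a soft label, weighting each user by how often they historically agree with manual review. A minimal sketch (the function and trust-weighting scheme are illustrative, not a specific production system):

```python
from collections import defaultdict

def aggregate_labels(reports, user_trust):
    """Combine conflicting user reports into a soft label per message.

    reports: list of (message_id, user_id, label), label 1 = spam, 0 = ham.
    user_trust: dict user_id -> weight in [0, 1], e.g. historical agreement
    with the manual review queue.
    Returns: dict message_id -> weighted spam probability.
    """
    weight_sum = defaultdict(float)
    spam_weight = defaultdict(float)
    for msg_id, user_id, label in reports:
        w = user_trust.get(user_id, 0.5)  # unknown reporters get neutral weight
        weight_sum[msg_id] += w
        spam_weight[msg_id] += w * label
    return {m: spam_weight[m] / weight_sum[m] for m in weight_sum}
```

A message marked spam by one trusted user and ham by another ends up near 0.5, which is exactly the kind of example worth routing to the manual review queue.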

Step 3: Feature Engineering

Heuristic signals (fast, interpretable, high precision for obvious spam):

  • Sender domain reputation score (queried from reputation DB)
  • SPF/DKIM/DMARC authentication pass/fail
  • Sender’s historical spam rate from this account
  • Reply-to domain != From domain
  • Number of recipients (bulk sending pattern)
  • URL count; presence of known malicious domains
  • HTML-to-text ratio (spam often has excessive HTML)
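
Several of these header-based signals fall out of standard email parsing. A minimal sketch using Python's stdlib `email` module (reputation scores and SPF/DKIM/DMARC results would come from external services in production; here we compute only what the headers themselves give us):

```python
from email import message_from_string
from email.utils import parseaddr, getaddresses

def heuristic_features(raw_email: str) -> dict:
    """Extract header-based heuristic signals from a raw RFC 822 message."""
    msg = message_from_string(raw_email)
    _, from_addr = parseaddr(msg.get("From", ""))
    _, reply_to = parseaddr(msg.get("Reply-To", ""))
    from_domain = from_addr.rsplit("@", 1)[-1].lower() if "@" in from_addr else ""
    reply_domain = reply_to.rsplit("@", 1)[-1].lower() if "@" in reply_to else ""
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return {
        # Reply-To domain differing from From domain is a classic phishing signal
        "reply_to_mismatch": bool(reply_domain) and reply_domain != from_domain,
        # Large recipient counts suggest bulk sending
        "num_recipients": len(recipients),
        "from_domain": from_domain,  # key for the reputation-DB lookup
    }
```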

Content features (for ML model):

  • TF-IDF bag-of-words on subject + body
  • Character n-grams (catch obfuscation: “V1agra”, “fr3e”)
  • Subject line features: ALL CAPS ratio, excessive punctuation, urgency words
  • Embedding from pre-trained model (BERT, or domain-fine-tuned model)
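
Word-level TF-IDF and character n-grams combine naturally in one vectorizer. A sketch with scikit-learn (the n-gram ranges are illustrative starting points, not tuned values):

```python
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF captures topical vocabulary; character 3-5-grams are
# what catch "V1agra"/"fr3e"-style obfuscation that word tokens miss.
vectorizer = FeatureUnion([
    ("words", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
])
```

`char_wb` builds n-grams only inside word boundaries, which keeps the feature space smaller than raw `char` while still covering intra-word obfuscation.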

Behavioral signals:

  • Open rate for this sender across all recipients
  • Reply rate, unsubscribe rate
  • Graph features: has the sender interacted with this recipient before?

Step 4: Model Selection

Two-stage architecture (industry standard):

Stage 1 — Rule-based pre-filter (blocks ~60-70% of obvious spam):

  • Known spam IP blocklist
  • DNS-based blocklist (DNSBL) lookup
  • SpamAssassin-style scoring rules
  • Together these handle roughly 60-70% of volume with no ML cost; very low latency
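
The DNSBL check follows a well-known convention: reverse the IPv4 octets, prepend them to the blocklist zone, and do an A-record lookup; a successful resolution means the IP is listed. A sketch (the zone name is a real public blocklist used as an example; the live lookup needs network access):

```python
import socket

def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
    """DNSBL convention: 1.2.3.4 checked against zone Z becomes
    an A-record lookup of 4.3.2.1.Z."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip: str, zone: str = "zen.spamhaus.org") -> bool:
    """A successful resolution means the IP is on the blocklist."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

In a real pre-filter this lookup is cached aggressively, since the same sending IPs recur across millions of messages.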

Stage 2 — ML classifier (for borderline cases):

| Model | Pros | Cons |
|---|---|---|
| Naive Bayes | Extremely fast, interpretable, handles high-dimensional text well | Independence assumption violated in practice |
| Logistic Regression + TF-IDF | Fast, sparse, good baseline, interpretable coefficients | Misses semantic meaning, no cross-feature interactions |
| Gradient Boosted Trees (LightGBM) | Handles mixed features (text + behavioral + metadata), fast serving | No direct text sequence modeling |
| BERT fine-tuned | Best accuracy, understands context and obfuscation | High latency (100-400ms), expensive to serve |

Recommended stack for 50ms P99:
LightGBM on TF-IDF + heuristic features for 95% of traffic. Route low-confidence predictions to a distilled BERT model. Cache sender reputation scores.
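
The confidence-based routing can be sketched as follows (the 0.2/0.8 thresholds match the serving diagram below and are illustrative; in practice they are tuned to the false-positive budget):

```python
def classify(email_features, lgbm_score, bert_score_fn, low=0.2, high=0.8):
    """Two-stage routing: trust confident LightGBM scores, escalate only
    the uncertain band to the slower distilled BERT model."""
    if lgbm_score >= high:
        return "spam"
    if lgbm_score <= low:
        return "inbox"
    # Low-confidence band (~5% of traffic): pay the BERT latency cost
    return "spam" if bert_score_fn(email_features) >= high else "inbox"
```

Because only the uncertain band hits BERT, the expensive model's latency affects a small slice of traffic and the P99 budget holds.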

Step 5: Evaluation Framework

from sklearn.metrics import roc_curve, roc_auc_score, classification_report
import numpy as np

def evaluate_spam_classifier(y_true, y_scores, false_positive_budget=0.001):
    """
    For spam classification, we typically operate at a fixed false positive rate.
    Find the threshold that maximizes recall while keeping FPR <= budget.
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)

    # fpr is non-decreasing along the curve, so the last index within
    # budget has the highest recall (TPR) at an acceptable FPR.
    valid_idx = np.where(fpr <= false_positive_budget)[0]
    best_idx = valid_idx[-1]
    optimal_threshold = thresholds[best_idx]
    y_pred = (y_scores >= optimal_threshold).astype(int)

    print(f"Operating threshold: {optimal_threshold:.4f}")
    print(f"False Positive Rate: {fpr[best_idx]:.4f}")
    print(f"True Positive Rate (Recall): {tpr[best_idx]:.4f}")
    print(f"AUC-ROC: {roc_auc_score(y_true, y_scores):.4f}")
    print(classification_report(y_true, y_pred, target_names=['Ham', 'Spam']))

    return optimal_threshold

Step 6: Serving Architecture

User sends email
       ↓
DNS/IP Blocklist Check (< 1ms)  ──→ Block immediately if in blocklist
       ↓
Reputation Service (async lookup, ~2ms)
       ↓
Feature Extraction Service
  - Parse email headers/body
  - TF-IDF vectorization
  - Behavioral feature lookup (Redis cache)
       ↓
LightGBM Inference (< 5ms)
  score < 0.2: → Inbox
  score > 0.8: → Spam folder
  0.2-0.8:     → BERT re-score (< 40ms)
       ↓
Feedback collector logs decision
       ↓
Async: update sender reputation, log for retraining

Step 7: Adversarial Robustness

Spammers adapt. Your defense:

  • Text obfuscation: “V!agra” → character n-grams catch this better than word-level features
  • Image spam: embed spam text in images — requires OCR pipeline or image classification
  • Adversarial examples: adding innocuous words to fool classifiers — monitor for distribution shift in features that change without changing semantics
  • Account hijacking: use trusted accounts to send spam — behavioral signals (sudden change in volume/recipients) are key
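
The hijacked-account signal above — a sudden change in sending volume — can be caught with a simple per-account anomaly check. A minimal sketch (production systems use robust statistics and many behavioral dimensions, not a single z-score):

```python
import statistics

def volume_anomaly(daily_counts, threshold=3.0):
    """Flag a hijacked-account pattern: today's send volume is many standard
    deviations above this account's own history.

    daily_counts: per-day send counts, oldest first, today last.
    """
    history, today = daily_counts[:-1], daily_counts[-1]
    if len(history) < 7:
        return False  # not enough history to judge
    mean = statistics.mean(history)
    sd = statistics.pstdev(history) or 1.0  # avoid div-by-zero on flat history
    return (today - mean) / sd > threshold
```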

Step 8: Continuous Learning

  • Retrain weekly on sliding window of recent labeled data + fixed sample of historical data
  • A/B test new model vs. current champion on 5% of traffic before full rollout
  • Shadow mode: run new model alongside current; compare decisions before switching
  • Monitor for false positive regression — new model must not increase legitimate email blocked rate
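
Assembling the retraining set from a sliding window plus a historical sample can be sketched as (window size and sample fraction are illustrative):

```python
import random
from datetime import timedelta

def select_training_window(examples, now, window_days=30,
                           historical_frac=0.1, seed=0):
    """Sliding-window retraining set: everything from the last `window_days`,
    plus a fixed random sample of older data so the model keeps long-lived
    spam patterns without drowning in stale labels.

    examples: list of (timestamp, features, label) tuples.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [e for e in examples if e[0] >= cutoff]
    older = [e for e in examples if e[0] < cutoff]
    # Fixed seed keeps the historical sample stable across runs
    sample = random.Random(seed).sample(older, int(len(older) * historical_frac))
    return recent + sample
```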

Depth Levels

Junior: Describe features you’d use and choose a model. Discuss precision/recall trade-off.

Senior: Design two-stage pipeline, discuss serving latency, describe retraining loop.

Staff: Handle adversarial robustness, multilingual spam, user-level personalization (different spam thresholds per user), and regulatory constraints (GDPR for behavioral signal storage).

Related ML Topics

  • NLP Interview Questions — TF-IDF, BERT fine-tuning, and tokenization trade-offs all appear in spam classifier design; BPE handles obfuscation better than word-level tokenization
  • Handling Imbalanced Datasets — spam is typically 1-5 percent of email volume; scale_pos_weight, SMOTE, and focal loss are all applicable
  • Classification Metrics — spam classifiers operate at fixed false positive rate; precision/recall at threshold and AUC-ROC are the primary evaluation metrics
  • How to Detect Model Drift in Production — spammer adaptation is a form of concept drift; prediction score distribution monitoring catches it early
  • ML System Design: Build a Fraud Detection System — spam and fraud detection share the same architecture patterns: rule-based pre-filter, ML scorer, adversarial adaptation challenges