ML System Design: Build a Spam Classifier

“Design a spam classifier” is one of the most common ML system design questions at Google, Meta, and Microsoft. Unlike pure algorithm questions, this tests your ability to scope a complete ML system: data collection, feature engineering, model selection, serving architecture, feedback loops, and adversarial robustness.

Step 1: Clarify Requirements

Before jumping to models, ask:

  • What is spam? Email spam, SMS, social media posts, comments, reviews, ads? The definition shapes features and labels entirely.
  • What’s the precision/recall trade-off? False positives (legitimate email marked spam) are worse than false negatives for most users. What’s acceptable?
  • What’s the latency requirement? Email classification can tolerate 500ms; SMS must be real-time (<50ms).
  • Volume? Global email traffic is on the order of 300 billion messages/day, and a major provider sees a large fraction of that. This shapes serving infrastructure.
  • Languages and domains? English-only vs multilingual determines tokenization and embedding choices.

Reasonable assumptions: email spam classifier, 100M emails/day, 50ms P99 latency, English+Spanish, false positive rate must be <0.1%.

Step 2: Data Collection and Labeling

Sources of labeled data:

  • User feedback: “Mark as spam” / “Not spam” buttons — high quality but sparse, biased toward visible spam
  • Honeypot accounts: email addresses published online to attract spam; all received mail is labeled spam
  • Manual review queue: internal team labels borderline cases
  • Third-party datasets: SpamAssassin, Enron corpus (use with care — 2000s spam patterns)

Label quality issues:

  • User disagreement: marketing email labeled spam by some, legitimate by others
  • Temporal staleness: spam patterns evolve; labels from 6 months ago may be misleading
  • Selection bias: users rarely mark missed spam; you only see what they report
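
When users disagree on a label, one common approach is to aggregate their reports into a soft label, weighting each user by how often they historically agree with manual review. A minimal sketch (the function and trust-weighting scheme are illustrative, not a specific production system):

```python
from collections import defaultdict

def aggregate_labels(reports, user_trust):
    """Combine conflicting user reports into a soft label per message.

    reports: list of (message_id, user_id, label), label 1 = spam, 0 = ham.
    user_trust: dict user_id -> weight in [0, 1], e.g. historical agreement
    with the manual review queue.
    Returns: dict message_id -> weighted spam probability.
    """
    weight_sum = defaultdict(float)
    spam_weight = defaultdict(float)
    for msg_id, user_id, label in reports:
        w = user_trust.get(user_id, 0.5)  # unknown reporters get neutral weight
        weight_sum[msg_id] += w
        spam_weight[msg_id] += w * label
    return {m: spam_weight[m] / weight_sum[m] for m in weight_sum}
```

A message marked spam by one trusted user and ham by another ends up near 0.5, which is exactly the kind of example worth routing to the manual review queue.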

Step 3: Feature Engineering

Heuristic signals (fast, interpretable, high precision for obvious spam):

  • Sender domain reputation score (queried from reputation DB)
  • SPF/DKIM/DMARC authentication pass/fail
  • Sender’s historical spam rate from this account
  • Reply-to domain != From domain
  • Number of recipients (bulk sending pattern)
  • URL count; presence of known malicious domains
  • HTML-to-text ratio (spam often has excessive HTML)
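
Several of these header-based signals fall out of standard email parsing. A minimal sketch using Python's stdlib `email` module (reputation scores and SPF/DKIM/DMARC results would come from external services in production; here we compute only what the headers themselves give us):

```python
from email import message_from_string
from email.utils import parseaddr, getaddresses

def heuristic_features(raw_email: str) -> dict:
    """Extract header-based heuristic signals from a raw RFC 822 message."""
    msg = message_from_string(raw_email)
    _, from_addr = parseaddr(msg.get("From", ""))
    _, reply_to = parseaddr(msg.get("Reply-To", ""))
    from_domain = from_addr.rsplit("@", 1)[-1].lower() if "@" in from_addr else ""
    reply_domain = reply_to.rsplit("@", 1)[-1].lower() if "@" in reply_to else ""
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return {
        # Reply-To domain differing from From domain is a classic phishing signal
        "reply_to_mismatch": bool(reply_domain) and reply_domain != from_domain,
        # Large recipient counts suggest bulk sending
        "num_recipients": len(recipients),
        "from_domain": from_domain,  # key for the reputation-DB lookup
    }
```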

Content features (for ML model):

  • TF-IDF bag-of-words on subject + body
  • Character n-grams (catch obfuscation: “V1agra”, “fr3e”)
  • Subject line features: ALL CAPS ratio, excessive punctuation, urgency words
  • Embedding from pre-trained model (BERT, or domain-fine-tuned model)
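
Word-level TF-IDF and character n-grams combine naturally in one vectorizer. A sketch with scikit-learn (the n-gram ranges are illustrative starting points, not tuned values):

```python
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF captures topical vocabulary; character 3-5-grams are
# what catch "V1agra"/"fr3e"-style obfuscation that word tokens miss.
vectorizer = FeatureUnion([
    ("words", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
])
```

`char_wb` builds n-grams only inside word boundaries, which keeps the feature space smaller than raw `char` while still covering intra-word obfuscation.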

Behavioral signals:

  • Open rate for this sender across all recipients
  • Reply rate, unsubscribe rate
  • Graph features: has the sender interacted with this recipient before?

Step 4: Model Selection

Two-stage architecture (industry standard):

Stage 1 — Rule-based pre-filter (blocks ~60-70% of obvious spam):

  • Known spam IP blocklist
  • DNS-based blocklist (DNSBL) lookup
  • SpamAssassin-style scoring rules
  • Together these handle roughly 60-70% of volume with no ML cost; very low latency
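
The DNSBL check follows a well-known convention: reverse the IPv4 octets, prepend them to the blocklist zone, and do an A-record lookup; a successful resolution means the IP is listed. A sketch (the zone name is a real public blocklist used as an example; the live lookup needs network access):

```python
import socket

def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
    """DNSBL convention: 1.2.3.4 checked against zone Z becomes
    an A-record lookup of 4.3.2.1.Z."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip: str, zone: str = "zen.spamhaus.org") -> bool:
    """A successful resolution means the IP is on the blocklist."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

In a real pre-filter this lookup is cached aggressively, since the same sending IPs recur across millions of messages.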

Stage 2 — ML classifier (for borderline cases):

| Model | Pros | Cons |
|---|---|---|
| Naive Bayes | Extremely fast, interpretable, handles high-dimensional text well | Independence assumption violated in practice |
| Logistic Regression + TF-IDF | Fast, sparse, good baseline, interpretable coefficients | Misses semantic meaning, no cross-feature interactions |
| Gradient Boosted Trees (LightGBM) | Handles mixed features (text + behavioral + metadata), fast serving | No direct text sequence modeling |
| BERT fine-tuned | Best accuracy, understands context and obfuscation | High latency (100-400ms), expensive to serve |

Recommended stack for 50ms P99:
LightGBM on TF-IDF + heuristic features for 95% of traffic. Route low-confidence predictions to a distilled BERT model. Cache sender reputation scores.
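
The confidence-based routing can be sketched as follows (the 0.2/0.8 thresholds match the serving diagram below and are illustrative; in practice they are tuned to the false-positive budget):

```python
def classify(email_features, lgbm_score, bert_score_fn, low=0.2, high=0.8):
    """Two-stage routing: trust confident LightGBM scores, escalate only
    the uncertain band to the slower distilled BERT model."""
    if lgbm_score >= high:
        return "spam"
    if lgbm_score <= low:
        return "inbox"
    # Low-confidence band (~5% of traffic): pay the BERT latency cost
    return "spam" if bert_score_fn(email_features) >= high else "inbox"
```

Because only the uncertain band hits BERT, the expensive model's latency affects a small slice of traffic and the P99 budget holds.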

Step 5: Evaluation Framework

from sklearn.metrics import roc_curve, roc_auc_score, classification_report
import numpy as np

def evaluate_spam_classifier(y_true, y_scores, false_positive_budget=0.001):
    """
    For spam classification, we typically operate at a fixed false positive rate.
    Find the threshold that maximizes recall while keeping FPR <= budget.
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)

    # fpr is non-decreasing along the curve, so the last index within
    # budget has the highest recall (TPR) at an acceptable FPR.
    valid_idx = np.where(fpr <= false_positive_budget)[0]
    best_idx = valid_idx[-1]
    optimal_threshold = thresholds[best_idx]
    y_pred = (y_scores >= optimal_threshold).astype(int)

    print(f"Operating threshold: {optimal_threshold:.4f}")
    print(f"False Positive Rate: {fpr[best_idx]:.4f}")
    print(f"True Positive Rate (Recall): {tpr[best_idx]:.4f}")
    print(f"AUC-ROC: {roc_auc_score(y_true, y_scores):.4f}")
    print(classification_report(y_true, y_pred, target_names=['Ham', 'Spam']))

    return optimal_threshold

Step 6: Serving Architecture

User sends email
       ↓
DNS/IP Blocklist Check (< 1ms)  ──→ Block immediately if in blocklist
       ↓
Reputation Service (async lookup, ~2ms)
       ↓
Feature Extraction Service
  - Parse email headers/body
  - TF-IDF vectorization
  - Behavioral feature lookup (Redis cache)
       ↓
LightGBM Inference (< 5ms)
  score < 0.2: → Inbox
  score > 0.8: → Spam folder
  0.2-0.8:     → BERT re-score (< 40ms)
       ↓
Feedback collector logs decision
       ↓
Async: update sender reputation, log for retraining

Step 7: Adversarial Robustness

Spammers adapt. Your defense:

  • Text obfuscation: “V!agra” → character n-grams catch this better than word-level features
  • Image spam: embed spam text in images — requires OCR pipeline or image classification
  • Adversarial examples: adding innocuous words to fool classifiers — monitor for distribution shift in features that change without changing semantics
  • Account hijacking: use trusted accounts to send spam — behavioral signals (sudden change in volume/recipients) are key
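
The hijacked-account signal above — a sudden change in sending volume — can be caught with a simple per-account anomaly check. A minimal sketch (production systems use robust statistics and many behavioral dimensions, not a single z-score):

```python
import statistics

def volume_anomaly(daily_counts, threshold=3.0):
    """Flag a hijacked-account pattern: today's send volume is many standard
    deviations above this account's own history.

    daily_counts: per-day send counts, oldest first, today last.
    """
    history, today = daily_counts[:-1], daily_counts[-1]
    if len(history) < 7:
        return False  # not enough history to judge
    mean = statistics.mean(history)
    sd = statistics.pstdev(history) or 1.0  # avoid div-by-zero on flat history
    return (today - mean) / sd > threshold
```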

Step 8: Continuous Learning

  • Retrain weekly on sliding window of recent labeled data + fixed sample of historical data
  • A/B test new model vs. current champion on 5% of traffic before full rollout
  • Shadow mode: run new model alongside current; compare decisions before switching
  • Monitor for false positive regression — new model must not increase legitimate email blocked rate
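
Assembling the retraining set from a sliding window plus a historical sample can be sketched as (window size and sample fraction are illustrative):

```python
import random
from datetime import timedelta

def select_training_window(examples, now, window_days=30,
                           historical_frac=0.1, seed=0):
    """Sliding-window retraining set: everything from the last `window_days`,
    plus a fixed random sample of older data so the model keeps long-lived
    spam patterns without drowning in stale labels.

    examples: list of (timestamp, features, label) tuples.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [e for e in examples if e[0] >= cutoff]
    older = [e for e in examples if e[0] < cutoff]
    # Fixed seed keeps the historical sample stable across runs
    sample = random.Random(seed).sample(older, int(len(older) * historical_frac))
    return recent + sample
```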

Depth Levels

Junior: Describe features you’d use and choose a model. Discuss precision/recall trade-off.

Senior: Design two-stage pipeline, discuss serving latency, describe retraining loop.

Staff: Handle adversarial robustness, multilingual spam, user-level personalization (different spam thresholds per user), and regulatory constraints (GDPR for behavioral signal storage).

Related ML Topics

  • NLP Interview Questions — TF-IDF, BERT fine-tuning, and tokenization trade-offs all appear in spam classifier design; BPE handles obfuscation better than word-level tokenization
  • Handling Imbalanced Datasets — spam is typically 1-5 percent of email volume; scale_pos_weight, SMOTE, and focal loss are all applicable
  • Classification Metrics — spam classifiers operate at fixed false positive rate; precision/recall at threshold and AUC-ROC are the primary evaluation metrics
  • How to Detect Model Drift in Production — spammer adaptation is a form of concept drift; prediction score distribution monitoring catches it early
  • ML System Design: Build a Fraud Detection System — spam and fraud detection share the same architecture patterns: rule-based pre-filter, ML scorer, adversarial adaptation challenges