Spam Detection System Low-Level Design: Velocity Checks, Graph-Based Detection, and Classifier Ensemble

Spam Detection System Overview

Spam detection combines fast rule-based velocity checks, text classification, and graph-based account analysis. No single signal is sufficient — spammers adapt to any single detector. The system scores submissions using an ensemble of signals and enforces in real time at the point of posting or messaging.

Spam Signal Categories

  • Velocity signals: Too many posts or direct messages within a short time window.
  • Content signals: Repeated identical messages, known spam URLs, promotional patterns.
  • Behavioral signals: Account created recently, no profile photo, mass follow/unfollow cycles.
  • Graph signals: Connected to known spam accounts in the follow or messaging graph.

Rule-Based Velocity Checks

Velocity rules execute in under 1ms using Redis counters over short time windows:

  • Posts in the last hour per account.
  • DMs sent today per account.
  • New follows in the last 10 minutes.

When a counter exceeds a threshold, the action is immediately rejected or rate-limited. Redis INCR with TTL-based expiry implements fixed-window counters efficiently; a true sliding window can be implemented with a sorted set of event timestamps at somewhat higher cost. These rules catch the most obvious bot behavior before any ML inference runs.
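The counter pattern can be sketched in pure Python. This is an in-memory stand-in for the Redis INCR + EXPIRE pattern, not production code; the key names and the 3-posts-per-hour limit are illustrative.

```python
import time

class WindowCounter:
    """In-memory stand-in for the Redis INCR + EXPIRE pattern: each
    (key, window) pair gets a counter that resets when its window expires."""

    def __init__(self):
        self._buckets = {}  # key -> (window_start, count)

    def incr_and_check(self, key, limit, window_seconds, now=None):
        """Increment the counter for `key`; return True if still under `limit`."""
        now = time.time() if now is None else now
        start, count = self._buckets.get(key, (now, 0))
        if now - start >= window_seconds:  # window expired -> reset (EXPIRE analogue)
            start, count = now, 0
        count += 1
        self._buckets[key] = (start, count)
        return count <= limit

limiter = WindowCounter()
# Allow at most 3 posts per hour for this account.
results = [limiter.incr_and_check("posts:acct_42", limit=3, window_seconds=3600,
                                  now=1000.0 + i)
           for i in range(5)]
# First three pass, the next two are rejected.
```

In Redis the same logic is a single INCR followed by EXPIRE on first increment, which keeps the check under 1ms.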

Text Classifier

A BERT-based classifier fine-tuned on labeled spam vs legitimate content identifies non-obvious spam:

  • Detects spam URLs not yet in blocklists.
  • Identifies obfuscated spam text (character substitution, spacing tricks).
  • Trained on historical labeled data augmented with user reports.

The model outputs a spam probability score. Inference runs in ~10ms on GPU, or ~50ms on CPU with quantized model.
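The classifier itself is a fine-tuned model, but one preprocessing idea from the list above — undoing character substitution and spacing tricks before inference — can be shown as a toy sketch. The substitution table and examples are illustrative, not an exhaustive normalizer.

```python
import re

# Toy normalization table for common character substitutions ("leetspeak").
SUBSTITUTIONS = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                               "5": "s", "@": "a", "$": "s"})

def normalize(text: str) -> str:
    """Undo simple obfuscation tricks before classification: map substituted
    characters back, then collapse runs of single letters separated by
    spaces (e.g. 'c l i c k' -> 'click')."""
    text = text.lower().translate(SUBSTITUTIONS)
    text = re.sub(r"\b(?:[a-z] ){2,}[a-z]\b",
                  lambda m: m.group(0).replace(" ", ""), text)
    return text

normalize("FR33 V1AGR@")    # -> "free viagra"
normalize("c l i c k here") # -> "click here"
```

Normalized text is then fed to the model (or matched against blocklisted phrases), so trivial obfuscation no longer evades detection.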

Graph-Based Detection

Spam accounts do not operate in isolation — they follow each other, message each other, and are created by the same operators. The account graph exposes this:

  • Build a graph: nodes are accounts, edges are follow/message relationships.
  • Apply community detection (Louvain algorithm) to identify tightly connected clusters.
  • If any node in a cluster is a confirmed spam account, flag the entire cluster for review.
  • Graph analysis runs as an offline batch job (hourly or daily), feeding a risk score into the serving feature store.
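The cluster-flagging step above can be sketched as follows. Connected components via union-find stand in for Louvain community detection to keep the sketch self-contained; the account names are hypothetical.

```python
from collections import defaultdict

def find_components(edges):
    """Group accounts into connected components (a simple stand-in here
    for Louvain-style community detection on the follow/message graph)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for node in parent:
        clusters[find(node)].add(node)
    return list(clusters.values())

def flag_clusters(edges, confirmed_spam):
    """Flag every account in any cluster containing a confirmed spammer."""
    flagged = set()
    for cluster in find_components(edges):
        if cluster & confirmed_spam:
            flagged |= cluster
    return flagged

edges = [("a", "b"), ("b", "c"), ("x", "y")]
flag_clusters(edges, confirmed_spam={"b"})  # flags the whole {a, b, c} cluster
```

In the batch job, the flagged set becomes a per-account graph risk score written to the feature store for serving-time lookup.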

Ensemble Scoring

Signals are combined into a single spam probability score:

spam_score = w1 * velocity_score
           + w2 * text_classifier_score
           + w3 * graph_risk_score
           + w4 * account_age_signal

Weights are calibrated on labeled data. The combined score is compared against enforcement thresholds: auto-reject, queue for review, or allow.
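The scoring formula and threshold tiers above can be sketched directly. The weights and thresholds here are illustrative placeholders, not calibrated values.

```python
def ensemble_score(velocity, text, graph_risk, account_age,
                   weights=(0.3, 0.4, 0.2, 0.1)):
    """Weighted linear combination of normalized [0, 1] signal scores.
    Weights are illustrative; in practice they are calibrated on labeled data."""
    signals = (velocity, text, graph_risk, account_age)
    return sum(w * s for w, s in zip(weights, signals))

def enforcement_action(score, reject_at=0.9, review_at=0.6):
    """Map the combined score to one of the three enforcement tiers."""
    if score >= reject_at:
        return "auto_reject"
    if score >= review_at:
        return "queue_for_review"
    return "allow"

score = ensemble_score(velocity=0.9, text=0.95, graph_risk=0.8, account_age=1.0)
enforcement_action(score)  # -> "auto_reject" for this high-risk example
```

A linear combination keeps the score interpretable per signal; a logistic-regression meta-learner is a common drop-in replacement once enough labeled data exists.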

Real-Time Enforcement

The spam score is computed synchronously at post/message submission time:

  1. Velocity check (Redis, <1ms).
  2. Text classifier inference (~10-50ms).
  3. Graph risk score lookup from feature store (Redis, <2ms).
  4. Ensemble score computed.
  5. Action: reject with error, silently drop (shadow ban variant), or allow.
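The five steps above can be sketched as one synchronous function. The stage functions are injected as callables so the sketch stays self-contained; in production they would call Redis, the model server, and the feature store. Weights and thresholds are illustrative.

```python
def score_submission(account_id, text, *, velocity_ok, classify,
                     graph_risk, account_age_signal):
    """Synchronous enforcement pipeline run at post/message submission time."""
    # 1. Velocity gate: cheapest check first, short-circuits before ML inference.
    if not velocity_ok(account_id):
        return "reject"
    # 2-4. Score the remaining signals and combine (illustrative weights).
    spam_score = (0.4 * classify(text)
                  + 0.3 * graph_risk(account_id)
                  + 0.3 * account_age_signal(account_id))
    # 5. Threshold into an action; "shadow_drop" accepts the call but never delivers.
    if spam_score >= 0.9:
        return "reject"
    if spam_score >= 0.7:
        return "shadow_drop"
    return "allow"

action = score_submission("acct_9", "hello",
                          velocity_ok=lambda a: True,
                          classify=lambda t: 0.1,
                          graph_risk=lambda a: 0.0,
                          account_age_signal=lambda a: 0.2)
# Low-risk stub signals -> "allow"
```

Ordering matters: the sub-millisecond velocity gate rejects obvious bots before the 10-50ms classifier call is ever made.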

Feedback Loop from User Reports

User reports are the primary source of new labeled training data:

  • Reported content is reviewed and labeled as spam or not spam.
  • Labels feed into weekly or monthly classifier retraining.
  • Rapid-response pipeline: new spam URL domains can be added to blocklist within minutes of first report without model retrain.
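The rapid-response path can be sketched as a mutable domain blocklist checked inline. The regex and domain names here are illustrative; a production system would use a proper URL parser and a shared store such as Redis.

```python
import re

URL_RE = re.compile(r"https?://([a-z0-9.-]+)", re.IGNORECASE)

class DomainBlocklist:
    """Mutable blocklist supporting minute-level updates: adding a domain
    takes effect on the next message with no model retrain."""
    def __init__(self):
        self._domains = set()
    def add(self, domain):
        self._domains.add(domain.lower())
    def is_spam(self, message):
        return any(host.lower() in self._domains
                   for host in URL_RE.findall(message))

bl = DomainBlocklist()
bl.is_spam("win big at http://spam-casino.example")  # False: not yet reported
bl.add("spam-casino.example")                        # first user report comes in
bl.is_spam("win big at http://spam-casino.example")  # True: blocked within minutes
```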

IP and Device Fingerprint Signals

Multiple accounts originating from the same IP or device fingerprint are a strong spam signal:

  • IP-to-account mapping stored in Redis: if one account on an IP is confirmed spam, other accounts from the same IP receive elevated risk scores.
  • Device fingerprint (browser, OS, screen resolution, timezone) used similarly for web-based spam.
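The IP-to-account mapping can be sketched as a small index. The 0.5 risk bump and the account/IP values are illustrative; the same structure applies to device fingerprints.

```python
from collections import defaultdict

class IPRiskIndex:
    """Maps IPs to the accounts seen on them; confirming one account as spam
    elevates the risk score of every sibling account on the same IP."""
    def __init__(self):
        self.ip_accounts = defaultdict(set)
        self.confirmed_spam = set()
    def record_login(self, ip, account):
        self.ip_accounts[ip].add(account)
    def confirm_spam(self, account):
        self.confirmed_spam.add(account)
    def risk(self, ip, account):
        """0.0 baseline; 0.5 elevated when a sibling on this IP is confirmed spam."""
        siblings = self.ip_accounts[ip] - {account}
        return 0.5 if siblings & self.confirmed_spam else 0.0

idx = IPRiskIndex()
idx.record_login("203.0.113.7", "bot_1")
idx.record_login("203.0.113.7", "new_user")
idx.confirm_spam("bot_1")
idx.risk("203.0.113.7", "new_user")  # -> 0.5: shares an IP with confirmed spam
```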

Hash-Based Deduplication

Identical or near-identical spam messages are detected before classifier inference:

  • Exact match: SHA-256 hash of message content, checked against a bloom filter of known spam hashes.
  • Near-duplicate: SimHash (locality-sensitive hashing) detects messages that differ only in minor character substitutions, whitespace, or URLs.
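A minimal SimHash can be written with only the standard library. This sketch uses whitespace tokens and SHA-256 for per-token hashing; the Bloom filter for exact-match lookup is omitted to keep the example short, and the sample messages are made up.

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash over whitespace tokens: each token votes on every bit,
    so similar texts produce fingerprints with small Hamming distance."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.sha256(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

spam = "buy cheap meds now click here"
near = "buy cheap meds now click there"  # one-token variation
hamming(simhash(spam), simhash(spam))  # 0 for identical text
hamming(simhash(spam), simhash(near))  # small distance for near-duplicates
```

Fingerprints below a tuned Hamming-distance threshold are treated as duplicates of known spam and rejected before classifier inference.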

False Positive Mitigation

Legitimate users must not be incorrectly blocked. Risk-adjustment factors:

  • Verified accounts (phone-verified, payment method on file) receive a lower base spam score.
  • Accounts with long positive history get more benefit of the doubt before enforcement.
  • Appeals process available for incorrectly actioned accounts, with rapid human review SLA.
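The risk-adjustment factors above can be expressed as score discounts. The multipliers and thresholds here are purely illustrative stand-ins for calibrated values.

```python
def adjusted_score(base_score, *, phone_verified, account_age_days, strikes=0):
    """Discount the raw spam score for trust signals (illustrative factors)."""
    score = base_score
    if phone_verified:
        score *= 0.7  # verified accounts get a lower base spam score
    if account_age_days > 365 and strikes == 0:
        score *= 0.8  # long clean history earns benefit of the doubt
    return min(score, 1.0)

adjusted_score(0.8, phone_verified=True, account_age_days=1000)  # -> 0.448
```

The discounted score then passes through the same enforcement thresholds, so trusted accounts need stronger evidence before being actioned.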

Frequently Asked Questions

How do you implement velocity checks in a spam detection system, and what data structures support them?

Velocity checks count how many times an entity (user, IP, device fingerprint, email domain) performs an action within a sliding time window. The canonical implementation uses a Redis sorted set where each event is stored as a member with its timestamp as the score. To check velocity, you ZRANGEBYSCORE the last N seconds and count results; to prune, you remove entries older than the window. For high-throughput scenarios, a Count-Min Sketch provides approximate counts with configurable error bounds using O(1) space per entity. Typical thresholds: more than 50 messages/hour from a new account, more than 10 failed login attempts/minute from an IP, or more than 100 email sends/day from a domain. Thresholds are tiered by account age and historical reputation to reduce false positives on legitimate power users.

How does graph-based spam detection work, and what graph algorithms are most effective?

Graph-based detection models the social or communication network as a graph where nodes are accounts and edges represent interactions (messages, follows, transactions). Spam rings appear as dense subgraphs with high internal connectivity and low external connectivity. Effective algorithms include: Label Propagation, which iterates spam labels from known bad nodes through the graph and is scalable to billions of edges; Louvain or METIS for community detection to surface suspicious clusters; and Belief Propagation on factor graphs, which is used in production at Facebook and models the joint probability of spam across connected accounts. Features derived from graph structure, such as in/out degree ratio, clustering coefficient, and betweenness centrality, are fed as features to a downstream classifier. The graph is typically stored in a distributed graph engine (Apache Giraph, GraphX, or a specialized system like LinkedIn's Expander).

How do you design a classifier ensemble for spam detection that balances precision and recall?

A classifier ensemble combines multiple models: a fast rule-based filter, a gradient-boosted tree (XGBoost or LightGBM) on hand-crafted features, and a neural text classifier (fine-tuned BERT or FastText), using a stacking or voting approach. The rule-based layer handles obvious patterns (known bad domains, blocklisted phrases) with near-zero latency. The GBDT model scores on behavioral and account features. The neural model analyzes message content for semantic spam patterns. A meta-learner (logistic regression) combines scores from all three. Threshold tuning is critical: spam detection typically operates at high recall (> 95%) with precision as a secondary constraint, since false negatives (missed spam) cause more user harm than false positives (blocked legitimate messages). Calibrated probability outputs enable threshold adjustment without retraining. The ensemble is retrained weekly on fresh labeled data to adapt to evolving spam tactics.

How do you architect a spam detection system to process millions of messages per second with low latency?

The architecture uses a tiered, asynchronous pipeline. A synchronous inline check (< 10ms) runs velocity checks via Redis and a fast rule engine before the message is accepted. Accepted messages are published to a Kafka topic and processed by a Flink streaming job that computes richer features (graph signals, content embeddings) and scores the message with the full ensemble model within 1-2 seconds. If the score exceeds a threshold, the message is flagged and a compensating action is triggered (suppress delivery, queue for human review). Model serving uses batched inference with GPU acceleration to handle throughput spikes. Feature stores are sharded by user ID and replicated for read availability. The system is designed for eventual consistency: some spam may be delivered before detection, but downstream enforcement (retroactive removal, account suspension) corrects this within the SLA window.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering

See also: Twitter/X Interview Guide 2026: Timeline Algorithms, Real-Time Search, and Content at Scale

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
