System Design Interview: Fraud Detection System

Payment fraud costs an estimated $32 billion annually. Every payment processor, bank, and marketplace needs a fraud detection system that identifies fraudulent transactions in real time (sub-100ms) without blocking too many legitimate transactions. This is a common system design question at Stripe, PayPal, Square, Robinhood, Coinbase, and other fintech companies.

Requirements

Functional: Score each transaction for fraud risk in real time. Block high-risk transactions. Apply step-up authentication (2FA challenge) for medium-risk transactions. Log all decisions and features for model retraining. Provide tooling to investigate flagged transactions and to manage disputes and chargebacks.

Non-functional: Scoring latency <100ms (must not delay payment authorization). False positive rate <0.1% (blocking legitimate transactions costs the business). False negative rate minimized (each fraudulent transaction costs money). Scale: 100K transactions per second at peak.

Multi-Layer Defense

No single technique catches all fraud. Production fraud detection uses multiple layers:

  1. Hard rules (blocklist): known fraud cards, IPs on blocklist, sanctioned countries. Applied in <1ms via Redis SET lookups. Block rate: 1-3%.
  2. Velocity rules: limit transactions per user/card/device/IP in a sliding window. “More than 5 transactions in 10 minutes from the same device” is flagged. Applied in <5ms with Redis counters.
  3. ML model score: gradient boosted tree (LightGBM) or neural network scores each transaction on 200+ features. Latency: 20-50ms.
  4. Graph analysis: detect fraud rings by analyzing connections between accounts, devices, and payment methods. Run periodically (not real-time) on historical data.
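The velocity rules in layer 2 can be sketched as a sliding-window counter. This is a minimal in-memory version using a deque per key; in production the same logic maps to Redis counters or sorted sets, and the class name, limit, and window here are illustrative assumptions, not a specific library API.

```python
import time
from collections import defaultdict, deque

class VelocityChecker:
    """Sliding-window velocity rule: flag a key (device, card, or IP)
    that exceeds `limit` events within `window_seconds`. In-memory
    sketch of what Redis counters do in production."""

    def __init__(self, limit=5, window_seconds=600):
        self.limit = limit
        self.window = window_seconds
        self.events = defaultdict(deque)

    def record_and_check(self, key, now=None):
        """Record one event; return True if the key is over the limit."""
        now = time.time() if now is None else now
        q = self.events[key]
        q.append(now)
        # Evict timestamps that fell out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q) > self.limit

checker = VelocityChecker(limit=5, window_seconds=600)
flags = [checker.record_and_check("device-123", now=t) for t in range(6)]
# The 6th event inside the 10-minute window trips the rule.
```

The deque eviction is what makes the window "sliding" rather than a fixed bucket: old events age out continuously instead of resetting all at once.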

Feature Engineering

ML models for fraud detection use three categories of features:

Transaction features: amount, merchant category code (MCC), currency, payment method type, is_international, time_of_day, day_of_week

User history features: avg_transaction_amount_7d, max_transaction_amount_30d, num_transactions_1h, num_unique_merchants_7d, days_since_account_creation, previous_chargebacks_count

Context features: is_new_device, device_fingerprint_age, ip_country_matches_billing_country, shipping_address_risk_score, vpn_detected, velocity_on_this_card_1h

Computing aggregated features (7-day averages, 1-hour counts) in real time requires a feature store. Architecture: raw events are written to Kafka → a stream processing job (Flink or Spark Streaming) computes rolling aggregates → aggregates stored in Redis (fast read, <5ms) with TTL per window. On scoring, the fraud service fetches features from Redis in a single batch (pipeline GET for all feature keys).
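The batch feature fetch can be sketched as follows, with a plain dict standing in for Redis. In production, `store` would be a Redis client and the lookup loop a single pipeline of GETs (one round trip for all feature keys); the key schema and feature names here are illustrative assumptions.

```python
# Feature names assumed for illustration; they mirror the user-history
# features listed above.
FEATURE_KEYS = [
    "avg_transaction_amount_7d",
    "num_transactions_1h",
    "num_unique_merchants_7d",
]

def fetch_features(store, user_id):
    """Fetch all per-user aggregates in one batch; missing keys default to 0.
    With a real Redis client, the loop becomes pipe.get(k) per key followed
    by a single pipe.execute()."""
    keys = [f"feat:{user_id}:{name}" for name in FEATURE_KEYS]
    values = [store.get(k) for k in keys]
    return {name: (v if v is not None else 0.0)
            for name, v in zip(FEATURE_KEYS, values)}

store = {"feat:u42:avg_transaction_amount_7d": 57.3,
         "feat:u42:num_transactions_1h": 2}
features = fetch_features(store, "u42")
```

Defaulting missing aggregates to 0 matters: a brand-new user has no 7-day history, and the model must still receive a complete feature vector.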

Real-Time Scoring Pipeline

class FraudScoringService:
    def score(self, transaction: Transaction) -> FraudDecision:
        # Layer 1: blocklist check (< 1ms)
        if self.blocklist.contains(transaction.card_id):
            return FraudDecision(action=BLOCK, reason="BLOCKLIST")

        # Layer 2: velocity rules (<5ms)
        if self.velocity.count(transaction.device_id, window="10m") > 10:
            return FraudDecision(action=BLOCK, reason="VELOCITY")

        # Layer 3: fetch features from feature store (< 10ms)
        features = self.feature_store.fetch(transaction)

        # Layer 4: ML model scoring (20-50ms)
        score = self.model.predict(features)
        if score > 0.95:
            return FraudDecision(action=BLOCK, score=score)
        elif score > 0.7:
            return FraudDecision(action=CHALLENGE, score=score)
        else:
            return FraudDecision(action=ALLOW, score=score)

Model Training and Feedback Loop

The training dataset requires labels: which transactions were actually fraudulent. Labels come from two sources: (1) customer disputes and chargebacks (confirmed fraud, but delayed by 30-90 days after the transaction); (2) internal investigations and rule-based labels (faster but less accurate).

Class imbalance: fraud is rare — 0.1% of transactions. Techniques: oversampling (SMOTE), undersampling the majority class, or class-weighted loss functions. LightGBM handles class imbalance well with scale_pos_weight parameter.
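A minimal sketch of the class-weighting approach: compute the negative-to-positive ratio from the label counts and pass it as scale_pos_weight. The params dict mirrors what would go to lightgbm.train (not imported here); objective and metric values are the standard ones for binary classification.

```python
def scale_pos_weight(labels):
    """Ratio of negatives to positives; up-weights the minority (fraud) class."""
    pos = sum(labels)
    neg = len(labels) - pos
    return neg / pos

# 0.1% fraud rate: 999 legitimate transactions per fraudulent one.
labels = [1] * 1 + [0] * 999
params = {
    "objective": "binary",
    "metric": "auc",
    "scale_pos_weight": scale_pos_weight(labels),  # 999.0
}
```

With this weight, each missed fraud example contributes as much to the loss as 999 missed legitimate ones, so the model cannot win by predicting "not fraud" everywhere.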

Model retraining pipeline: daily batch retrain using the past 90 days of labeled transactions. Shadow mode: new model scores every transaction in parallel with the production model — compare decisions without affecting customers. Champion/challenger: gradually route 10% of traffic to the new model, monitor false positive and false negative rates, promote if better.
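Shadow mode can be sketched as a thin wrapper around scoring: both models run, only the champion's decision is served, and disagreements are logged for offline comparison. The `.predict(txn) -> float` interface and the 0.95 block threshold are assumptions carried over from the pipeline above.

```python
import logging

def score_with_shadow(txn, champion, challenger,
                      log=logging.getLogger("shadow")):
    """Serve the champion's score; run the challenger in shadow.
    Disagreements on the block decision are logged, never acted on."""
    champion_score = champion.predict(txn)
    challenger_score = challenger.predict(txn)
    if (champion_score > 0.95) != (challenger_score > 0.95):
        log.info("shadow disagreement: champion=%.3f challenger=%.3f",
                 champion_score, challenger_score)
    return champion_score  # only the champion affects the customer
```

The key property is that the challenger's score never reaches the decision path, so a bad candidate model cannot block a single customer while it is being evaluated.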

Graph Fraud Detection

Fraud rings use multiple coordinated accounts. Individual account-level models miss this pattern. Graph analysis reveals it: build a graph where nodes are accounts, cards, devices, and IPs. Edges connect accounts that shared a device, card, or IP. Fraud rings appear as dense clusters.

Algorithms: community detection (Louvain, Label Propagation) finds clusters. Nodes in the same cluster as known fraud accounts are flagged for review. Graph features (degree, betweenness centrality, cluster membership) are extracted and fed into the ML model as additional features. Graph computation runs nightly on the full transaction graph using Spark GraphX or Apache TinkerPop.
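As a crude stand-in for community detection, the "accounts linked through shared devices form one cluster" idea can be sketched with union-find connected components. Real pipelines run Louvain or Label Propagation on Spark; this self-contained version, with illustrative account and device names, shows only the clustering logic.

```python
from collections import defaultdict

def fraud_clusters(account_devices):
    """Group accounts into clusters via shared devices (union-find).
    account_devices maps account id -> set of device fingerprints."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    device_owner = {}
    for account, devices in account_devices.items():
        find(account)  # register isolated accounts too
        for d in devices:
            if d in device_owner:
                union(account, device_owner[d])
            else:
                device_owner[d] = account

    clusters = defaultdict(set)
    for account in account_devices:
        clusters[find(account)].add(account)
    return list(clusters.values())

rings = fraud_clusters({
    "acct1": {"devA"}, "acct2": {"devA", "devB"},
    "acct3": {"devB"}, "acct4": {"devC"},
})
# acct1-3 are linked transitively through shared devices; acct4 stands alone.
```

If any account in a cluster is known fraud, the whole cluster is queued for review: that is the signal account-level models miss.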

Explainability

Regulators (GDPR, FCRA) require that automated decisions be explainable. When a transaction is blocked, the customer must be told why (in general terms). SHAP (SHapley Additive exPlanations) values decompose the model score into per-feature contributions: “This transaction was flagged because: unusual amount (+0.3), new device (+0.2), international IP (+0.15), late night (+0.1).” SHAP is computed in <5ms for tree-based models, making it feasible in the real-time scoring path.
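Turning SHAP values into the customer-facing reason string can be sketched as below. The per-feature contributions are assumed to come from a tree explainer upstream (e.g. shap.TreeExplainer, not invoked here); feature names and values are illustrative.

```python
def explain(shap_values, top_k=4):
    """Format per-feature SHAP contributions into a reason string,
    keeping only the top risk-increasing (positive) contributors."""
    top = sorted(((v, name) for name, v in shap_values.items() if v > 0),
                 reverse=True)[:top_k]
    parts = [f"{name} (+{v:.2f})" for v, name in top]
    return "Flagged because: " + ", ".join(parts)

msg = explain({"unusual_amount": 0.30, "new_device": 0.20,
               "international_ip": 0.15, "late_night": 0.10,
               "known_merchant": -0.05})
```

Negative contributions (features that lowered the risk score) are dropped, since the explanation owed to the customer is why the transaction was flagged, not why it nearly wasn't.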

Frequently Asked Questions

How do you build a real-time fraud detection system that is both fast and accurate?

A production fraud detection system uses multiple layers with increasing computational cost and accuracy. Layer 1 (sub-1ms): Redis-based blocklist and allowlist lookups for known fraud cards, IPs, and devices. Layer 2 (1-5ms): velocity checks using Redis sliding window counters — "more than 5 transactions in 10 minutes from this device." Layer 3 (20-50ms): ML model scoring using pre-computed features from a feature store (Redis caches 7-day averages, 1-hour counts computed by a Flink stream processor). The model (LightGBM or a neural network) produces a fraud probability score. Decisions: score > 0.95 → block; 0.7-0.95 → challenge (send 2FA); < 0.7 → allow. The total pipeline runs in under 100ms, meeting payment authorization latency requirements. Accuracy is maintained by retraining the model daily on labeled data (chargebacks and disputes provide ground truth, with 30-90 day delay).

How do you handle the class imbalance problem in fraud detection?

Fraud is rare — typically 0.01%-0.1% of transactions. Training a model on raw data where 99.9% of examples are "not fraud" produces a model that predicts "not fraud" for everything and achieves 99.9% accuracy while catching zero fraud. Several techniques address this: (1) Class weighting: assign higher loss weight to fraud examples (LightGBM scale_pos_weight parameter). The model penalizes missing fraud more than missing legitimate transactions. (2) Oversampling: SMOTE (Synthetic Minority Oversampling Technique) creates synthetic fraud examples by interpolating between existing fraud examples in feature space. (3) Undersampling: randomly remove majority class examples to balance the training set. Loses information but reduces training time. (4) Threshold tuning: instead of using 0.5 as the decision threshold, tune it to the desired precision/recall tradeoff on a validation set. Lowering the threshold catches more fraud (higher recall) at the cost of more false positives (lower precision). The right threshold depends on the business cost of each type of error.
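The threshold-tuning tradeoff in point (4) can be made concrete with a small sweep. This is a self-contained sketch with made-up scores and labels; on real data the same computation comes from a validation set (sklearn's precision_recall_curve does the full sweep).

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall of the rule 'block if score > threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.99, 0.96, 0.80, 0.60, 0.40, 0.10]
labels = [1,    1,    0,    1,    0,    0]
# Lowering the threshold raises recall (more fraud caught) and lowers
# precision (more legitimate transactions blocked).
high = precision_recall_at(scores, labels, 0.95)  # (1.0, 2/3)
low = precision_recall_at(scores, labels, 0.5)    # (0.75, 1.0)
```

The business picks the operating point: if a blocked legitimate transaction costs more than a missed fraud, the threshold moves up, and vice versa.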

What is a feature store and why is it important for ML systems?

A feature store is a centralized repository for ML features that solves two problems: (1) Training-serving skew: if features are computed differently during training (batch job) and serving (real-time), the model sees different distributions and performs worse in production than on validation data. A feature store computes features consistently using the same logic for both training and serving. (2) Feature reuse: multiple ML models (fraud detection, recommendation, credit scoring) often need the same features (user transaction history, account age). A feature store computes each feature once and makes it available to all models, avoiding redundant computation. Architecture: raw events are written to Kafka; a stream processor (Flink) computes real-time aggregates and writes to an online store (Redis, DynamoDB) for low-latency serving; a batch pipeline computes historical features and writes to an offline store (BigQuery, S3/Parquet) for training. The feature store API provides point-in-time correct historical features — crucial to avoid leakage (using future data to predict the past during training).

