Imbalanced datasets — where one class dramatically outnumbers another — are the norm in production ML, not the exception. Fraud detection: 0.1% fraud. Medical diagnosis: 1% positive. Churn prediction: 5% churn. Getting this wrong produces models that achieve “99% accuracy” by predicting the majority class for everything and are completely worthless.
What the Interviewer Is Testing
Can you identify when class imbalance is a problem, choose the right metric, and select the appropriate remediation strategy? Interviewers want to see that you understand the trade-offs between resampling approaches and know when to reach for each one.
Why Accuracy Fails
Start here every time. If 99% of your samples are class 0 and 1% are class 1:
A classifier that always predicts 0:

- Accuracy: 99% ✓ (looks great!)
- Precision: undefined (no positive predictions)
- Recall: 0% ✗ (catches zero fraud cases)
- F1: 0% ✗
In interviews, the moment you see imbalanced data, say: “Accuracy is not a useful metric here. I’ll use [precision/recall/F1/PR-AUC] because [reason tied to the cost of FP vs FN].”
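The failure mode above is easy to reproduce with sklearn's `DummyClassifier` (a minimal sketch on synthetic 99:1 data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Synthetic 99:1 dataset: 990 negatives, 10 positives
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 990 + [1] * 10)

# A "classifier" that always predicts the majority class
clf = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = clf.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2%}")              # 99.00%
print(f"Recall:   {recall_score(y, y_pred):.2%}")                # 0.00%
print(f"F1:       {f1_score(y, y_pred, zero_division=0):.2%}")   # 0.00%
```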
Resampling: Oversampling and Undersampling
Random Undersampling
Randomly remove majority class samples until you reach a target class ratio. Fast and simple, but it discards potentially useful information.
```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)
print(f"After undersampling: {sum(y_res==0)} negative, {sum(y_res==1)} positive")
```
Use when: you have so much majority-class data that discarding it is acceptable (millions of samples). Don’t use when your total dataset is small — you’ll lose too much signal.
Random Oversampling
Duplicate minority class samples at random. Creates exact copies — can lead to severe overfitting on the minority class.
```python
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(sampling_strategy=0.5, random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)
```
Better than nothing, but SMOTE is almost always preferred.
SMOTE (Synthetic Minority Oversampling Technique)
Instead of duplicating minority samples, SMOTE creates synthetic samples. For each minority sample, find its k nearest neighbors in the minority class, then create a new sample somewhere along the line segment between the sample and one of its neighbors.
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
print(f"Original: {sum(y_train==1)} minority samples")
print(f"After SMOTE: {sum(y_res==1)} minority samples")
```
SMOTE works in feature space, not raw sample space — it interpolates between real samples. This creates more diverse training examples and reduces overfitting compared to random duplication.
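The interpolation step can be sketched in plain NumPy (a simplified illustration of the idea, not imblearn's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))  # pretend minority-class points in feature space

def smote_like_sample(X_min, k=5):
    """Generate one synthetic point the way SMOTE does (simplified)."""
    i = rng.integers(len(X_min))
    x = X_min[i]
    # k nearest minority neighbors of x (excluding x itself)
    dists = np.linalg.norm(X_min - x, axis=1)
    neighbor_idx = np.argsort(dists)[1:k + 1]
    nn = X_min[rng.choice(neighbor_idx)]
    # new point at a random position on the segment between x and its neighbor
    lam = rng.random()
    return x + lam * (nn - x)

synthetic = np.array([smote_like_sample(minority) for _ in range(10)])
print(synthetic.shape)  # (10, 2)
```

Because each synthetic point is a convex combination of two real minority samples, it always lands between existing minority points, never outside their span.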
SMOTE pitfall: SMOTE can generate synthetic samples in regions dominated by the majority class, creating noise. Use SMOTE + Tomek Links or SMOTE + ENN (Edited Nearest Neighbors) to clean up borderline synthetic samples.
```python
from imblearn.combine import SMOTETomek

smt = SMOTETomek(sampling_strategy='auto', random_state=42)
X_res, y_res = smt.fit_resample(X_train, y_train)
```
Critical warning: Only apply resampling to the training set. Never resample validation or test sets — you want to evaluate on the real class distribution.
```python
# CORRECT: resample only the training data
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
model.fit(X_train_res, y_train_res)

# Evaluate on the original (imbalanced) X_val, y_val
score = model.score(X_val, y_val)
```
```python
# In cross-validation: resample inside each fold with imblearn's Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipe = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', RandomForestClassifier())
])
cross_val_score(pipe, X, y, cv=StratifiedKFold(5), scoring='f1')
```
Algorithm-Level Solutions
Class Weights
Most sklearn classifiers accept a `class_weight` parameter. Setting `class_weight='balanced'` automatically adjusts weights inversely proportional to class frequencies: the minority class gets a higher weight in the loss function, penalizing misclassification of rare events more heavily.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# These all support class_weight
lr = LogisticRegression(class_weight='balanced')
rf = RandomForestClassifier(class_weight='balanced')
svm = SVC(class_weight='balanced')

# Or specify manually
lr = LogisticRegression(class_weight={0: 1, 1: 99})  # 99:1 imbalance
```
Why prefer this over SMOTE for tree models: tree models with `class_weight` train on the original data distribution, with the weights folded into the split criterion. No synthetic data, no risk of SMOTE noise. Start here before trying resampling for tree-based models.
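Under the hood, the `'balanced'` heuristic computes each class weight as n_samples / (n_classes * bincount(y)); sklearn exposes the calculation directly:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 990 + [1] * 10)  # 99:1 imbalance

# 'balanced' weight per class = n_samples / (n_classes * count_of_class)
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # {0: ~0.505, 1: 50.0}
```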
Threshold Adjustment
A classifier’s default decision threshold is 0.5: predict positive if P(positive) > 0.5. For imbalanced data, this threshold is often wrong — you might want to lower it to catch more positives (higher recall) at the cost of more false alarms.
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_proba = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, y_proba)

# Find the highest threshold that still achieves recall >= 0.85.
# recall is non-increasing along the threshold axis, so take the LAST index
# meeting the target (np.argmax would return the first, i.e. the lowest threshold).
target_recall = 0.85
idx = np.where(recall >= target_recall)[0][-1]
idx = min(idx, len(thresholds) - 1)  # precision/recall have one more entry than thresholds
optimal_threshold = thresholds[idx]
print(f"Threshold: {optimal_threshold:.3f} → Precision: {precision[idx]:.3f}, Recall: {recall[idx]:.3f}")

# Apply the custom threshold
y_pred_custom = (y_proba >= optimal_threshold).astype(int)
```
Threshold tuning is often more effective than resampling for production systems — it doesn’t distort the training distribution and lets you operate at any precision-recall point on the curve.
Focal Loss
Used by Facebook’s RetinaNet for object detection (extreme imbalance: background >> objects). Focal loss down-weights easy negative examples (majority class predictions the model is already confident about) and focuses training on hard examples (minority class and ambiguous cases).
```python
import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha  # weight on the positive class
        self.gamma = gamma  # focusing strength: higher -> easy examples ignored more

    def forward(self, inputs, targets):
        bce = nn.functional.binary_cross_entropy_with_logits(
            inputs, targets, reduction='none')
        p_t = torch.exp(-bce)  # probability the model assigns to the true class
        # alpha for positives, (1 - alpha) for negatives, as in RetinaNet
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        focal_weight = alpha_t * (1 - p_t) ** self.gamma
        return (focal_weight * bce).mean()
```
Use focal loss for deep learning tasks with extreme imbalance (1:1000+). For classical ML, class weights are simpler and usually sufficient.
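The down-weighting effect is visible by computing the focal factor (1 - p_t)^gamma for a confidently-correct example versus a misclassified one (a small sketch using the same bce-to-p_t trick; the specific logits are illustrative):

```python
import torch
import torch.nn as nn

gamma = 2.0
bce = nn.functional.binary_cross_entropy_with_logits

# Easy, confidently-correct negative (logit -4, target 0): p_t close to 1
easy_loss = bce(torch.tensor([-4.0]), torch.tensor([0.0]), reduction='none')
easy_factor = float((1 - torch.exp(-easy_loss)) ** gamma)

# Hard, misclassified positive (logit -1, target 1): p_t well below 0.5
hard_loss = bce(torch.tensor([-1.0]), torch.tensor([1.0]), reduction='none')
hard_factor = float((1 - torch.exp(-hard_loss)) ** gamma)

print(f"easy example weight: {easy_factor:.5f}")  # ~0.0003: nearly ignored
print(f"hard example weight: {hard_factor:.3f}")  # ~0.53: dominates training
```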
Evaluation Under Imbalance
- Use PR-AUC, not ROC-AUC. ROC-AUC can look strong even for poor classifiers on imbalanced data, because the false positive rate divides by the huge pool of true negatives. PR-AUC focuses on the minority class.
- Report class-specific metrics. sklearn's `classification_report` shows per-class precision and recall; always include it in imbalanced evaluations.
- Use stratified CV. `StratifiedKFold` preserves the class ratio in each fold.
```python
from sklearn.metrics import classification_report, average_precision_score

print(classification_report(y_val, y_pred, target_names=['legitimate', 'fraud']))
pr_auc = average_precision_score(y_val, y_proba)
print(f"PR-AUC: {pr_auc:.3f}")
```
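The ROC-AUC vs PR-AUC gap is easy to demonstrate on synthetic imbalanced data (illustrative sketch; exact numbers depend on the generated data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# ~1% positives
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=42)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_va)[:, 1]
roc = roc_auc_score(y_va, proba)
pr = average_precision_score(y_va, proba)
print(f"ROC-AUC: {roc:.3f}  PR-AUC: {pr:.3f}")  # ROC-AUC is typically far higher
```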
Choosing Your Strategy
| Situation | Recommended approach |
|---|---|
| Tree-based model, mild imbalance (1:10) | `class_weight='balanced'` |
| Linear model, mild imbalance | `class_weight='balanced'` or SMOTE |
| Severe imbalance (1:100+), moderate dataset | SMOTE + class_weight |
| Large dataset, can afford to discard majority | Random undersampling |
| Deep learning, extreme imbalance | Focal loss |
| Need specific precision or recall target | Threshold adjustment on any model |
Common Interview Mistakes
- Applying SMOTE to the full dataset before cross-validation (leakage)
- Reporting accuracy on imbalanced data without flagging it
- Using ROC-AUC when PR-AUC is more appropriate
- Resampling the validation/test set
- Claiming SMOTE always helps — it can hurt for tree models if the synthetic samples add noise
Related ML Topics
- Classification Metrics — imbalanced datasets require PR-AUC, not accuracy or ROC-AUC
- Cross-Validation Strategies — always use StratifiedKFold and apply SMOTE inside CV folds, not outside
- Overfitting and Regularization — SMOTE can cause overfitting on synthetic minority samples; regularize accordingly
- Feature Selection — imbalanced data can bias filter-method feature importance toward majority class patterns