Imbalanced datasets — where one class dramatically outnumbers another — are the norm in production ML, not the exception. Fraud detection: 0.1% fraud. Medical diagnosis: 1% positive. Churn prediction: 5% churn. Getting this wrong produces models that achieve “99% accuracy” by predicting the majority class for everything and are completely worthless.
What the Interviewer Is Testing
Can you identify when class imbalance is a problem, choose the right metric, and select the appropriate remediation strategy? Interviewers want to see that you understand the trade-offs between resampling approaches and know when to reach for each one.
Why Accuracy Fails
Start here every time. If 99% of your samples are class 0 and 1% are class 1:
A classifier that always predicts 0:

- Accuracy: 99% ✓ (looks great!)
- Precision: undefined (no positive predictions)
- Recall: 0% ✗ (catches zero fraud cases)
- F1: 0% ✗
In interviews, the moment you see imbalanced data, say: “Accuracy is not a useful metric here. I’ll use [precision/recall/F1/PR-AUC] because [reason tied to the cost of FP vs FN].”
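The failure mode above is easy to reproduce with sklearn's `DummyClassifier` (a minimal sketch on synthetic 99:1 data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Synthetic 99:1 dataset: 990 negatives, 10 positives
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 990 + [1] * 10)

# A "classifier" that always predicts the majority class
clf = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = clf.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2%}")              # 99.00%
print(f"Recall:   {recall_score(y, y_pred):.2%}")                # 0.00%
print(f"F1:       {f1_score(y, y_pred, zero_division=0):.2%}")   # 0.00%
```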
Resampling: Oversampling and Undersampling
Random Undersampling
Randomly remove majority class samples until you reach a target class ratio. Fast and simple, but it discards potentially useful information.
```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)
print(f"After undersampling: {sum(y_res==0)} negative, {sum(y_res==1)} positive")
```
Use when: you have so much majority-class data that discarding it is acceptable (millions of samples). Don’t use when your total dataset is small — you’ll lose too much signal.
Random Oversampling
Duplicate minority class samples at random. Creates exact copies — can lead to severe overfitting on the minority class.
```python
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(sampling_strategy=0.5, random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)
```
Better than nothing, but SMOTE is almost always preferred.
SMOTE (Synthetic Minority Oversampling Technique)
Instead of duplicating minority samples, SMOTE creates synthetic samples. For each minority sample, find its k nearest neighbors in the minority class, then create a new sample somewhere along the line segment between the sample and one of its neighbors.
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
print(f"Original: {sum(y_train==1)} minority samples")
print(f"After SMOTE: {sum(y_res==1)} minority samples")
```
SMOTE works in feature space, not raw sample space — it interpolates between real samples. This creates more diverse training examples and reduces overfitting compared to random duplication.
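The interpolation step can be sketched in plain NumPy (a simplified illustration of the idea, not imblearn's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))  # pretend minority-class points in feature space

def smote_like_sample(X_min, k=5):
    """Generate one synthetic point the way SMOTE does (simplified)."""
    i = rng.integers(len(X_min))
    x = X_min[i]
    # k nearest minority neighbors of x (excluding x itself)
    dists = np.linalg.norm(X_min - x, axis=1)
    neighbor_idx = np.argsort(dists)[1:k + 1]
    nn = X_min[rng.choice(neighbor_idx)]
    # new point at a random position on the segment between x and its neighbor
    lam = rng.random()
    return x + lam * (nn - x)

synthetic = np.array([smote_like_sample(minority) for _ in range(10)])
print(synthetic.shape)  # (10, 2)
```

Because each synthetic point is a convex combination of two real minority samples, it always lands between existing minority points, never outside their span.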
SMOTE pitfall: SMOTE can generate synthetic samples in regions dominated by the majority class, creating noise. Use SMOTE + Tomek Links or SMOTE + ENN (Edited Nearest Neighbors) to clean up borderline synthetic samples.
```python
from imblearn.combine import SMOTETomek

smt = SMOTETomek(sampling_strategy='auto', random_state=42)
X_res, y_res = smt.fit_resample(X_train, y_train)
```
Critical warning: Only apply resampling to the training set. Never resample validation or test sets — you want to evaluate on the real class distribution.
```python
# CORRECT: resample only the training data
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
model.fit(X_train_res, y_train_res)

# Evaluate on the original (imbalanced) X_val, y_val
score = model.score(X_val, y_val)
```
```python
# In cross-validation: resample inside each fold with imblearn's Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipe = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', RandomForestClassifier())
])
cross_val_score(pipe, X, y, cv=StratifiedKFold(5), scoring='f1')
```
Algorithm-Level Solutions
Class Weights
Most sklearn classifiers accept a `class_weight` parameter. Setting `class_weight='balanced'` automatically adjusts weights inversely proportional to class frequencies: the minority class gets a higher weight in the loss function, penalizing misclassification of rare events more heavily.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# These all support class_weight
lr = LogisticRegression(class_weight='balanced')
rf = RandomForestClassifier(class_weight='balanced')
svm = SVC(class_weight='balanced')

# Or specify manually
lr = LogisticRegression(class_weight={0: 1, 1: 99})  # 99:1 imbalance
```
Why prefer this over SMOTE for tree models: tree models with `class_weight` train on the original data distribution, with the weights folded into the split criterion. No synthetic data, no risk of SMOTE noise. Start here before trying resampling for tree-based models.
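Under the hood, the `'balanced'` heuristic computes each class weight as n_samples / (n_classes * bincount(y)); sklearn exposes the calculation directly:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 990 + [1] * 10)  # 99:1 imbalance

# 'balanced' weight per class = n_samples / (n_classes * count_of_class)
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # {0: ~0.505, 1: 50.0}
```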
Threshold Adjustment
A classifier’s default decision threshold is 0.5: predict positive if P(positive) > 0.5. For imbalanced data, this threshold is often wrong — you might want to lower it to catch more positives (higher recall) at the cost of more false alarms.
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_proba = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, y_proba)

# Find the highest threshold that still achieves recall >= 0.85.
# recall is non-increasing along the threshold axis, so take the LAST index
# meeting the target (np.argmax would return the first, i.e. the lowest threshold).
target_recall = 0.85
idx = np.where(recall >= target_recall)[0][-1]
idx = min(idx, len(thresholds) - 1)  # precision/recall have one more entry than thresholds
optimal_threshold = thresholds[idx]
print(f"Threshold: {optimal_threshold:.3f} → Precision: {precision[idx]:.3f}, Recall: {recall[idx]:.3f}")

# Apply the custom threshold
y_pred_custom = (y_proba >= optimal_threshold).astype(int)
```
Threshold tuning is often more effective than resampling for production systems — it doesn’t distort the training distribution and lets you operate at any precision-recall point on the curve.
Focal Loss
Used by Facebook’s RetinaNet for object detection (extreme imbalance: background >> objects). Focal loss down-weights easy negative examples (majority class predictions the model is already confident about) and focuses training on hard examples (minority class and ambiguous cases).
```python
import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha  # weight on the positive class
        self.gamma = gamma  # focusing strength: higher -> easy examples ignored more

    def forward(self, inputs, targets):
        bce = nn.functional.binary_cross_entropy_with_logits(
            inputs, targets, reduction='none')
        p_t = torch.exp(-bce)  # probability the model assigns to the true class
        # alpha for positives, (1 - alpha) for negatives, as in RetinaNet
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        focal_weight = alpha_t * (1 - p_t) ** self.gamma
        return (focal_weight * bce).mean()
```
Use focal loss for deep learning tasks with extreme imbalance (1:1000+). For classical ML, class weights are simpler and usually sufficient.
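The down-weighting effect is visible by computing the focal factor (1 - p_t)^gamma for a confidently-correct example versus a misclassified one (a small sketch using the same bce-to-p_t trick; the specific logits are illustrative):

```python
import torch
import torch.nn as nn

gamma = 2.0
bce = nn.functional.binary_cross_entropy_with_logits

# Easy, confidently-correct negative (logit -4, target 0): p_t close to 1
easy_loss = bce(torch.tensor([-4.0]), torch.tensor([0.0]), reduction='none')
easy_factor = float((1 - torch.exp(-easy_loss)) ** gamma)

# Hard, misclassified positive (logit -1, target 1): p_t well below 0.5
hard_loss = bce(torch.tensor([-1.0]), torch.tensor([1.0]), reduction='none')
hard_factor = float((1 - torch.exp(-hard_loss)) ** gamma)

print(f"easy example weight: {easy_factor:.5f}")  # ~0.0003: nearly ignored
print(f"hard example weight: {hard_factor:.3f}")  # ~0.53: dominates training
```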
Evaluation Under Imbalance
- Use PR-AUC, not ROC-AUC. ROC-AUC can look strong even for poor classifiers on imbalanced data, because the false positive rate divides by the huge pool of true negatives. PR-AUC focuses on the minority class.
- Report class-specific metrics. sklearn's `classification_report` shows per-class precision and recall; always include it in imbalanced evaluations.
- Use stratified CV. `StratifiedKFold` preserves the class ratio in each fold.
```python
from sklearn.metrics import classification_report, average_precision_score

print(classification_report(y_val, y_pred, target_names=['legitimate', 'fraud']))
pr_auc = average_precision_score(y_val, y_proba)
print(f"PR-AUC: {pr_auc:.3f}")
```
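The ROC-AUC vs PR-AUC gap is easy to demonstrate on synthetic imbalanced data (illustrative sketch; exact numbers depend on the generated data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# ~1% positives
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=42)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_va)[:, 1]
roc = roc_auc_score(y_va, proba)
pr = average_precision_score(y_va, proba)
print(f"ROC-AUC: {roc:.3f}  PR-AUC: {pr:.3f}")  # ROC-AUC is typically far higher
```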
Choosing Your Strategy
| Situation | Recommended approach |
|---|---|
| Tree-based model, mild imbalance (1:10) | `class_weight='balanced'` |
| Linear model, mild imbalance | `class_weight='balanced'` or SMOTE |
| Severe imbalance (1:100+), moderate dataset | SMOTE + class_weight |
| Large dataset, can afford to discard majority | Random undersampling |
| Deep learning, extreme imbalance | Focal loss |
| Need specific precision or recall target | Threshold adjustment on any model |
Common Interview Mistakes
- Applying SMOTE to the full dataset before cross-validation (leakage)
- Reporting accuracy on imbalanced data without flagging it
- Using ROC-AUC when PR-AUC is more appropriate
- Resampling the validation/test set
- Claiming SMOTE always helps — it can hurt for tree models if the synthetic samples add noise
Related ML Topics
- Classification Metrics — imbalanced datasets require PR-AUC, not accuracy or ROC-AUC
- Cross-Validation Strategies — always use StratifiedKFold and apply SMOTE inside CV folds, not outside
- Overfitting and Regularization — SMOTE can cause overfitting on synthetic minority samples; regularize accordingly
- Feature Selection — imbalanced data can bias filter-method feature importance toward majority class patterns