AI/ML Interview: Model Evaluation Metrics — Precision, Recall, F1, AUC-ROC, Confusion Matrix, Cross-Validation

Choosing the right evaluation metric is as important as choosing the right model. Using accuracy on an imbalanced fraud detection dataset (0.1% fraud) yields 99.9% accuracy by always predicting “not fraud” — useless. This guide covers every evaluation metric you need for ML interviews, when to use each, and the common pitfalls that trip up candidates.

Confusion Matrix and Basic Metrics

The confusion matrix for binary classification has four cells: True Positive (TP: predicted positive, actually positive), False Positive (FP: predicted positive, actually negative; a Type I error), False Negative (FN: predicted negative, actually positive; a Type II error), and True Negative (TN: predicted negative, actually negative). From these:

Accuracy = (TP + TN) / (TP + FP + FN + TN). Misleading with imbalanced classes.

Precision = TP / (TP + FP). “Of all predictions labeled positive, how many were actually positive?” High precision means few false alarms. Critical for spam detection (marking a legitimate email as spam is costly) and search results (irrelevant results frustrate users).

Recall (Sensitivity) = TP / (TP + FN). “Of all actual positives, how many did we catch?” High recall means few missed positives. Critical for disease detection (missing a cancer diagnosis is dangerous) and fraud detection (missing a fraudulent transaction is costly).

F1 Score = 2 * (Precision * Recall) / (Precision + Recall). The harmonic mean of precision and recall; it balances both and is low if either one is low. Use it when you need a single metric that accounts for both false positives and false negatives.

The precision-recall tradeoff: raising the classification threshold typically increases precision (fewer false positives) but decreases recall (more false negatives). The optimal threshold depends on the business cost of each error type.
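The formulas above can be sketched from scratch so each definition is visible; this is a minimal illustration (the function names `confusion_counts` and `binary_metrics` are my own, not from any library), assuming binary labels with 1 as the positive class.

```python
# Confusion-matrix metrics from scratch for binary classification.
# Assumes labels are 0/1 with 1 = positive.

def confusion_counts(y_true, y_pred):
    """Return the four confusion-matrix cells (TP, FP, FN, TN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy data: accuracy looks decent while recall exposes the missed positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```

Note how precision is 1.0 on this toy data (no false alarms) while recall is only 0.5 (half the positives were missed); a single accuracy number would hide that asymmetry.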

AUC-ROC and AUC-PR

ROC Curve: plots the True Positive Rate (Recall) against the False Positive Rate (FP / (FP + TN)) at various classification thresholds. A perfect classifier has AUC-ROC = 1.0 (the curve reaches the top-left corner); a random classifier has AUC-ROC = 0.5 (the diagonal line). AUC-ROC measures the model’s ability to distinguish between classes across all thresholds. It is threshold-independent, which is useful when you have not yet chosen a threshold.

Limitation: AUC-ROC can be misleading on highly imbalanced datasets. Because negatives dominate, the false positive rate stays tiny even when the model produces many false positives relative to the number of true positives, so the ROC curve can look excellent while precision on the positive class is poor.

Precision-Recall Curve: plots Precision against Recall at various thresholds; AUC-PR is the area under this curve. It is better for imbalanced datasets because it focuses on positive-class performance (true negatives never enter the computation). A high AUC-PR means the model achieves high precision AND high recall on the minority class.

When to use which: balanced classes, AUC-ROC; imbalanced classes (fraud, disease, rare events), AUC-PR. In interviews, always mention AUC-PR for imbalanced problems. Many candidates default to AUC-ROC without considering class imbalance, which signals shallow understanding.
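Both areas can be estimated without a library. The sketch below is illustrative (the helper names are my own): AUC-ROC via its rank interpretation, the probability that a random positive scores above a random negative, and AUC-PR via average precision, the mean of precision at each positive’s rank.

```python
# AUC-ROC via the Mann-Whitney rank formulation, and average precision
# (a common AUC-PR estimate), both computed from raw scores.

def auc_roc(y_true, scores):
    """P(random positive outranks random negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of precision measured at each positive item's rank."""
    ranked = sorted(zip(scores, y_true), key=lambda x: -x[0])
    hits, total = 0, 0.0
    for rank, (_, t) in enumerate(ranked, start=1):
        if t == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy scores: only 2 positives among 8 examples.
y_true = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.2, 0.1, 0.05]
print(auc_roc(y_true, scores), average_precision(y_true, scores))
```

Because true negatives never appear in `average_precision`, it drops faster than AUC-ROC when positives are ranked below negatives, which is exactly the sensitivity you want on imbalanced data.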

Regression Metrics

For regression (predicting continuous values):

MAE (Mean Absolute Error) = mean(|y_pred − y_true|). Reads as: “on average, the prediction is off by X units.” Robust to outliers because it does not square the error.

MSE (Mean Squared Error) = mean((y_pred − y_true)^2). Penalizes large errors more heavily because of the squared term; the standard loss function for regression.

RMSE = sqrt(MSE). Same units as the target variable, so more interpretable than MSE.

R-squared (R^2) = 1 − (SS_res / SS_total). The proportion of variance explained by the model. R^2 = 1.0: perfect predictions. R^2 = 0.0: the model is no better than predicting the mean. R^2 < 0: the model is worse than predicting the mean (a bad model).

MAPE (Mean Absolute Percentage Error) = mean(|y_pred − y_true| / |y_true|) * 100. A percentage error, so scale-independent. Problems: undefined when y_true = 0 and misleading when values are near zero.

When to use each: MAE for an interpretable “average error” that is robust to outliers; MSE/RMSE for standard regression evaluation where large errors should be penalized; R^2 for “how much variance does the model explain?”; MAPE for business reporting, where percentage errors are intuitive to stakeholders.
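The regression formulas above are simple enough to compute in a few lines; this is a minimal sketch (the function name `regression_metrics` is my own) that assumes no target value is zero, since MAPE is undefined there.

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, R^2, and MAPE for paired true/predicted values."""
    n = len(y_true)
    errors = [p - t for p, t in zip(y_pred, y_true)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # R^2: 1 - residual sum of squares / total sum of squares.
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_total = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_total
    # MAPE: assumes no y_true is 0 (undefined otherwise).
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n * 100
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "mape": mape}

print(regression_metrics([100, 200, 300, 400], [110, 190, 310, 380]))
```

On this toy data the single 20-unit miss dominates MSE (which squares it) far more than MAE, which is the tradeoff the section describes.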

Cross-Validation

A single train/test split may give unreliable estimates (the test set might be unusually easy or hard). Cross-validation provides a more robust evaluation.

K-Fold CV: split the data into K folds (typically K = 5 or K = 10). For each fold i, train on all folds except i and evaluate on fold i, then average the K evaluation scores. This uses all data for both training and evaluation: every example is in a test set exactly once.

Stratified K-Fold: for classification, ensure each fold has the same class distribution as the full dataset. Critical for imbalanced classes; without stratification, some folds may contain zero positive examples.

Leave-One-Out (LOO): K = N, so each example is its own fold. Maximum use of training data but computationally expensive (N training runs). Used when the dataset is very small.

Time-Series CV: for temporal data, standard K-Fold violates the temporal order (it trains on future data to predict the past, which is data leakage). Use expanding-window or sliding-window CV: train on data up to time T, evaluate on T+1 through T+K, shift forward, and repeat.

When to use: K-Fold for general evaluation, model comparison, and hyperparameter tuning; Stratified K-Fold for imbalanced classification; Time-Series CV for any data with temporal ordering (stock prices, user behavior, sensor data). In interviews, always mention stratified K-Fold for imbalanced data and time-series CV for temporal data; these show awareness of common pitfalls.
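The expanding-window procedure can be sketched as a small index generator; this is an illustration under my own naming (`expanding_window_splits`, `initial_train`, `horizon` are assumptions, not a library API), assuming observations are already in temporal order.

```python
# Expanding-window time-series CV: each split trains on everything up to
# time t and evaluates on the next `horizon` observations, then shifts
# forward. Training data only ever grows, never includes the future.

def expanding_window_splits(n_samples, initial_train, horizon):
    """Yield (train_indices, test_indices) pairs in temporal order."""
    t = initial_train
    while t + horizon <= n_samples:
        train_idx = list(range(0, t))
        test_idx = list(range(t, t + horizon))
        yield train_idx, test_idx
        t += horizon

for train, test in expanding_window_splits(10, initial_train=4, horizon=2):
    print(f"train 0..{train[-1]}  test {test[0]}..{test[-1]}")
```

A sliding-window variant would drop the oldest training points as t advances (replace `range(0, t)` with `range(t - window, t)`); either way, every training index precedes every test index, which is the leakage guarantee standard K-Fold lacks.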

Ranking and Recommendation Metrics

For search and recommendation systems:

NDCG (Normalized Discounted Cumulative Gain): measures ranking quality while accounting for position; higher-ranked relevant items contribute more to the score. NDCG@K evaluates only the top K results. Range [0, 1]. The standard metric for search engines and recommendation systems.

MAP (Mean Average Precision): the average of precision computed at each relevant item’s position. Rewards models that rank all relevant items high.

MRR (Mean Reciprocal Rank): 1 / (position of the first correct result), averaged across queries. Measures how quickly the first relevant result appears.

Recall@K: the fraction of relevant items that appear in the top K. “Of all items the user would like, what fraction did we show in the top 10?”

Hit Rate@K: did at least one relevant item appear in the top K? Binary (yes/no).

For recommendations, track NDCG@10 (are the best items ranked highest?), Recall@20 (did we retrieve most relevant items in the candidate set?), and business metrics (click-through rate, conversion rate, watch time). Offline metrics (NDCG, Recall) evaluate the model on historical data; online metrics (CTR, conversion) evaluate it in production via A/B testing. A model with better offline metrics may not improve online metrics, because offline evaluation does not capture changes in user behavior. Always validate with A/B tests.
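NDCG and MRR are short enough to compute directly. The sketch below is illustrative (function names are my own) and uses a common NDCG convention: each graded relevance is discounted by log2(rank + 1), then normalized by the ideal ordering’s score.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: rel / log2(rank + 1), ranks from 1."""
    return sum(rel / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0

def mrr(ranked_hits):
    """Mean reciprocal rank over queries; each query is a 0/1 hit list
    in ranked order, scored by 1 / rank of the first hit (0 if none)."""
    total = 0.0
    for hits in ranked_hits:
        for rank, h in enumerate(hits, start=1):
            if h:
                total += 1.0 / rank
                break
    return total / len(ranked_hits)

# A ranking that swaps two lower-relevance items scores just below 1.0.
print(ndcg_at_k([3, 2, 0, 1], k=4))
print(mrr([[0, 1, 0], [1, 0, 0]]))
```

The log discount is what makes NDCG position-aware: demoting a relevant item from rank 1 to rank 4 costs far more than demoting it from rank 8 to rank 11.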
