Classification Metrics: Precision, Recall, F1, and AUC-ROC

Classification metrics are among the most frequently misused topics in ML interviews. The wrong answer: “I use accuracy.” The right answer: it depends on the cost of each type of error. This post walks through what interviewers are actually testing and how to structure a rigorous answer.

What the Interviewer Is Testing

At the junior level: do you know the definitions? At the senior/staff level: can you pick the right metric given a business objective, reason about the cost of false positives vs false negatives, and explain multi-class extensions without prompting?

The Confusion Matrix: Start Here

Every classification metric derives from the confusion matrix. For binary classification:

                Predicted Positive    Predicted Negative
Actual Positive      TP                    FN
Actual Negative      FP                    TN
  • TP (True Positive): Model said positive, it was positive. Correct.
  • TN (True Negative): Model said negative, it was negative. Correct.
  • FP (False Positive / Type I error): Model said positive, it was negative. False alarm.
  • FN (False Negative / Type II error): Model said negative, it was positive. Miss.

Every interview question about metrics is really a question about which of FP and FN is more costly.
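To ground the four cells, here's a minimal sketch (toy labels, invented for illustration) recovering them with scikit-learn. Note that sklearn's confusion_matrix orders rows and columns by label value, so actual negatives come first — the opposite row order from the table above.

```python
# Toy labels (illustrative only): recover TP/FN/FP/TN from sklearn.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# sklearn layout: row 0 = actual negative, row 1 = actual positive,
# so .ravel() unpacks in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)  # 3 1 1 3
```

Getting the unpacking order wrong here is a classic silent bug — worth double-checking against a hand-counted example like this one.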

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

When to use: only when classes are balanced and both error types have equal cost.

When it fails: 99% of emails are legitimate. A classifier that always predicts “not spam” achieves 99% accuracy and is completely useless. Never cite accuracy for imbalanced datasets — it’s a red flag in interviews.
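A quick sketch of that failure mode with synthetic data (the 99/1 split mirrors the spam example above):

```python
# 99 legitimate emails (0), 1 spam (1); classifier always predicts "not spam".
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 99 + [1])
y_pred = np.zeros(100, dtype=int)   # always predict the majority class

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- catches zero spam
```

Accuracy of 0.99 and recall of 0.0 on the same predictions is the whole argument in two lines.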

Precision

Precision = TP / (TP + FP)

“Of all the things the model flagged as positive, how many actually were?” Precision measures the quality of positive predictions.

When FP is costly: Use precision. Spam detection: a false positive (flagging real email as spam) is a disaster — users miss important messages. Fraud detection for freezing accounts: falsely freezing a legitimate account enrages customers and has legal risk. Optimize for precision.

from sklearn.metrics import precision_score
precision_score(y_true, y_pred)  # binary
precision_score(y_true, y_pred, average='macro')  # multi-class

Recall (Sensitivity, True Positive Rate)

Recall = TP / (TP + FN)

“Of all the actual positives, how many did the model catch?” Recall measures coverage of the positive class.

When FN is costly: Use recall. Cancer screening: a false negative (missing a malignant tumor) means delayed treatment and potentially death. COVID test: missing an infected person who then spreads the virus. Fraud detection for flagging for review (not auto-blocking): you’d rather review more accounts than miss real fraud.

from sklearn.metrics import recall_score
recall_score(y_true, y_pred)

The Precision-Recall Trade-off

Precision and recall trade off against each other via the classification threshold. If you lower the threshold (predict positive more aggressively), recall goes up (you catch more true positives) while precision typically goes down (you also flag more false positives).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Find the highest threshold that still gives recall >= 0.90.
# recall from precision_recall_curve is non-increasing, so take the LAST
# index that satisfies the target (argmax would trivially return index 0).
target_recall = 0.90
idx = np.where(recall >= target_recall)[0][-1]
print(f"Threshold: {thresholds[idx]:.3f}, Precision: {precision[idx]:.3f}, Recall: {recall[idx]:.3f}")

In production, you don’t just pick a model — you pick a threshold. The Precision-Recall curve shows all possible operating points.

F1 Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The harmonic mean of precision and recall. Use F1 when you need a single number and care about both metrics roughly equally. Why harmonic mean? It penalizes extreme values — a model with 1.0 precision and 0.01 recall gets F1 = 0.02, correctly reflecting that it’s nearly useless.
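A tiny sketch of that penalty, comparing the harmonic mean against what an arithmetic mean would report:

```python
# F1 as harmonic mean: collapses toward the weaker of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Arithmetic mean of (1.0, 0.01) would be 0.505 -- misleadingly optimistic.
print(f1(1.0, 0.01))   # ~0.0198: correctly near-useless
print(f1(0.8, 0.8))    # 0.8: a balanced model keeps its score
```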

Fbeta: When you want to weight recall more than precision: F2 score (beta=2) gives recall twice the weight. When precision matters more: F0.5.

from sklearn.metrics import fbeta_score
# Beta=2: weights recall twice as heavily
fbeta_score(y_true, y_pred, beta=2)

AUC-ROC

The ROC (Receiver Operating Characteristic) curve plots True Positive Rate (recall) vs False Positive Rate at every threshold:

FPR = FP / (FP + TN)  # "false alarm rate"
TPR = TP / (TP + FN)  # recall

from sklearn.metrics import roc_auc_score, roc_curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
print(f"AUC: {auc:.3f}")

AUC interpretation: the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example by the model. AUC = 0.5 means random guessing. AUC = 1.0 means perfect ranking. A model with AUC = 0.85 ranks positives above negatives 85% of the time.
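That ranking interpretation can be checked directly: over all (positive, negative) pairs, count how often the positive example gets the higher score (ties counting half), and compare with roc_auc_score. Toy scores below, invented for illustration:

```python
# Verify AUC = fraction of (pos, neg) pairs where the positive outranks the negative.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]
# 1 point if the positive scores higher, 0.5 for a tie, 0 otherwise
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
manual_auc = np.mean(pairs)

print(manual_auc, roc_auc_score(y_true, scores))  # both 8/9 ~= 0.889
```

The pairwise count and sklearn's trapezoidal ROC integration agree exactly — a useful sanity check to mention in an interview.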

AUC vs accuracy: AUC is threshold-independent — it measures ranking quality across all thresholds. This makes it much more useful for imbalanced datasets and for comparing models when you don’t know the operating threshold yet.

Precision-Recall AUC vs ROC AUC

For heavily imbalanced datasets (1% positive rate), PR-AUC is more informative than ROC-AUC. ROC-AUC can look great (0.95) even when PR-AUC is poor (0.30) because ROC includes TN in its calculation, and TN is abundant when negatives dominate. PR curve only looks at the positive class.

from sklearn.metrics import average_precision_score
pr_auc = average_precision_score(y_true, y_scores)

Rule of thumb: use ROC-AUC when classes are roughly balanced, PR-AUC when the positive class is rare.
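A synthetic illustration of the gap, assuming Gaussian scores for each class (the exact numbers depend on the seed, but the ordering holds):

```python
# 1% positive rate: ROC-AUC looks strong while PR-AUC stays modest.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_neg, n_pos = 9900, 100
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg),   # negatives
                         rng.normal(2.0, 1.0, n_pos)])  # positives, shifted up
y_true = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])

print(f"ROC-AUC: {roc_auc_score(y_true, scores):.2f}")        # ~0.92
print(f"PR-AUC:  {average_precision_score(y_true, scores):.2f}")  # much lower
```

Same scores, same labels — the abundance of true negatives flatters the ROC view while the PR view stays honest about the positive class.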

Multi-class Extension

For K-class classification, compute per-class metrics then aggregate:

  • Macro average: Compute metric for each class, take unweighted mean. Treats all classes equally regardless of support. Use when all classes matter equally.
  • Micro average: Aggregate all TPs, FPs, FNs across classes, then compute metric. Weights by class frequency — dominated by large classes. Use when overall system performance matters.
  • Weighted average: Weight each class's metric by its support (number of true instances). Reflects class imbalance in the aggregate — but for that same reason it can mask poor performance on rare classes.
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred, target_names=['cat', 'dog', 'bird']))
# Shows per-class precision/recall/F1 plus macro and weighted averages
# (for single-label data, micro avg is reported as the "accuracy" row)
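To see how the averages diverge, here's a toy 3-class example (labels invented for illustration) where one class dominates and the two minority classes are each half-missed — micro-F1, which equals accuracy for single-label data, stays high while macro-F1 exposes the weak classes:

```python
# One dominant class (0) plus two minority classes (1, 2) that each lose
# half their instances to class 0.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 2, 0]

print(f1_score(y_true, y_pred, average='micro'))  # 0.80 -- tracks the big class
print(f1_score(y_true, y_pred, average='macro'))  # ~0.73 -- minority F1 of 2/3 drags it down
```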

Choosing the Right Metric: Decision Framework

Ask these questions in order:

  1. Are classes balanced? If no: eliminate accuracy immediately.
  2. What’s costlier: FP or FN?
    • FP costly (false alarm bad) → optimize precision
    • FN costly (miss bad) → optimize recall
    • Both matter equally → F1 or AUC
  3. Do you know the operating threshold? If yes: use precision/recall/F1 at that threshold. If no: use AUC.
  4. Is the positive class rare? Use PR-AUC over ROC-AUC.

Worked Examples

Spam filter: Precision. Missing spam (FN) is annoying; flagging real email (FP) is unacceptable. Operate at very high precision (99%+), accept lower recall.

Cancer screening: Recall. Missing cancer (FN) is fatal; extra biopsies (FP) are costly but manageable. Operate at very high recall (95%+), accept lower precision.

Fraud detection (auto-block): Precision — don’t block legitimate users. Fraud detection (flag for review): Recall — catch as much fraud as possible for human review.

Search ranking: Mean Average Precision (MAP) or NDCG — ranking metrics, not classification metrics. “Are the most relevant results at the top?”

Self-driving car: pedestrian detection: Recall, overwhelmingly. Missing a pedestrian (FN) causes injury; a false brake (FP) is uncomfortable but safe.

Common Interview Mistakes

  • Using accuracy on imbalanced datasets without flagging the issue
  • Confusing FPR (the ROC x-axis, FP / (FP + TN)) with the false-positive fraction of predictions (FP / (TP + FP), i.e. 1 − precision) — different denominators, different meanings
  • Not mentioning that threshold selection is separate from metric selection
  • Saying “use F1 always” without considering the cost asymmetry between FP and FN
  • Forgetting that AUC measures ranking, not calibration — a model can have high AUC and badly miscalibrated probabilities

Related ML Topics

See also: Cross-Validation Strategies — how to get reliable estimates of the metrics you’ve chosen, and avoid the data leakage that makes CV scores meaningless.

See also: Handling Imbalanced Datasets — SMOTE, class weights, threshold adjustment, and focal loss, with the right metrics to evaluate each approach.

See also: How to Choose Between ML Models — choosing the metric comes before choosing the model; this post connects both decisions.

See also: ML System Design: Build a Spam Classifier — operating at fixed false positive rate budget; ML System Design: Build a Fraud Detection System — chargeback rate as a lagging precision metric.

See also: AI Ethics and Fairness — fairness metrics (equalized odds, demographic parity) extend precision/recall to compare performance across demographic groups.
