Bias-Variance Tradeoff: Underfitting, Overfitting, and How to Fix Both

The bias-variance tradeoff is one of the first concepts you’ll be asked about in any machine learning interview. It underpins model selection, regularization, and the entire discipline of avoiding overfit models. If you can’t explain it clearly and then diagnose a given model’s problem, you’ll struggle in ML-focused interviews at any level.

Strategy

Start with intuition, then layer in the math. The interviewer wants to know you understand what’s happening in the model, not just that you’ve memorized the terms. Connect it immediately to practical actions: “the model has high bias, so I’d…”

The Core Intuition

Imagine you’re trying to predict house prices. You train a model on your training data.

A model that’s too simple (e.g., predict the mean price for every house) makes systematic errors — it misses the real patterns. It performs badly on training data and test data. This is underfitting, and it’s caused by high bias.

A model that’s too complex (e.g., a decision tree that memorizes every training example) fits the training data perfectly but falls apart on new data — it learned the noise, not the signal. This is overfitting, and it’s caused by high variance.

The tradeoff: As you increase model complexity, bias decreases and variance increases. There’s a sweet spot where the combination — total error — is minimized. Finding that sweet spot is what model selection and regularization are about.
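The house-price intuition can be sketched with a toy numpy example — synthetic quadratic data, with the polynomial degrees and noise level chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 20)
y_train = x_train**2 + rng.normal(0, 0.1, size=20)  # true signal is x^2 plus noise
x_test = np.linspace(-1, 1, 200)
y_test = x_test**2                                   # noiseless truth for evaluation

def mses(degree):
    """Train and test MSE of a polynomial fit of the given degree."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

tr1, te1 = mses(1)     # underfit: a line can't represent x^2 (high bias)
tr2, te2 = mses(2)     # matches the true function class
tr10, te10 = mses(10)  # overfit: enough capacity to chase the noise (high variance)
```

The degree-1 model is bad on both train and test; the degree-10 model is excellent on train but worse than degree 2 on test — exactly the two failure modes above.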

The Math (Decomposition of Expected Error)

For a given test point, the expected prediction error of a model can be decomposed as:

Expected Error = Bias² + Variance + Irreducible Noise

where:
  Bias²     = (avg prediction - true value)²  → systematic error
  Variance  = spread of predictions across different training sets
  Noise     = inherent randomness in the data (can't be reduced)

Bias measures how far off the model’s average prediction is from the truth. A high-bias model makes consistent, systematic mistakes regardless of which training data you use.

Variance measures how much the model’s predictions change when you retrain it on different samples of data. A high-variance model is very sensitive to the specific training set — small changes in data lead to large changes in the model.
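Both quantities can be estimated numerically by retraining on many simulated training sets and looking at the predictions at one test point — a sketch under assumed ground truth (`sin(2x)`) and arbitrary noise level:

```python
import numpy as np

rng = np.random.default_rng(42)
true_f = lambda x: np.sin(2 * x)   # assumed ground-truth function
x0, noise_sd = 0.5, 0.3            # fixed test point and noise level (illustrative)

def predictions(degree, n_sets=500, n=30):
    """Predictions at x0 from models retrained on n_sets fresh training sets."""
    preds = []
    for _ in range(n_sets):
        x = rng.uniform(-1, 1, n)
        y = true_f(x) + rng.normal(0, noise_sd, n)
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    return np.array(preds)

p_simple = predictions(degree=1)     # rigid model
p_flexible = predictions(degree=10)  # flexible model

bias_sq_simple = (p_simple.mean() - true_f(x0)) ** 2    # systematic error
bias_sq_flexible = (p_flexible.mean() - true_f(x0)) ** 2
var_simple = p_simple.var()          # spread across training sets
var_flexible = p_flexible.var()
```

The rigid model shows the larger bias² term; the flexible model shows the larger variance term — the decomposition in action.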

Diagnosing the Problem

In an interview, you’ll often be shown train/validation error numbers and asked to diagnose:

Case 1: High Bias (Underfitting)
  Training error:   15%
  Validation error: 16%
  → Both errors are high. Model is too simple.

Case 2: High Variance (Overfitting)
  Training error:   1%
  Validation error: 18%
  → Large gap between train and validation. Model memorized training data.

Case 3: Good Fit
  Training error:   3%
  Validation error: 5%
  → Low error, small gap. Model generalizes well.

Case 4: High Bias AND High Variance
  Training error:   12%
  Validation error: 20%
  → Both high, large gap. Worst case — wrong model family.

The gap between training and validation error is the key signal for variance. The absolute level of training error is the key signal for bias.
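The four cases above reduce to two checks, which can be written as a toy diagnostic (the 10% and 5% thresholds here are arbitrary illustrations, not universal constants):

```python
def diagnose(train_err, val_err, high=0.10, gap=0.05):
    """Toy train/validation diagnostic; thresholds are illustrative only."""
    bias = train_err > high                   # absolute train error signals bias
    variance = (val_err - train_err) > gap    # train/val gap signals variance
    if bias and variance:
        return "high bias and high variance"
    if bias:
        return "high bias (underfitting)"
    if variance:
        return "high variance (overfitting)"
    return "good fit"
```

Running the four cases above through it: `diagnose(0.15, 0.16)`, `diagnose(0.01, 0.18)`, `diagnose(0.03, 0.05)`, and `diagnose(0.12, 0.20)` reproduce the four diagnoses.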

Fixing High Bias (Underfitting)

The model is too simple. You need more capacity:

  • Increase model complexity: More layers in a neural network, higher polynomial degree, deeper decision tree.
  • Add features: Engineer new features that capture the signal the model is missing.
  • Reduce regularization: If you have L1/L2 penalty, reduce the λ coefficient — you’re penalizing complexity too aggressively.
  • Train longer: If using gradient descent, the model may not have converged yet.
  • Try a different model family: If linear regression isn’t capturing a nonlinear relationship, try a tree-based model or neural network.
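The “add features” fix can be shown in a few lines: a linear model underfits data with a quadratic signal, and engineering an x² feature closes the gap (synthetic data; coefficients and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 200)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 0.2, 200)  # quadratic signal

def train_mse(features):
    """OLS training MSE with an intercept plus the given feature columns."""
    X = np.column_stack([np.ones_like(x)] + features)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ beta - y) ** 2)

mse_linear = train_mse([x])           # misses the x^2 term: high bias
mse_quadratic = train_mse([x, x**2])  # capacity now matches the signal
```

With the x² feature, training error drops to roughly the noise floor — the bias was in the feature set, not the learning algorithm.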

Fixing High Variance (Overfitting)

The model is memorizing training data. You need to constrain it:

  • Get more training data: The most reliable fix. More data forces the model to learn patterns, not noise. Adding 10× data often outperforms any algorithmic regularization.
  • Reduce model complexity: Fewer layers, shallower tree, lower polynomial degree.
  • Regularization:
    • L2 (Ridge): Adds a penalty proportional to the sum of squared weights (λ Σ w²). Shrinks all weights toward zero but doesn’t eliminate them. Good default.
    • L1 (Lasso): Adds a penalty proportional to the sum of absolute weights (λ Σ |w|). Produces sparse solutions — many weights go exactly to zero. Good for feature selection.
    • Elastic Net: Combines L1 and L2.
    • Dropout: For neural networks — randomly zero out neurons during training, forcing redundant representations.
  • Early stopping: Stop training when validation error starts increasing, even if training error is still decreasing.
  • Data augmentation: Artificially expand training data (flip images, add noise) to make the model more robust.
  • Ensemble methods: Average predictions from many models (bagging) to reduce variance. Random Forests reduce variance vs. a single decision tree by averaging many trees trained on data subsamples.
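The L2 item can be made concrete with closed-form ridge regression on an over-flexible polynomial basis — a minimal sketch where the degree, noise, and λ = 1 are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(-1, 1, 25)
y = np.sin(3 * x) + rng.normal(0, 0.2, 25)
X = np.vander(x, 9, increasing=True)  # degree-8 polynomial features: over-flexible

def ridge(lmbda):
    """Closed-form ridge solution: (X^T X + lambda I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + lmbda * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge(0.0)    # unregularized least squares
w_ridge = ridge(1.0)  # L2 penalty shrinks the weights

mse_ols = np.mean((X @ w_ols - y) ** 2)
mse_ridge = np.mean((X @ w_ridge - y) ** 2)
```

The ridge solution always has a smaller weight norm, at the cost of slightly worse *training* fit — it trades a little bias for a variance reduction.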

The Complexity Curve

Error
  ↑
  │ V                              V = Validation error
  │  V                             T = Training error
  │   V                     V
  │ T  VV                 VV
  │  T   VVV           VVV
  │   TT    VVVVVVVVVVV
  │     TT       ↑
  │       TTT    Sweet spot
  │          TTTTTTTTTTTTTTTT
  └──────────────────────────────→ Model Complexity

As complexity increases: training error monotonically decreases. Validation error decreases then increases (overfitting). The optimal complexity minimizes validation error.
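This curve can be reproduced by sweeping polynomial degree on a train/holdout split and picking the degree that minimizes holdout error — a sketch with synthetic `sin(3x)` data and an arbitrary 40/20 split:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.15, 60)
x_tr, y_tr = x[:40], y[:40]    # training split
x_val, y_val = x[40:], y[40:]  # holdout split

def errs(degree):
    """Train and holdout MSE for a polynomial of the given degree."""
    c = np.polyfit(x_tr, y_tr, degree)
    tr = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
    va = np.mean((np.polyval(c, x_val) - y_val) ** 2)
    return tr, va

degrees = list(range(1, 11))
train_err, val_err = zip(*(errs(d) for d in degrees))
best = degrees[int(np.argmin(val_err))]  # the "sweet spot" degree
```

Training error only falls as degree grows; the holdout curve is the one with a minimum, and `best` is the complexity you would actually select.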

Practical Interview Application

Q: Your model has 99% training accuracy and 65% validation accuracy. What do you do?

This is high variance. In order: (1) collect more data, (2) add dropout or L2 regularization, (3) reduce model size, (4) add early stopping, (5) try ensemble methods.

Q: Your model has 65% training accuracy and 64% validation accuracy. What do you do?

This is high bias. The model isn’t learning from the data. (1) Increase model capacity, (2) engineer better features, (3) reduce regularization, (4) try a more powerful model family.

Q: How does the size of the training set affect bias and variance?

More data reduces variance (the model can’t overfit noise when there’s a lot of signal). More data has diminishing returns on bias — if the model family can’t represent the true relationship, more data won’t fix it.
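Both halves of that answer can be checked with a toy learning-curve experiment (synthetic `sin(3x)` data; sample sizes and degrees are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

def errors(n, degree):
    """Train/validation MSE for a degree-`degree` polynomial fit on n samples."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.2, n)
    xv = rng.uniform(-1, 1, 2000)
    yv = np.sin(3 * xv) + rng.normal(0, 0.2, 2000)
    c = np.polyfit(x, y, degree)
    tr = np.mean((np.polyval(c, x) - y) ** 2)
    va = np.mean((np.polyval(c, xv) - yv) ** 2)
    return tr, va

# Flexible model: growing n shrinks the train/validation gap (variance down)
tr_small, va_small = errors(30, degree=10)
tr_large, va_large = errors(3000, degree=10)

# Misspecified linear model: even 3000 samples leave training error high (bias)
tr_lin, va_lin = errors(3000, degree=1)
```

The flexible model’s gap collapses with more data, while the linear model’s training error plateaus well above the noise floor — more data fixed the variance, not the bias.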

Q: Why does L2 regularization reduce overfitting?

L2 penalizes large weights. Large weights mean the model is very sensitive to specific input values — high variance. By keeping weights small, L2 forces the model to rely on many features slightly rather than a few features strongly, which generalizes better.
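The sensitivity claim is easy to see numerically: two linear models with the same inputs, where the large-weight one amplifies a tiny input perturbation far more (all numbers here are made up for illustration):

```python
import numpy as np

w_small = np.array([0.5, 0.5, 0.5, 0.5])      # relies on many features slightly
w_large = np.array([10.0, -9.0, 8.5, -9.5])   # large, mutually cancelling weights
x = np.array([1.0, 1.0, 1.0, 1.0])
dx = np.array([0.01, 0.02, -0.01, 0.015])     # tiny input perturbation

# How much does each model's prediction move under the same perturbation?
shift_small = abs(w_small @ (x + dx) - w_small @ x)
shift_large = abs(w_large @ (x + dx) - w_large @ x)
```

The small-weight model’s output barely moves; the large-weight model swings — that swing is exactly the sensitivity to data that L2 suppresses.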

Ensemble Methods and the Tradeoff

Ensemble methods exploit the bias-variance tradeoff explicitly:

  • Bagging (Bootstrap Aggregating): Train many high-variance, low-bias models (e.g., deep decision trees) on random data subsamples and average their predictions. Reduces variance without increasing bias. Used by Random Forests.
  • Boosting: Train many high-bias, low-variance models sequentially, each correcting the errors of the previous. Reduces bias. Used by XGBoost, AdaBoost, LightGBM.
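Why averaging helps can be shown with a stylized simulation: if each base model’s prediction error is independent noise with variance σ², the average of B of them has variance σ²/B. Real trees are correlated, so the reduction is smaller, but the mechanism is this:

```python
import numpy as np

rng = np.random.default_rng(0)
B = 25  # ensemble size (arbitrary)
# Stylized model: each base model predicts target + N(0, 1) error,
# errors independent across models; 10,000 repeated "experiments".
single = rng.normal(0.0, 1.0, size=10_000)
ensemble = rng.normal(0.0, 1.0, size=(10_000, B)).mean(axis=1)

var_single = single.var()
var_ensemble = ensemble.var()  # ~ 1/B of var_single for independent errors
```

Crucially, averaging leaves the mean prediction (and hence the bias) untouched — which is why bagging wants low-bias, high-variance base learners.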

Summary

Bias measures systematic error — the model is consistently wrong in the same direction. Variance measures sensitivity to training data — the model behaves differently on different training sets. Underfitting (high bias) is diagnosed by high training error; fix with more complexity or better features. Overfitting (high variance) is diagnosed by a large train/validation error gap; fix with more data, regularization, or simpler models. The optimal model sits at the minimum of the validation error curve — complex enough to capture the signal, constrained enough to ignore the noise.

See also: Overfitting and Regularization — the practical techniques (L1, L2, dropout, early stopping) for applying this framework to real models.

See also: Classification Metrics — precision and recall let you see whether your model’s errors are FPs or FNs, making bias-variance analysis actionable.

See also: Cross-Validation Strategies — CV curves (train vs val score across k folds) are a practical tool for diagnosing whether you have high bias or high variance.
