Overfitting and Regularization: L1, L2, Dropout, and Early Stopping

Overfitting is the most common failure mode in machine learning. Every ML interview will test your ability to recognize it and fix it. Knowing the mechanics of L1, L2, dropout, and early stopping — not just that they exist — is what separates candidates who’ve shipped production models from those who’ve only taken courses.

This post focuses on the practical application of regularization techniques. For the underlying theory, see the bias-variance tradeoff post.

What Overfitting Looks Like

Epoch   Train Loss   Val Loss
  10      0.50         0.55   ← both decreasing, model learning
  50      0.20         0.28   ← good, generalizing well
 100      0.08         0.35   ← gap widening, starting to overfit
 200      0.02         0.65   ← severe overfitting, val loss rising
 300      0.01         0.90   ← memorizing training data

The signal: training loss keeps falling while validation loss starts rising. The model has learned the training data’s noise, not the underlying pattern.

L2 Regularization (Ridge / Weight Decay)

L2 adds a penalty to the loss function equal to the sum of squared weights:

Loss_total = Loss_original + λ × Σ(w²)

Gradient update with L2:
  w = w - α × (∂Loss/∂w + 2λw)
  w = w × (1 - 2αλ) - α × ∂Loss/∂w   ← weight decay term

The (1 - 2αλ) factor shrinks every weight toward zero on each update — that’s why L2 is also called weight decay. Large weights are penalized more than small ones (quadratic penalty). The model is forced to distribute weight across many features rather than relying heavily on a few.
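The two forms of the update above are algebraically identical, which a few lines of pure Python confirm (the weight, gradient, and hyperparameter values here are made up for illustration):

```python
# One L2 gradient step written both ways gives the same result.
w, grad, lr, lam = 1.0, 0.4, 0.1, 0.01   # weight, dLoss/dw, α, λ (hypothetical)

# Form 1: penalty folded into the gradient
w_direct = w - lr * (grad + 2 * lam * w)

# Form 2: decay factor applied first, then the plain gradient step
w_decay = w * (1 - 2 * lr * lam) - lr * grad

print(w_direct, w_decay)   # identical up to float rounding
```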

In PyTorch:

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,
    weight_decay=1e-4   # ← this is λ, the L2 penalty coefficient
)
# weight_decay adds the L2 penalty to the gradient on every step.
# Note: with Adam this is classic L2 regularization, not decoupled weight
# decay — torch.optim.AdamW implements the decoupled version.

When to use L2: Default regularization for neural networks and linear models. Nearly always a good starting point. Doesn’t zero out weights — all features remain in the model, just with smaller coefficients.

Typical λ values: 1e-4 to 1e-2. Tune via cross-validation. Too high → underfitting (model too constrained).

L1 Regularization (Lasso)

L1 adds the sum of absolute values of weights:

Loss_total = Loss_original + λ × Σ|w|

The key difference from L2: L1’s gradient is constant (±λ, depending on sign of w), not proportional to the weight magnitude. This means L1 pushes weights all the way to exactly zero — it produces sparse solutions.

# L1 vs L2 effect on weights (conceptual)
Initial weights: [2.5, 0.3, 1.8, 0.05, 3.1]

After L2 (shrinks all proportionally):
  [1.8, 0.2, 1.3, 0.04, 2.2]   ← all reduced, none eliminated

After L1 (drives small weights to zero):
  [1.9, 0.0, 1.4, 0.00, 2.3]   ← sparse, automatic feature selection
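The zeroing behavior comes from L1's proximal (soft-thresholding) update: any weight whose magnitude falls below the threshold snaps to exactly zero. A pure-Python sketch, using a made-up step size of 0.3:

```python
# Soft-thresholding: the proximal step for the L1 penalty.
def l1_prox(w, step):
    if w > step:
        return w - step
    if w < -step:
        return w + step
    return 0.0   # anything inside [-step, step] snaps to exactly zero

weights = [2.5, 0.3, 1.8, 0.05, 3.1]
pruned = [l1_prox(w, 0.3) for w in weights]
print(pruned)   # the small weights (0.3, 0.05) land exactly at zero
```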

When to use L1:

  • When you suspect many features are irrelevant and want automatic feature selection.
  • High-dimensional sparse data (text features, genomics).
  • When model interpretability matters — a sparse model is easier to explain.

Elastic Net combines both: λ₁Σ|w| + λ₂Σw². Gets sparsity from L1 and stability from L2.
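The sparsity difference shows up directly when you fit both penalties on the same data. A small sketch with scikit-learn (assumed available; the synthetic data and penalty strengths are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 matters

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Lasso zeroes out most irrelevant coefficients; Ridge only shrinks them.
print(np.sum(lasso.coef_ == 0), np.sum(ridge.coef_ == 0))
```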

Dropout

Dropout randomly sets a fraction of neurons to zero during each training forward pass. The fraction is the dropout rate (typically 0.2–0.5).

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(512, 256)
        self.dropout = nn.Dropout(p=0.3)   # 30% of neurons dropped each forward pass
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)    # only during training; disabled at eval time
        return self.fc2(x)

model = MyModel()
model.train()  # dropout active
model.eval()   # dropout disabled — full network used for inference

Why it works: Each training step uses a different random subnetwork. The model can’t co-adapt neurons — it can’t rely on specific neuron combinations that only appear in the training data. Instead, it must learn robust features that work regardless of which neurons are present.

At inference time, dropout is disabled. In the classic formulation, weights are scaled by (1 − dropout_rate) at test time to compensate for all neurons being active; PyTorch and most modern frameworks use inverted dropout instead, scaling the surviving activations up by 1/(1 − dropout_rate) during training, so inference needs no adjustment at all.
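A quick numpy sketch of inverted dropout (the variant PyTorch's nn.Dropout implements) shows why the training-time scaling works: the expected activation matches what the full network produces at eval time. Values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                  # dropout rate
x = np.ones(100_000)                     # stand-in activations
mask = rng.random(x.shape) >= p          # keep each unit with probability 1 - p
train_out = x * mask / (1 - p)           # scale up the survivors

print(train_out.mean())   # ≈ 1.0, same expected value as the un-dropped activations
```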

When to use dropout:

  • Dense layers in large neural networks — very effective.
  • Less effective on convolutional layers (use spatial dropout instead).
  • Dropout rate of 0.5 is standard for fully-connected layers; 0.2 for earlier layers.
  • Not useful for small networks — if the model is already small, making it smaller via dropout is counterproductive.

Early Stopping

Stop training when validation loss stops improving, even if training loss is still decreasing.

best_val_loss = float('inf')
patience = 10   # stop after 10 epochs of no improvement
no_improve_count = 0

for epoch in range(1000):
    train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        no_improve_count = 0
        torch.save(model.state_dict(), 'best_model.pt')   # checkpoint the best model
    else:
        no_improve_count += 1
        if no_improve_count >= patience:
            print(f"Early stopping at epoch {epoch}")
            break

# Load the best checkpoint, not the last
model.load_state_dict(torch.load('best_model.pt'))

Key implementation detail: Save the model at the best validation loss, not the last epoch. When early stopping triggers, restore that saved checkpoint — not the overfit final state.

When to use: Almost always, as a cheap safety net alongside other regularization. Computationally free — you’re not doing extra work, you’re doing less.

Data Augmentation

If you can’t get more real data, synthetically expand what you have. Augmentation makes the model invariant to transformations that shouldn’t change the label.

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),         # cat is still a cat when flipped
    transforms.RandomRotation(15),             # ±15 degrees rotation
    transforms.ColorJitter(brightness=0.2),    # lighting variation
    transforms.RandomCrop(224, padding=4),     # slight position variation
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])

For text: synonym replacement, random word insertion/deletion, back-translation (translate to French and back). For audio: time stretching, pitch shifting, adding background noise.
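Random deletion is simple enough to sketch in a few lines of pure Python (the deletion probability and sample sentence are made-up illustrations):

```python
import random

# Toy random-deletion augmenter for text.
def random_delete(words, p=0.2, seed=0):
    rng = random.Random(seed)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]   # never return an empty sentence

sentence = "the cat sat on the mat".split()
print(random_delete(sentence))   # same sentence with some words dropped
```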

Batch Normalization

Batch norm normalizes activations within each mini-batch. It reduces internal covariate shift and has a mild regularizing effect (adding noise from batch statistics). It’s not primarily a regularization technique, but it often reduces the need for dropout.

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(512, 256)
        self.bn = nn.BatchNorm1d(256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = self.bn(x)         # normalize activations
        x = torch.relu(x)
        return self.fc2(x)
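The normalization itself is easy to verify in numpy: in training mode, batch norm standardizes each feature using the mini-batch's own mean and variance (the batch values here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # batch of 32, 4 features
eps = 1e-5                                          # numerical-stability term
x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

print(x_hat.mean(axis=0))   # ≈ 0 per feature
print(x_hat.std(axis=0))    # ≈ 1 per feature
```

(The learnable scale γ and shift β that follow this step are omitted for brevity.)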

Choosing the Right Technique

| Symptom | First try | Also try |
|---|---|---|
| Large train/val gap, big model | Dropout 0.3–0.5 | L2, more data |
| Large train/val gap, linear model | L2 (Ridge) | More data |
| Too many features, want sparsity | L1 (Lasso) | Elastic Net |
| Training too long, val loss rising | Early stopping | LR scheduler |
| Small dataset, images | Data augmentation | Transfer learning |
| Small dataset, tabular | L2 + cross-validation | Simpler model |

Interview Answer Template

When asked “your model is overfitting — what do you do?”:

“First I’d check the severity — how large is the train/validation gap? Then I’d try, roughly in order: (1) collect more data if possible, (2) add dropout if it’s a neural net or L2 regularization if it’s a linear model, (3) add early stopping to find the optimal stopping point, (4) data augmentation if it’s an image/audio problem, (5) reduce model complexity as a last resort since it can introduce bias. I’d monitor validation loss throughout and tune the regularization strength via cross-validation.”

Summary

L2 regularization shrinks all weights toward zero proportionally — the standard default for neural networks. L1 drives weights to exactly zero, giving sparse models and automatic feature selection. Dropout randomly disables neurons during training, forcing robust feature learning. Early stopping prevents the model from training past the point where generalization improves. Data augmentation synthetically expands training data. These techniques are complementary — most production models use at least two of them simultaneously.

Related ML Topics

  • Bias-Variance Tradeoff — the theoretical foundation; regularization is the practical toolkit for managing it.
  • Gradient Descent — L2 regularization modifies the gradient update rule directly (weight decay term).

See also: How Backpropagation Works — understanding the forward/backward pass helps explain why L2 regularization is applied to weights during the update step.

See also: Classification Metrics: Precision, Recall, F1, and AUC-ROC — how to measure whether regularization actually improved your model’s generalization on the validation set.

See also: Feature Selection and Dimensionality Reduction — L1 regularization as embedded feature selection, and how removing features controls model complexity.

See also: Fine-tuning LLMs vs Training from Scratch — catastrophic forgetting during LLM fine-tuning, and how low learning rates and mixed training data control it.

See also: How Does RLHF Work? — reward model overfitting is a central failure mode; KL penalty as a regularizer against policy drift from the reference model.
