Gradient descent is the engine behind nearly every machine learning model trained today. If you are interviewing for an ML engineering, data science, or research role, expect a question about it. The interviewer might ask you to explain it from scratch, describe the difference between variants, or diagnose why a model is failing to converge.
Strategy
Don’t just recite the update rule. Walk through the intuition first — why gradient descent works, what it’s actually doing geometrically — then layer on the variants and their trade-offs. End with practical guidance: when you’d pick Adam over SGD and why.
The Core Idea
Training a model means finding the parameter values (weights) that minimize a loss function — a measure of how wrong the model’s predictions are. The loss function defines a landscape over the parameter space: valleys are low loss (good), peaks are high loss (bad). We want to reach a valley.
Gradient descent does this iteratively:
- Start at some random point in the loss landscape (random weight initialization).
- Compute the gradient of the loss with respect to every parameter — the direction of steepest ascent.
- Step in the opposite direction (downhill) by a small amount (the learning rate).
- Repeat until the loss stops decreasing.
The update rule for a single parameter θ:
θ = θ - α * ∂L/∂θ
where:
α = learning rate (step size)
∂L/∂θ = gradient of loss L with respect to θ
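The update rule in action, as a minimal sketch: a toy one-parameter loss L(θ) = (θ - 3)², whose gradient is 2(θ - 3). The loss, starting point, and step count are illustrative choices.

```python
# Toy loss L(theta) = (theta - 3)^2, with analytic gradient 2 * (theta - 3).
def gradient(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # "random" starting point
alpha = 0.1   # learning rate (step size)
for _ in range(100):
    theta = theta - alpha * gradient(theta)
# theta ends very close to the minimum at 3
```

Each step shrinks the distance to the minimum by a constant factor (1 - 2α) here, which is why a reasonable α converges quickly.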
The Three Variants
Batch Gradient Descent (Full GD)
Compute the gradient over the entire training dataset before taking one step.
for epoch in range(num_epochs):
    gradient = compute_gradient(all_training_data, weights)
    weights = weights - learning_rate * gradient
Pros: Stable, accurate gradient estimate. Converges smoothly.
Cons: Computing the gradient over millions of examples before taking one step is prohibitively slow. Uses enormous memory. Unusable for large datasets.
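To make the pseudocode concrete, here is a hedged toy version: full-batch gradient descent fitting a one-parameter linear model y ≈ w·x by least squares. The dataset, loss, and constants are illustrative, not part of the original.

```python
# Toy dataset with true slope w = 2 (no noise, for clarity).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def compute_gradient(dataset, w):
    # Gradient of mean squared error (w*x - y)^2, averaged over ALL examples.
    return sum(2 * (w * x - y) * x for x, y in dataset) / len(dataset)

w, learning_rate = 0.0, 0.05
for epoch in range(200):
    w = w - learning_rate * compute_gradient(data, w)
# w converges smoothly toward 2.0
```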
Stochastic Gradient Descent (SGD)
Compute the gradient and take a step after each individual example.
for epoch in range(num_epochs):
    shuffle(training_data)
    for x, y in training_data:
        gradient = compute_gradient(x, y, weights)
        weights = weights - learning_rate * gradient
Pros: Very fast updates — one step per example. Can escape local minima due to noisy gradient estimates (the noise helps exploration).
Cons: High variance in gradient estimates. The loss oscillates wildly instead of decreasing smoothly. Can overshoot minima.
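The same toy regression, trained one example at a time. The dataset and hyperparameters are illustrative; note the much smaller learning rate, since each step is based on a single (noisier) estimate.

```python
import random

random.seed(0)
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]  # true slope w = 2

def grad_single(x, y, w):
    # Gradient of (w*x - y)^2 for ONE example.
    return 2 * (w * x - y) * x

w, lr = 0.0, 0.01
for epoch in range(100):
    random.shuffle(data)
    for x, y in data:           # one update per example
        w = w - lr * grad_single(x, y, w)
```

With a noiseless dataset like this one every per-example gradient points at the same minimum, so it converges cleanly; with real, noisy data the per-example updates jitter around the minimum instead.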
Mini-Batch Gradient Descent
The standard in practice. Compute the gradient over a small batch (typically 32–512 examples) and take a step.
for epoch in range(num_epochs):
    for batch in get_batches(training_data, batch_size=32):
        gradient = compute_gradient(batch, weights)
        weights = weights - learning_rate * gradient
Pros: Balances accuracy (better gradient estimate than SGD) and speed (faster than full GD). Efficient on GPU hardware (matrix operations on batches are heavily optimized). Moderate variance provides regularization benefit.
Cons: Another hyperparameter to tune (batch size). Batch size affects training dynamics in non-obvious ways.
When people say “SGD” in deep learning, they almost always mean mini-batch SGD.
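A runnable version of the mini-batch loop. `get_batches` is an illustrative helper, not a standard library function: it shuffles the data and yields consecutive slices.

```python
import random

random.seed(1)
data = [(float(x), 2.0 * x) for x in range(1, 9)]  # true slope w = 2

def get_batches(dataset, batch_size):
    random.shuffle(dataset)  # shuffles in place, then yields slices
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]

def batch_gradient(batch, w):
    # Average gradient of (w*x - y)^2 over one mini-batch.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w, lr = 0.0, 0.005
for epoch in range(200):
    for batch in get_batches(data, batch_size=4):
        w = w - lr * batch_gradient(batch, w)
```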
The Learning Rate Problem
The learning rate α is the most important hyperparameter. Too large: the optimizer overshoots minima, loss diverges. Too small: training takes forever.
- α too large: loss explodes (NaN values or wild bouncing).
- α too small: loss decreases, but training takes 10× longer than it should.
- α just right: loss decreases smoothly and quickly.
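The failure mode is easy to demonstrate on the toy loss L(θ) = θ² (gradient 2θ), where each step multiplies θ by (1 - 2α); the specific values below are illustrative.

```python
def run(alpha, steps=50):
    theta = 1.0
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # gradient of theta^2 is 2*theta
    return abs(theta)

too_large = run(1.1)   # |1 - 2*1.1| = 1.2 per step: the iterate blows up
just_right = run(0.1)  # |1 - 2*0.1| = 0.8 per step: smooth decay toward 0
```

Anything with |1 - 2α| > 1 diverges geometrically, which is exactly the "loss explodes" regime above.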
This motivated the development of adaptive learning rate methods — optimizers that automatically adjust the learning rate per parameter based on gradient history.
Adaptive Optimizers
Momentum
Maintain a velocity vector (an exponentially weighted average of past gradients) and step along it instead of the raw gradient. Momentum helps the optimizer roll through flat regions and small local minima, like a ball rolling downhill.
velocity = 0
for batch in batches:
    gradient = compute_gradient(batch, weights)
    velocity = β * velocity + (1 - β) * gradient  # β ≈ 0.9
    weights = weights - α * velocity
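A runnable version of the momentum pseudocode on a toy quadratic loss L(w) = (w - 3)². The single-parameter setup and constants are illustrative.

```python
def gradient(w):
    return 2.0 * (w - 3.0)   # gradient of (w - 3)^2

w, velocity = 0.0, 0.0
alpha, beta = 0.1, 0.9
for step in range(300):
    g = gradient(w)
    velocity = beta * velocity + (1 - beta) * g  # moving average of gradients
    w = w - alpha * velocity
# w spirals in toward the minimum at 3 (slight overshoot, then damped)
```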
RMSProp
Adapt the learning rate per parameter by dividing by the root mean square of recent gradients. Parameters with large gradients get smaller steps; parameters with small gradients get larger steps. Prevents the optimizer from moving too fast along directions with high curvature.
cache = 0
for batch in batches:
    gradient = compute_gradient(batch, weights)
    cache = β * cache + (1 - β) * gradient²
    weights = weights - (α / (sqrt(cache) + ε)) * gradient
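The same toy quadratic makes the RMSProp pseudocode runnable. Note the effective step size is roughly α regardless of the gradient's raw magnitude, because the gradient is normalized by its recent root mean square; all constants here are illustrative.

```python
import math

def gradient(w):
    return 2.0 * (w - 3.0)   # gradient of (w - 3)^2

w, cache = 0.0, 0.0
alpha, beta, eps = 0.01, 0.9, 1e-8
for step in range(2000):
    g = gradient(w)
    cache = beta * cache + (1 - beta) * g * g        # running mean of g^2
    w = w - (alpha / (math.sqrt(cache) + eps)) * g   # normalized step
# w walks to the minimum at ~alpha per step, then hovers near 3
```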
Adam (Adaptive Moment Estimation)
Combines momentum (first moment: a running mean of gradients) and RMSProp (second moment: a running mean of squared gradients). The dominant optimizer for deep learning.
m = 0  # first moment (momentum)
v = 0  # second moment (mean of squared gradients)
t = 0  # time step
for batch in batches:
    t += 1
    gradient = compute_gradient(batch, weights)
    m = β1 * m + (1 - β1) * gradient    # β1 ≈ 0.9
    v = β2 * v + (1 - β2) * gradient²   # β2 ≈ 0.999
    # Bias correction (important for early steps when m, v ≈ 0)
    m_hat = m / (1 - β1**t)
    v_hat = v / (1 - β2**t)
    weights = weights - α * m_hat / (sqrt(v_hat) + ε)  # ε ≈ 1e-8
Adam’s advantage: Works well with default hyperparameters (α=0.001, β1=0.9, β2=0.999) across a wide range of models. You rarely need to tune it much to get good initial results.
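The pseudocode above translates directly to runnable Python on the same toy quadratic; the loss and the larger-than-default α are illustrative choices for a quick demo.

```python
import math

def gradient(w):
    return 2.0 * (w - 3.0)   # gradient of (w - 3)^2

w = 0.0
m, v, t = 0.0, 0.0, 0
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for _ in range(2000):
    t += 1
    g = gradient(w)
    m = beta1 * m + (1 - beta1) * g          # first moment
    v = beta2 * v + (1 - beta2) * g * g      # second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
# w approaches the minimum at 3 and settles near it
```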
SGD vs. Adam: Which to Use?
This is a common follow-up question. The answer is nuanced:
Adam: Converges faster. Better default choice for most deep learning tasks, especially early in training or when tuning other hyperparameters. PyTorch and TensorFlow default to Adam for most tutorials.
SGD with momentum: Often achieves better final accuracy than Adam when the learning rate is carefully tuned with a schedule (cosine annealing, learning rate warmup). Many state-of-the-art image classification models (ResNet, EfficientNet) are trained with SGD + momentum + learning rate schedule. The added tuning work is worth it for production models where you care about squeezing out the last 0.5% accuracy.
Practical guidance:
- Prototyping / early experiments → Adam. Fast convergence, works out of the box.
- Fine-tuning a production model → Adam or SGD with a learning rate schedule.
- NLP / Transformers → Adam (or AdamW, which decouples weight decay from the adaptive gradient update).
- Computer vision training from scratch → SGD + momentum is still competitive.
Learning Rate Schedules
A fixed learning rate is rarely optimal. Common schedules:
- Step decay: Halve the learning rate every N epochs.
- Cosine annealing: Gradually decrease α following a cosine curve. Widely used for image models.
- Warmup + decay: Start with a tiny learning rate, increase linearly for the first few thousand steps (warmup), then decay. Standard for Transformers — prevents unstable early updates when gradients are large.
- Cyclical learning rates (CLR): Oscillate between a min and max learning rate. Can escape local minima.
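The first three schedules can be sketched as plain functions of the step or epoch; the base rate, horizon, and warmup length below are illustrative constants, not recommendations.

```python
import math

base_lr, total_steps, warmup_steps = 0.1, 1000, 100  # illustrative values

def step_decay(epoch, drop_every=10):
    # Halve the learning rate every `drop_every` epochs.
    return base_lr * (0.5 ** (epoch // drop_every))

def cosine_annealing(step):
    # Decay from base_lr to 0 along a half cosine over total_steps.
    return base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))

def warmup_then_cosine(step):
    # Linear warmup from 0 to base_lr, then cosine decay over the remainder.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * frac))
```

In practice you would feed a function like these into your framework's scheduler hook rather than hand-roll the loop.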
Why Does Training Diverge?
Common reasons and fixes:
- Learning rate too high: Loss explodes or NaN. Fix: reduce α by 10×.
- Vanishing gradients: Gradients shrink to near-zero in early layers (deep networks, saturating activations like sigmoid). Fix: use ReLU activations, batch normalization, residual connections.
- Exploding gradients: Gradients grow exponentially (especially in RNNs). Fix: gradient clipping (clip_grad_norm).
- Bad weight initialization: Symmetry breaking fails; all neurons learn the same thing. Fix: Xavier/He initialization.
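Gradient clipping by global norm (the idea behind PyTorch's clip_grad_norm) is simple enough to sketch directly: if the combined L2 norm of all gradients exceeds a threshold, rescale them so it equals the threshold. This pure-Python version on a flat list is illustrative.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients together so their L2 norm is at most max_norm.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return grads

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm was 5.0
small = clip_by_global_norm([0.3, 0.4], max_norm=1.0)    # untouched
```

Scaling the whole gradient vector (rather than clipping each element) preserves the update direction while bounding its magnitude.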
Interview Questions You Should Be Ready For
Q: What is the difference between batch GD, SGD, and mini-batch GD?
Batch size. Batch GD uses all examples (slow, stable). SGD uses one example (fast, noisy). Mini-batch is the practical middle ground used in production.
Q: Why use Adam over SGD?
Adam adapts learning rates per parameter and converges faster with less tuning. SGD with a good learning rate schedule can match or beat Adam’s final accuracy but requires more work.
Q: What is the learning rate warmup used in Transformers?
At the start of training, weights are random and gradients are large and noisy. Warmup starts at a tiny learning rate and increases it linearly over the first few thousand steps, which prevents large, destructive early updates. After warmup, the learning rate decays.
Q: What’s the difference between a local minimum and a saddle point?
A local minimum has positive curvature in all directions. A saddle point has positive curvature in some directions and negative curvature in others; plain gradient descent can stall near one, because the gradient shrinks toward zero there. In high-dimensional spaces, saddle points are far more common than true local minima. Momentum and noise (from mini-batches) help escape them.
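A tiny demonstration of the escape dynamics, on the classic saddle f(x, y) = x² - y² with its saddle point at the origin (the function and step counts are illustrative). Started exactly on the x-axis, gradient descent walks straight into the saddle; a microscopic perturbation in y, standing in for mini-batch noise, is amplified and escapes along the negative-curvature direction.

```python
def step(x, y, lr=0.1):
    # grad f = (2x, -2y) for f(x, y) = x^2 - y^2
    return x - lr * 2 * x, y - lr * (-2 * y)

x, y = 1.0, 0.0          # exactly on the stable manifold: converges to the saddle
for _ in range(100):
    x, y = step(x, y)

xp, yp = 1.0, 1e-6       # tiny noise in y: repelled from the saddle
for _ in range(100):
    xp, yp = step(xp, yp)
# (x, y) ends at ~(0, 0); yp has grown by a factor of 1.2 per step
```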
Summary
Gradient descent finds model parameters that minimize loss by repeatedly stepping in the direction of the negative gradient. Mini-batch SGD is the practical standard — it balances computation efficiency with gradient accuracy. Adam is the default optimizer for most deep learning tasks because it adapts learning rates automatically and converges quickly. SGD with momentum and a learning rate schedule remains competitive for vision models when you’re willing to tune carefully. Know the failure modes — vanishing/exploding gradients, divergence, saddle points — and their standard fixes.
See also: How Backpropagation Works — the algorithm that computes the gradients that SGD and Adam consume each step.