Backpropagation is the algorithm that makes neural network training possible. Understanding it at the level that interviews require — not just “it computes gradients” but how and why — is essential for ML engineering and research roles. This guide covers the math, the intuition, common interview questions, and implementation.
## The Core Idea in One Sentence
Backpropagation computes the gradient of the loss with respect to every weight in the network by applying the chain rule of calculus layer by layer, moving backward from the output.
## Forward Pass
During the forward pass, data flows from input to output. Each layer applies a linear transformation followed by a non-linear activation:
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)

class Linear:
    def __init__(self, in_dim, out_dim):
        # He initialization for ReLU
        self.W = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)
        self.b = np.zeros(out_dim)

    def forward(self, x):
        self.x = x  # cache for backward pass
        return x @ self.W + self.b  # (batch, out_dim)
```
At the output layer, compute the loss. For classification with cross-entropy:
```python
def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_loss(y_pred_probs, y_true):
    # y_true: one-hot encoded
    n = y_pred_probs.shape[0]
    return -np.sum(y_true * np.log(y_pred_probs + 1e-8)) / n
```
## The Chain Rule
The chain rule says: if L = f(u) and u = g(x), then dL/dx = (dL/du) * (du/dx). Applying it repeatedly differentiates any composition of functions by multiplying local derivatives.
In a neural network with layers A → B → C → Loss, the gradient of the loss with respect to A is:
dL/dA = dL/dC * dC/dB * dB/dA
Backpropagation computes this efficiently by reusing gradients already computed for layers closer to the output. Each layer only needs to compute its local gradient and multiply it by the upstream gradient (the gradient flowing in from the layer above).
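As a sanity check, the chain rule can be verified numerically with a central difference. This is a toy sketch; the functions are made up purely for illustration:

```python
# Toy chain-rule check: L = g(x)^2 with g(x) = 3x + 1,
# so dL/dx = (dL/dg) * (dg/dx) = 2*g(x) * 3. At x = 2: g = 7, dL/dx = 42.
def g(x):
    return 3 * x + 1

def L(x):
    return g(x) ** 2

x = 2.0
analytic = 2 * g(x) * 3                          # chain rule
eps = 1e-6
numeric = (L(x + eps) - L(x - eps)) / (2 * eps)  # central difference
print(analytic, round(numeric, 3))               # both 42.0
```

This same central-difference trick scales up to gradient-checking full layers.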
## Backward Pass
```python
class Linear:
    # ... __init__ and forward as above ...
    def backward(self, d_out):
        # d_out: gradient of loss w.r.t. this layer's output, shape (batch, out_dim)
        # Gradient w.r.t. weights: input transposed times upstream gradient
        self.dW = self.x.T @ d_out   # (in_dim, out_dim)
        self.db = d_out.sum(axis=0)  # (out_dim,)
        # Gradient w.r.t. input (to pass to the previous layer)
        d_input = d_out @ self.W.T   # (batch, in_dim)
        return d_input
```
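These matrix rules are easy to verify with a numerical gradient check. The sketch below is self-contained (it restates the weight-gradient rule inline) and uses a simple scalar loss L = sum(x @ W + b) so the check stays short:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

def loss(W):
    return np.sum(x @ W + b)

d_out = np.ones((4, 2))    # dL/d(output) for L = sum(output)
dW_analytic = x.T @ d_out  # the rule from Linear.backward

# Numerical gradient via central differences, one entry at a time
eps = 1e-6
dW_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(dW_analytic - dW_numeric)))  # tiny (float noise only)
```

Gradient checking like this is the standard way to debug a hand-written backward pass.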
## Full Two-Layer Network Example
```python
class TwoLayerNet:
    def __init__(self, input_dim, hidden_dim, output_dim, lr=0.01):
        self.fc1 = Linear(input_dim, hidden_dim)
        self.fc2 = Linear(hidden_dim, output_dim)
        self.lr = lr

    def forward(self, X, y):
        # Layer 1
        z1 = self.fc1.forward(X)
        a1 = relu(z1)
        self.z1, self.a1 = z1, a1
        # Layer 2
        z2 = self.fc2.forward(a1)
        probs = softmax(z2)
        self.probs = probs
        loss = cross_entropy_loss(probs, y)
        return loss

    def backward(self, y):
        # Gradient of cross-entropy + softmax (combined formula)
        n = y.shape[0]
        d_z2 = (self.probs - y) / n  # (batch, output_dim)
        # Backprop through fc2
        d_a1 = self.fc2.backward(d_z2)
        # Backprop through ReLU
        d_z1 = d_a1 * relu_grad(self.z1)
        # Backprop through fc1
        self.fc1.backward(d_z1)

    def update(self):
        # SGD weight update
        self.fc1.W -= self.lr * self.fc1.dW
        self.fc1.b -= self.lr * self.fc1.db
        self.fc2.W -= self.lr * self.fc2.dW
        self.fc2.b -= self.lr * self.fc2.db
```
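To see the whole loop converge, here is a compact, self-contained training run on a toy XOR-like problem (it restates the same forward/backward math inline so the snippet runs on its own; the data, sizes, and learning rate are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
labels = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR-like, not linearly separable
y = np.eye(2)[labels]                          # one-hot

# He-initialized parameters for a 2 -> 16 -> 2 network
W1 = rng.standard_normal((2, 16)) * np.sqrt(2.0 / 2)
b1 = np.zeros(16)
W2 = rng.standard_normal((16, 2)) * np.sqrt(2.0 / 16)
b2 = np.zeros(2)
lr = 0.3

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

losses = []
for step in range(1000):
    # forward
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)                     # ReLU
    probs = softmax(a1 @ W2 + b2)
    losses.append(-np.mean(np.sum(y * np.log(probs + 1e-8), axis=1)))
    # backward
    d_z2 = (probs - y) / len(X)                # combined softmax + CE gradient
    dW2, db2 = a1.T @ d_z2, d_z2.sum(axis=0)
    d_z1 = (d_z2 @ W2.T) * (z1 > 0)            # through ReLU
    dW1, db1 = X.T @ d_z1, d_z1.sum(axis=0)
    # SGD update
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")  # loss should fall sharply
```

The loss starts near ln(2) ≈ 0.693 (random guessing between two classes) and drops as the hidden layer learns the non-linear boundary.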
## Automatic Differentiation (How PyTorch Does It)
Modern frameworks use automatic differentiation: they build a computation graph during the forward pass and traverse it in reverse during the backward pass, so you rarely need to implement backprop by hand.
```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop (dataloader yields (X, y) mini-batches)
for X, y in dataloader:
    optimizer.zero_grad()        # clear previous gradients
    logits = model(X)            # forward pass
    loss = criterion(logits, y)  # compute loss
    loss.backward()              # backpropagation (computes all gradients)
    optimizer.step()             # update weights using computed gradients
```
## Vanishing and Exploding Gradients
### Vanishing Gradients
With sigmoid or tanh activations, the activation saturates at large-magnitude inputs, so its gradient there is near zero. When these small values are multiplied across many layers, the product shrinks exponentially toward zero — early layers learn extremely slowly.
```python
import numpy as np

# Sigmoid gradient at saturation
def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

# At x=5:  gradient ≈ 0.0066 — nearly zero
# At x=10: gradient ≈ 0.00005 — effectively zero
# After 10 sigmoid layers: 0.0066^10 ≈ 2e-22
print(sigmoid_grad(5))  # 0.0066
```
Solutions:
- Use ReLU (gradient = 1 for positive values) — no saturation in the positive range
- Batch Normalization — normalizes layer inputs, keeps activations in a useful range
- Residual connections (ResNet) — gradients flow directly through skip connections
- LSTM/GRU gating for RNNs — gates control gradient flow over time
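To see why ReLU helps, compare the product of activation derivatives across depth. This sketch assumes the same pre-activation value (2.0) at every layer purely to illustrate the scaling:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

x, depth = 2.0, 20
sig_product = sigmoid_grad(x) ** depth  # ~0.105^20: vanishes
relu_product = 1.0 ** depth             # ReLU derivative is 1 for x > 0
print(sig_product, relu_product)
```

Even at a mild input of 2.0 (far from full saturation), twenty sigmoid layers multiply the gradient down to numerical dust, while the ReLU path passes it through unchanged.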
### Exploding Gradients
If weights are initialized too large or the network is very deep, gradients can grow exponentially. Symptoms: loss becomes NaN, weights diverge.
Solutions: gradient clipping (`torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`), careful weight initialization (Xavier/He), and batch normalization.
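Global-norm clipping is simple enough to sketch by hand. This numpy version mirrors the idea behind `clip_grad_norm_` (the gradient values are hypothetical):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Scale ALL gradients by one factor so their combined norm <= max_norm;
    # this preserves the direction of the overall update.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(clipped_norm)  # ≈ 1.0
```

Scaling everything by a single factor (rather than clipping each tensor independently) is what keeps the update direction intact.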
## Weight Initialization Matters
| Activation | Initialization | Formula |
|---|---|---|
| Sigmoid / tanh | Xavier (Glorot) | Uniform in [-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out))] |
| ReLU | He (Kaiming) | Normal with std = sqrt(2 / n_in) |
| Linear output | Small uniform | Uniform in [-0.1, 0.1] |
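The two main schemes from the table translate directly into code (a minimal sketch; function names are my own):

```python
import numpy as np

def xavier_uniform(n_in, n_out, seed=0):
    # Uniform in [-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))]
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out, seed=0):
    # Normal with std = sqrt(2 / n_in)
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

W = he_normal(512, 256)
print(W.std())  # close to sqrt(2/512) ≈ 0.0625
```

Both schemes aim to keep activation variance roughly constant from layer to layer, which is exactly what prevents the gradient products above from vanishing or exploding at initialization.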
## Interview Questions and Answers
### Why do we need non-linear activations?
Without non-linearities, stacking linear layers is equivalent to a single linear layer. The composition of linear functions is linear. Non-linear activations (ReLU, sigmoid, tanh) allow the network to learn complex, non-linear decision boundaries — universal function approximation requires non-linearity.
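The collapse is easy to demonstrate numerically: two stacked linear layers compute exactly the same function as one merged linear layer (sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8)); b1 = rng.standard_normal(8)
W2 = rng.standard_normal((8, 3)); b2 = rng.standard_normal(3)
x = rng.standard_normal((5, 4))

two_layers = (x @ W1 + b1) @ W2 + b2
# Equivalent single layer: W = W1 @ W2, b = b1 @ W2 + b2
one_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_layers, one_layer))  # True
```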
### What is the difference between gradient descent, SGD, and Adam?
Gradient descent computes gradients on the entire dataset before updating — impractical for large datasets. Stochastic Gradient Descent (SGD) updates weights after each sample (or mini-batch), introducing noise that can help escape local minima. Adam (Adaptive Moment Estimation) maintains a running average of past gradients (momentum) and past squared gradients (adaptive learning rate) — it adapts the learning rate per parameter and typically converges faster with less tuning than vanilla SGD.
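The Adam update rule is short enough to write out. This is a sketch of a single step on one parameter array, with the standard default hyperparameters; the gradient values are made up:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2   # 2nd moment (squared grads)
    m_hat = m / (1 - beta1 ** t)           # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
g = np.array([0.5, -0.1])                  # hypothetical gradient
w, m, v = adam_step(w, g, m, v, t=1)
print(w)  # on step 1 each parameter moves ~lr against its gradient sign
```

Note the per-parameter scaling by `sqrt(v_hat)`: parameters with consistently large gradients get smaller effective steps, which is the "adaptive" part of Adam.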
### What happens if you set the learning rate too high or too low?
Too high: weights overshoot the minimum, and the loss oscillates or diverges (NaN). Too low: training converges very slowly and can take too long to be practical; it may also settle into a poor local minimum. Learning rate warmup (start low, increase to the target rate over the first few epochs) stabilizes early training when gradients are large and noisy.
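A warmup schedule is just a function of the step count. A minimal sketch of linear warmup to a constant rate (the numbers are illustrative):

```python
def lr_at_step(step, base_lr=1e-3, warmup_steps=1000):
    # Ramp linearly from ~0 to base_lr over warmup_steps, then hold.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

print(lr_at_step(0), lr_at_step(999), lr_at_step(5000))
# tiny at first, base_lr by the end of warmup, constant after
```

In practice this is usually combined with a decay phase (cosine or step decay) after the warmup ends.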
### What is a computational graph and why does it matter for backprop?
A computational graph is a directed acyclic graph where nodes represent operations and edges represent data (tensors) flowing between them. PyTorch builds this graph dynamically during the forward pass. During backward(), it traverses the graph in reverse, calling each operation's backward function with the upstream gradient. This design allows arbitrary model architectures to be differentiated automatically — the user never needs to implement gradients manually.
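The mechanism can be captured in a toy scalar autograd. This is a deliberately minimal sketch of the same idea PyTorch applies to tensors, not the real API:

```python
class Var:
    def __init__(self, value):
        self.value, self.grad, self.parents = value, 0.0, ()

    def __mul__(self, other):
        out = Var(self.value * other.value)
        # Record local gradients: d(out)/d(self) = other.value, and vice versa
        out.parents = ((self, other.value), (other, self.value))
        return out

    def __add__(self, other):
        out = Var(self.value + other.value)
        out.parents = ((self, 1.0), (other, 1.0))
        return out

    def backward(self, upstream=1.0):
        # Accumulate the upstream gradient, then push it to parents,
        # multiplied by each recorded local gradient (the chain rule).
        self.grad += upstream
        for parent, local_grad in self.parents:
            parent.backward(upstream * local_grad)

x, y = Var(3.0), Var(4.0)
z = x * y + x          # z = x*y + x, so dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # 5.0 3.0
```

Each operation stores only its local gradients during the forward pass; `backward()` then threads the chain rule through the recorded graph, accumulating into `.grad` exactly as PyTorch does.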
## Frequently Asked Questions
### What is backpropagation and how does it work?
Backpropagation is the algorithm for computing gradients in a neural network. It has two phases: (1) Forward pass — data flows from input through each layer, computing activations, until the output layer produces predictions. A loss function compares predictions to ground truth. (2) Backward pass — starting from the loss, the chain rule of calculus computes the gradient of the loss with respect to each weight, layer by layer, moving backward through the network. Each layer computes its local gradient and multiplies by the upstream gradient flowing in from the layer ahead. These gradients are then used by an optimizer (SGD, Adam) to update weights in the direction that reduces the loss.
### What is the vanishing gradient problem and how is it solved?
Vanishing gradients occur when gradients shrink exponentially as they propagate backward through deep networks. With sigmoid or tanh activations, the gradient of the activation is near zero at saturated inputs (large positive or negative values). Multiplied across many layers, gradients become effectively zero, preventing early layers from learning. Solutions: (1) ReLU activation — gradient is 1 for positive inputs, not saturated; (2) Residual connections (ResNet) — skip connections allow gradients to flow directly without passing through activation functions; (3) Batch normalization — normalizes layer inputs to prevent saturation; (4) LSTM/GRU gating for recurrent networks — learned gates control gradient flow over time.
### What is the difference between gradient descent, SGD, and Adam?
Gradient descent computes the exact gradient on the entire dataset before each weight update — prohibitively slow for large datasets. Stochastic Gradient Descent (SGD) approximates the gradient using a single sample or mini-batch per update, introducing noise that can help escape local minima and dramatically speeds up training. Adam (Adaptive Moment Estimation) improves on SGD by maintaining exponential moving averages of past gradients (momentum) and past squared gradients (adaptive per-parameter learning rates). Adam typically converges faster with less learning rate tuning and is the default optimizer for most deep learning tasks. SGD with momentum and careful learning rate schedules can match or beat Adam on some tasks like image classification.