Backpropagation is the algorithm that makes training deep neural networks possible. Every interviewer for ML engineering or research roles expects you to explain it. But what they’re actually testing is whether you understand the chain rule, why vanishing gradients happen, and how modern architectures (ReLU, residual connections) address them — not just that you’ve memorized “it computes gradients.”
Strategy
Start with the goal: we want to know how much each weight contributed to the error so we can adjust it. Then explain how the chain rule makes this tractable even for networks with billions of parameters. Finish with the practical implications: vanishing/exploding gradients and how they’re fixed.
The Goal: Computing Gradients
Training a neural network means finding weights that minimize the loss function. To minimize with gradient descent, we need the gradient of the loss with respect to every weight:
∂L/∂w_ij for every weight w_ij in the network
A neural network with 10 layers and millions of weights has millions of these gradients to compute. Backpropagation does this efficiently in a single backward pass using the chain rule.
The Chain Rule
If z = f(y) and y = g(x), then:
dz/dx = dz/dy × dy/dx
In a neural network, the loss L depends on the output, which depends on layer N, which depends on layer N-1, which depends on… the first weight. The chain rule lets us decompose this long dependency into a product of local gradients — each layer only needs to know its own gradient and the gradient flowing in from the layer above.
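The decomposition is easy to verify numerically. A minimal sketch (my own toy example, not from an interview): take z = sin(y) with y = x², multiply the two local gradients, and compare against a finite-difference estimate:

```python
import math

x = 1.5
y = x ** 2           # inner function: y = g(x) = x^2
z = math.sin(y)      # outer function: z = f(y) = sin(y)

# Local gradients
dz_dy = math.cos(y)  # dz/dy = cos(y)
dy_dx = 2 * x        # dy/dx = 2x

# Chain rule: dz/dx = dz/dy * dy/dx
dz_dx = dz_dy * dy_dx

# Central-difference estimate of the same derivative
eps = 1e-6
numeric = (math.sin((x + eps) ** 2) - math.sin((x - eps) ** 2)) / (2 * eps)
print(dz_dx, numeric)  # the two values agree to several decimal places
```

Each factor only required local knowledge of one function, which is exactly the property backpropagation exploits layer by layer.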
Forward Pass
During the forward pass, compute activations layer by layer and store them. These stored activations are needed during the backward pass.
import numpy as np

def forward(X, weights, biases):
    cache = {'A0': X}  # input layer
    A = X
    for l, (W, b) in enumerate(zip(weights, biases), 1):
        Z = A @ W + b         # linear: Z = AW + b
        A = np.maximum(0, Z)  # ReLU activation (simplification: real classifiers use softmax on the final layer)
        cache[f'Z{l}'] = Z
        cache[f'A{l}'] = A
    return A, cache  # return output and ALL intermediate activations
The final output A is compared to the true label Y to compute the loss:
def compute_loss(A_out, Y):
    # Cross-entropy loss for classification
    m = Y.shape[0]
    loss = -np.sum(Y * np.log(A_out + 1e-9)) / m
    return loss
Backward Pass
Starting from the loss, propagate gradients backward through each layer using the chain rule.
def backward(Y, weights, cache):
    gradients = {}
    m = Y.shape[0]
    L = len(weights)
    # Gradient of loss w.r.t. final layer output (the 1/m factor is applied in dW and db below)
    dA = -(Y / (cache[f'A{L}'] + 1e-9))
    for l in range(L, 0, -1):
        A_prev = cache[f'A{l-1}']
        Z = cache[f'Z{l}']
        W = weights[l-1]
        # Gradient through ReLU activation: dReLU/dZ = 1 if Z > 0 else 0
        dZ = dA * (Z > 0).astype(float)
        # Gradients w.r.t. weights, biases, and the previous layer's activations
        dW = A_prev.T @ dZ / m       # ∂L/∂W_l
        db = np.sum(dZ, axis=0) / m  # ∂L/∂b_l
        dA = dZ @ W.T                # ∂L/∂A_{l-1} (chain rule: pass gradient back)
        gradients[f'dW{l}'] = dW
        gradients[f'db{l}'] = db
    return gradients
The key operation is dA = dZ @ W.T — this is the chain rule in matrix form, passing the gradient backward through the weight matrix.
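Shapes make the matrix form concrete. In a hypothetical layer with a (n_in, n_out) weight matrix, the backward matmul maps the incoming gradient of shape (batch, n_out) back to (batch, n_in), matching the previous layer's activations:

```python
import numpy as np

batch, n_in, n_out = 32, 64, 10
W = np.random.randn(n_in, n_out)   # layer weights: (64, 10)
dZ = np.random.randn(batch, n_out) # gradient arriving from the layer above: (32, 10)

dA_prev = dZ @ W.T                 # chain rule in matrix form
print(dA_prev.shape)               # (32, 64) — same shape as the previous layer's activations
```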
Why the Chain Rule Scales
Without backpropagation, you’d compute each gradient numerically: perturb weight w by ε, run the forward pass twice, compute (L(w+ε) - L(w-ε)) / 2ε. That’s O(P) forward passes for P parameters. A 100M-parameter model would require 100M forward passes per gradient update — completely infeasible.
Backpropagation computes all P gradients in a single backward pass. It’s the same computational cost as one forward pass (with a constant factor of ~2–3×). This is why deep learning at scale is possible at all.
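The two approaches agree, which is easy to check on a toy problem. A minimal sketch (single linear layer, squared-error loss — my own example): the analytic gradient from one backward-style computation matches the central-difference estimate, which needs two forward passes per weight:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))  # 4 samples, 3 features
Y = rng.standard_normal((4, 2))  # 2 targets per sample
W = rng.standard_normal((3, 2))  # 6 weights

def loss(W):
    return 0.5 * np.mean((X @ W - Y) ** 2)

# Analytic gradient via the chain rule: one pass covers all 6 weights
dW = X.T @ (X @ W - Y) / Y.size

# Numerical gradient: two forward passes PER weight (O(P) passes total)
num_dW = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_dW[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(dW - num_dW)))  # tiny: the two methods agree
```

This "gradient check" is also a standard debugging technique for hand-written backward passes.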
Vanishing Gradients
The chain rule multiplies gradients across layers. If the gradients in each layer are < 1, multiplying them across 50 layers means the gradient for the first layer is effectively zero.
Layer 50 gradient: 0.5
Layer 49 gradient: 0.5 × 0.5 = 0.25
Layer 48 gradient: 0.25 × 0.5 = 0.125
...
Layer 1 gradient: 0.5^50 ≈ 0.0000000000000009 ← vanished
Why sigmoid/tanh are problematic: Sigmoid’s derivative is at most 0.25. In a 20-layer sigmoid network, the gradient at layer 1 is at most 0.25^20 ≈ 10^-12. The first layers learn nothing.
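The 0.25 bound is easy to verify: sigmoid's derivative is σ′(x) = σ(x)(1 − σ(x)), which peaks at x = 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 10001)
d_sigmoid = sigmoid(x) * (1 - sigmoid(x))  # σ'(x) = σ(x)(1 − σ(x))

print(d_sigmoid.max())  # 0.25, attained at x = 0
print(0.25 ** 20)       # ≈ 9.1e-13 — the best-case gradient factor after 20 sigmoid layers
```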
Fixes:
- ReLU activation: Derivative is exactly 1 for positive inputs — no gradient shrinkage through the activation. This is the primary reason ReLU replaced sigmoid in deep networks (Krizhevsky et al., AlexNet, 2012).
- Residual connections (ResNets): Add a skip connection that bypasses layers: output = F(x) + x. The gradient flows directly through the skip connection unchanged — even if F(x)'s gradient vanishes, the identity path carries the gradient to early layers. This is why ResNets can train 100+ layer networks.
- Batch normalization: Normalizes layer activations, keeping them in a range where gradients don’t vanish or explode.
- Careful weight initialization: He initialization for ReLU sets initial weights to preserve variance across layers, preventing vanishing gradients from the start.
- Gradient clipping: For exploding gradients (especially in RNNs), cap the gradient norm:
if ||grad|| > threshold: grad = grad × (threshold / ||grad||).
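The clipping rule can be sketched in a few lines of NumPy — a simplified version of what PyTorch's clip_grad_norm_ does (global L2 norm over all parameter gradients, names here are my own):

```python
import numpy as np

def clip_grad_norm(grads, threshold):
    """Rescale a list of gradient arrays so their global L2 norm is <= threshold."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# An exploding gradient gets rescaled; its direction is preserved
big = [np.array([30.0, 40.0])]  # norm = 50
clipped, norm = clip_grad_norm(big, threshold=1.0)
print(norm, np.linalg.norm(clipped[0]))  # 50.0 before, norm 1.0 after
```

Note that clipping rescales the whole gradient vector rather than clamping each element, so the update direction is unchanged.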
Exploding Gradients
The opposite problem: gradients multiply to values >> 1, causing weight updates to overshoot catastrophically. More common in RNNs (which unroll deep sequences) than feedforward networks.
Layer 1 gradient: 3.0
Layer 2 gradient: 3.0 × 3.0 = 9.0
...
Layer 10 gradient: 3.0^10 = 59,049 ← NaN or wildly wrong weights
Gradient clipping is the standard fix:
# PyTorch
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Call before optimizer.step()
Computational Graph and Automatic Differentiation
Modern frameworks (PyTorch, JAX) implement backpropagation via automatic differentiation — you define the forward computation and the framework automatically builds a computational graph and computes gradients.
import torch
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 1 # forward pass builds graph: y = x² + 3x + 1
y.backward() # backprop: computes dy/dx = 2x + 3 = 7
print(x.grad) # tensor([7.0])
PyTorch’s autograd traces operations on tensors and builds a dynamic computational graph. .backward() traverses it in reverse, applying the chain rule at each operation. You never write the backward pass manually for real networks — the framework handles it. But understanding what it’s doing is what interviewers test.
Common Interview Questions
Q: Explain backpropagation in one sentence.
Backpropagation uses the chain rule to efficiently compute the gradient of the loss with respect to every weight in one backward pass, enabling gradient descent to train deep networks.
Q: Why did deep networks fail to train before ReLU and residual connections?
Sigmoid activations caused vanishing gradients — the gradient at early layers shrank to near zero during backpropagation, so those layers never learned. ReLU fixed the activation problem; residual connections fixed the depth problem.
Q: Why do we need to store activations during the forward pass?
The backward pass needs intermediate activations to compute gradients (e.g., the ReLU gradient needs to know which inputs were positive). Storing them is the memory cost of backpropagation — this is why large batch sizes and deep models require a lot of GPU memory.
Q: What’s the difference between backpropagation and gradient descent?
Backpropagation computes the gradients. Gradient descent uses the gradients to update the weights. Backpropagation is the algorithm; gradient descent (or Adam, SGD, etc.) is the optimizer that applies the result.
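The separation fits in two lines. A toy sketch (1-D quadratic, gradient computed by hand standing in for backprop):

```python
# Minimize L(w) = (w - 3)^2, keeping the two roles separate
w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2 * (w - 3)  # "backpropagation": compute dL/dw
    w = w - lr * grad   # "gradient descent": use the gradient to update w

print(w)  # converges to 3.0, the minimizer
```

In a real framework the first line is loss.backward() and the second is optimizer.step().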
Summary
Backpropagation computes gradients for every weight in a neural network in one backward pass using the chain rule. The forward pass computes and caches activations; the backward pass uses those cached values to propagate gradients from output to input. Vanishing gradients — caused by multiplying small numbers across many layers — are solved by ReLU activations, residual connections, and batch normalization. Modern frameworks implement automatic differentiation, but understanding the chain rule mechanics is what ML interviewers actually test.
Related ML Topics
- Gradient Descent and Its Variants — backprop computes the gradients; SGD/Adam decide what to do with them
- Bias-Variance Tradeoff — understanding why deep networks overfit when trained too long
- Overfitting and Regularization — L2 weight decay, dropout, and early stopping to stabilize training
See also: Classification Metrics — the loss function backprop minimizes; choosing the right metric shapes which loss (cross-entropy for balanced classification, focal loss for imbalanced).
See also: How Transformer Models Work — backpropagation through transformer attention layers; residual connections that make 96-layer models trainable.
See also: Embeddings and Vector Databases — how backpropagation through transformer encoders produces semantic vector representations, and What is RAG? — applying these embeddings in retrieval-augmented generation systems.