Backpropagation is the algorithm that makes neural network training possible. Understanding it at the level that interviews require — not just “it computes gradients” but how and why — is essential for ML engineering and research roles. This guide covers the math, the intuition, common interview questions, and implementation.
## The Core Idea in One Sentence
Backpropagation computes the gradient of the loss with respect to every weight in the network by applying the chain rule of calculus layer by layer, moving backward from the output.
## Forward Pass
During the forward pass, data flows from input to output. Each layer applies a linear transformation followed by a non-linear activation:
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)

class Linear:
    def __init__(self, in_dim, out_dim):
        # He initialization for ReLU
        self.W = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)
        self.b = np.zeros(out_dim)

    def forward(self, x):
        self.x = x  # cache for backward pass
        return x @ self.W + self.b  # (batch, out_dim)
```
At the output layer, compute the loss. For classification with cross-entropy:
```python
def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_loss(y_pred_probs, y_true):
    # y_true: one-hot encoded
    n = y_pred_probs.shape[0]
    return -np.sum(y_true * np.log(y_pred_probs + 1e-8)) / n
```
## The Chain Rule
The chain rule says: if L = f(u) and u = g(x), then dL/dx = (dL/du) * (du/dx). Applying it repeatedly differentiates any composition of functions by multiplying local derivatives.
In a neural network with layers A → B → C → Loss, the gradient of the loss with respect to A is:
dL/dA = dL/dC * dC/dB * dB/dA
Backpropagation computes this efficiently by reusing gradients already computed for layers closer to the output. Each layer only needs to compute its local gradient and multiply it by the upstream gradient (the gradient flowing in from the layer above).
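As a sanity check, the chain rule can be verified numerically with a central difference. This is a toy sketch; the functions are made up purely for illustration:

```python
# Toy chain-rule check: L = g(x)^2 with g(x) = 3x + 1,
# so dL/dx = (dL/dg) * (dg/dx) = 2*g(x) * 3. At x = 2: g = 7, dL/dx = 42.
def g(x):
    return 3 * x + 1

def L(x):
    return g(x) ** 2

x = 2.0
analytic = 2 * g(x) * 3                          # chain rule
eps = 1e-6
numeric = (L(x + eps) - L(x - eps)) / (2 * eps)  # central difference
print(analytic, round(numeric, 3))               # both 42.0
```

This same central-difference trick scales up to gradient-checking full layers.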
## Backward Pass
```python
class Linear:
    # ... __init__ and forward as above ...
    def backward(self, d_out):
        # d_out: gradient of loss w.r.t. this layer's output, shape (batch, out_dim)
        # Gradient w.r.t. weights: input transposed times upstream gradient
        self.dW = self.x.T @ d_out   # (in_dim, out_dim)
        self.db = d_out.sum(axis=0)  # (out_dim,)
        # Gradient w.r.t. input (to pass to the previous layer)
        d_input = d_out @ self.W.T   # (batch, in_dim)
        return d_input
```
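These matrix rules are easy to verify with a numerical gradient check. The sketch below is self-contained (it restates the weight-gradient rule inline) and uses a simple scalar loss L = sum(x @ W + b) so the check stays short:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

def loss(W):
    return np.sum(x @ W + b)

d_out = np.ones((4, 2))    # dL/d(output) for L = sum(output)
dW_analytic = x.T @ d_out  # the rule from Linear.backward

# Numerical gradient via central differences, one entry at a time
eps = 1e-6
dW_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(dW_analytic - dW_numeric)))  # tiny (float noise only)
```

Gradient checking like this is the standard way to debug a hand-written backward pass.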
## Full Two-Layer Network Example
```python
class TwoLayerNet:
    def __init__(self, input_dim, hidden_dim, output_dim, lr=0.01):
        self.fc1 = Linear(input_dim, hidden_dim)
        self.fc2 = Linear(hidden_dim, output_dim)
        self.lr = lr

    def forward(self, X, y):
        # Layer 1
        z1 = self.fc1.forward(X)
        a1 = relu(z1)
        self.z1, self.a1 = z1, a1
        # Layer 2
        z2 = self.fc2.forward(a1)
        probs = softmax(z2)
        self.probs = probs
        loss = cross_entropy_loss(probs, y)
        return loss

    def backward(self, y):
        # Gradient of cross-entropy + softmax (combined formula)
        n = y.shape[0]
        d_z2 = (self.probs - y) / n  # (batch, output_dim)
        # Backprop through fc2
        d_a1 = self.fc2.backward(d_z2)
        # Backprop through ReLU
        d_z1 = d_a1 * relu_grad(self.z1)
        # Backprop through fc1
        self.fc1.backward(d_z1)

    def update(self):
        # SGD weight update
        self.fc1.W -= self.lr * self.fc1.dW
        self.fc1.b -= self.lr * self.fc1.db
        self.fc2.W -= self.lr * self.fc2.dW
        self.fc2.b -= self.lr * self.fc2.db
```
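To see the whole loop converge, here is a compact, self-contained training run on a toy XOR-like problem (it restates the same forward/backward math inline so the snippet runs on its own; the data, sizes, and learning rate are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
labels = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR-like, not linearly separable
y = np.eye(2)[labels]                          # one-hot

# He-initialized parameters for a 2 -> 16 -> 2 network
W1 = rng.standard_normal((2, 16)) * np.sqrt(2.0 / 2)
b1 = np.zeros(16)
W2 = rng.standard_normal((16, 2)) * np.sqrt(2.0 / 16)
b2 = np.zeros(2)
lr = 0.3

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

losses = []
for step in range(1000):
    # forward
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)                     # ReLU
    probs = softmax(a1 @ W2 + b2)
    losses.append(-np.mean(np.sum(y * np.log(probs + 1e-8), axis=1)))
    # backward
    d_z2 = (probs - y) / len(X)                # combined softmax + CE gradient
    dW2, db2 = a1.T @ d_z2, d_z2.sum(axis=0)
    d_z1 = (d_z2 @ W2.T) * (z1 > 0)            # through ReLU
    dW1, db1 = X.T @ d_z1, d_z1.sum(axis=0)
    # SGD update
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")  # loss should fall sharply
```

The loss starts near ln(2) ≈ 0.693 (random guessing between two classes) and drops as the hidden layer learns the non-linear boundary.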
## Automatic Differentiation (How PyTorch Does It)
Modern frameworks use automatic differentiation: they build a computation graph during the forward pass and traverse it in reverse during the backward pass, so you rarely need to implement backprop by hand.
```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop (dataloader yields (X, y) mini-batches)
for X, y in dataloader:
    optimizer.zero_grad()        # clear previous gradients
    logits = model(X)            # forward pass
    loss = criterion(logits, y)  # compute loss
    loss.backward()              # backpropagation (computes all gradients)
    optimizer.step()             # update weights using computed gradients
```
## Vanishing and Exploding Gradients
### Vanishing Gradients
With sigmoid or tanh activations, the activation saturates at large-magnitude inputs, so its gradient there is near zero. When these small values are multiplied across many layers, the product shrinks exponentially toward zero — early layers learn extremely slowly.
```python
import numpy as np

# Sigmoid gradient at saturation
def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

# At x=5:  gradient ≈ 0.0066 — nearly zero
# At x=10: gradient ≈ 0.00005 — effectively zero
# After 10 sigmoid layers: 0.0066^10 ≈ 2e-22
print(sigmoid_grad(5))  # 0.0066
```
Solutions:
- Use ReLU (gradient = 1 for positive values) — no saturation in the positive range
- Batch Normalization — normalizes layer inputs, keeps activations in a useful range
- Residual connections (ResNet) — gradients flow directly through skip connections
- LSTM/GRU gating for RNNs — gates control gradient flow over time
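To see why ReLU helps, compare the product of activation derivatives across depth. This sketch assumes the same pre-activation value (2.0) at every layer purely to illustrate the scaling:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

x, depth = 2.0, 20
sig_product = sigmoid_grad(x) ** depth  # ~0.105^20: vanishes
relu_product = 1.0 ** depth             # ReLU derivative is 1 for x > 0
print(sig_product, relu_product)
```

Even at a mild input of 2.0 (far from full saturation), twenty sigmoid layers multiply the gradient down to numerical dust, while the ReLU path passes it through unchanged.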
### Exploding Gradients
If weights are initialized too large or the network is very deep, gradients can grow exponentially. Symptoms: loss becomes NaN, weights diverge.
Solutions: gradient clipping (`torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`), careful weight initialization (Xavier/He), and batch normalization.
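Global-norm clipping is simple enough to sketch by hand. This numpy version mirrors the idea behind `clip_grad_norm_` (the gradient values are hypothetical):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Scale ALL gradients by one factor so their combined norm <= max_norm;
    # this preserves the direction of the overall update.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(clipped_norm)  # ≈ 1.0
```

Scaling everything by a single factor (rather than clipping each tensor independently) is what keeps the update direction intact.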
## Weight Initialization Matters
| Activation | Initialization | Formula |
|---|---|---|
| Sigmoid / tanh | Xavier (Glorot) | Uniform in [-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out))] |
| ReLU | He (Kaiming) | Normal with std = sqrt(2 / n_in) |
| Linear output | Small uniform | Uniform in [-0.1, 0.1] |
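The two main schemes from the table translate directly into code (a minimal sketch; function names are my own):

```python
import numpy as np

def xavier_uniform(n_in, n_out, seed=0):
    # Uniform in [-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))]
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out, seed=0):
    # Normal with std = sqrt(2 / n_in)
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

W = he_normal(512, 256)
print(W.std())  # close to sqrt(2/512) ≈ 0.0625
```

Both schemes aim to keep activation variance roughly constant from layer to layer, which is exactly what prevents the gradient products above from vanishing or exploding at initialization.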
## Interview Questions and Answers
### Why do we need non-linear activations?
Without non-linearities, stacking linear layers is equivalent to a single linear layer. The composition of linear functions is linear. Non-linear activations (ReLU, sigmoid, tanh) allow the network to learn complex, non-linear decision boundaries — universal function approximation requires non-linearity.
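The collapse is easy to demonstrate numerically: two stacked linear layers compute exactly the same function as one merged linear layer (sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8)); b1 = rng.standard_normal(8)
W2 = rng.standard_normal((8, 3)); b2 = rng.standard_normal(3)
x = rng.standard_normal((5, 4))

two_layers = (x @ W1 + b1) @ W2 + b2
# Equivalent single layer: W = W1 @ W2, b = b1 @ W2 + b2
one_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_layers, one_layer))  # True
```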
### What is the difference between gradient descent, SGD, and Adam?
Gradient descent computes gradients on the entire dataset before updating — impractical for large datasets. Stochastic Gradient Descent (SGD) updates weights after each sample (or mini-batch), introducing noise that can help escape local minima. Adam (Adaptive Moment Estimation) maintains a running average of past gradients (momentum) and past squared gradients (adaptive learning rate) — it adapts the learning rate per parameter and typically converges faster with less tuning than vanilla SGD.
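The Adam update rule is short enough to write out. This is a sketch of a single step on one parameter array, with the standard default hyperparameters; the gradient values are made up:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2   # 2nd moment (squared grads)
    m_hat = m / (1 - beta1 ** t)           # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
g = np.array([0.5, -0.1])                  # hypothetical gradient
w, m, v = adam_step(w, g, m, v, t=1)
print(w)  # on step 1 each parameter moves ~lr against its gradient sign
```

Note the per-parameter scaling by `sqrt(v_hat)`: parameters with consistently large gradients get smaller effective steps, which is the "adaptive" part of Adam.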
### What happens if you set the learning rate too high or too low?
Too high: weights overshoot the minimum, and the loss oscillates or diverges (NaN). Too low: training converges very slowly and can take too long to be practical; it may also settle into a poor local minimum. Learning rate warmup (start low, increase to the target rate over the first few epochs) stabilizes early training when gradients are large and noisy.
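A warmup schedule is just a function of the step count. A minimal sketch of linear warmup to a constant rate (the numbers are illustrative):

```python
def lr_at_step(step, base_lr=1e-3, warmup_steps=1000):
    # Ramp linearly from ~0 to base_lr over warmup_steps, then hold.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

print(lr_at_step(0), lr_at_step(999), lr_at_step(5000))
# tiny at first, base_lr by the end of warmup, constant after
```

In practice this is usually combined with a decay phase (cosine or step decay) after the warmup ends.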
### What is a computational graph and why does it matter for backprop?
A computational graph is a directed acyclic graph where nodes represent operations and edges represent data (tensors) flowing between them. PyTorch builds this graph dynamically during the forward pass. During backward(), it traverses the graph in reverse, calling each operation's backward function with the upstream gradient. This design allows arbitrary model architectures to be differentiated automatically — the user never needs to implement gradients manually.
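The mechanism can be captured in a toy scalar autograd. This is a deliberately minimal sketch of the same idea PyTorch applies to tensors, not the real API:

```python
class Var:
    def __init__(self, value):
        self.value, self.grad, self.parents = value, 0.0, ()

    def __mul__(self, other):
        out = Var(self.value * other.value)
        # Record local gradients: d(out)/d(self) = other.value, and vice versa
        out.parents = ((self, other.value), (other, self.value))
        return out

    def __add__(self, other):
        out = Var(self.value + other.value)
        out.parents = ((self, 1.0), (other, 1.0))
        return out

    def backward(self, upstream=1.0):
        # Accumulate the upstream gradient, then push it to parents,
        # multiplied by each recorded local gradient (the chain rule).
        self.grad += upstream
        for parent, local_grad in self.parents:
            parent.backward(upstream * local_grad)

x, y = Var(3.0), Var(4.0)
z = x * y + x          # z = x*y + x, so dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # 5.0 3.0
```

Each operation stores only its local gradients during the forward pass; `backward()` then threads the chain rule through the recorded graph, accumulating into `.grad` exactly as PyTorch does.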
## Frequently Asked Questions
### What is backpropagation and how does it work?
Backpropagation is the algorithm for computing gradients in a neural network. It has two phases: (1) Forward pass — data flows from input through each layer, computing activations, until the output layer produces predictions. A loss function compares predictions to ground truth. (2) Backward pass — starting from the loss, the chain rule of calculus computes the gradient of the loss with respect to each weight, layer by layer, moving backward through the network. Each layer computes its local gradient and multiplies by the upstream gradient flowing in from the layer ahead. These gradients are then used by an optimizer (SGD, Adam) to update weights in the direction that reduces the loss.
### What is the vanishing gradient problem and how is it solved?
Vanishing gradients occur when gradients shrink exponentially as they propagate backward through deep networks. With sigmoid or tanh activations, the gradient of the activation is near zero at saturated inputs (large positive or negative values). Multiplied across many layers, gradients become effectively zero, preventing early layers from learning. Solutions: (1) ReLU activation — gradient is 1 for positive inputs, not saturated; (2) Residual connections (ResNet) — skip connections allow gradients to flow directly without passing through activation functions; (3) Batch normalization — normalizes layer inputs to prevent saturation; (4) LSTM/GRU gating for recurrent networks — learned gates control gradient flow over time.
### What is the difference between gradient descent, SGD, and Adam?
Gradient descent computes the exact gradient on the entire dataset before each weight update — prohibitively slow for large datasets. Stochastic Gradient Descent (SGD) approximates the gradient using a single sample or mini-batch per update, introducing noise that can help escape local minima and dramatically speeds up training. Adam (Adaptive Moment Estimation) improves on SGD by maintaining exponential moving averages of past gradients (momentum) and past squared gradients (adaptive per-parameter learning rates). Adam typically converges faster with less learning rate tuning and is the default optimizer for most deep learning tasks. SGD with momentum and careful learning rate schedules can match or beat Adam on some tasks like image classification.