Understanding how neural networks learn is fundamental to any ML engineering interview. This guide covers the training process from forward pass to backpropagation, optimization algorithms, and regularization techniques that prevent overfitting — with the mathematical intuition and practical knowledge expected at ML-focused companies.
Forward Pass and Loss Computation
Training a neural network is an optimization problem: minimize a loss function that measures how wrong the model's predictions are.

Forward pass: input data flows through the network layers. Each layer computes output = activation(weights * input + bias), and the final layer produces predictions.

Loss computation: compare predictions with ground-truth labels. Common loss functions: (1) Cross-entropy loss (classification): L = -sum(y_true * log(y_pred)). Measures the divergence between predicted probabilities and true labels. For binary classification: L = -(y*log(p) + (1-y)*log(1-p)). (2) Mean squared error (regression): L = (1/N) * sum((y_pred - y_true)^2). Measures the average squared difference between predictions and targets. (3) Contrastive loss (embedding learning): pulls similar examples closer together and pushes dissimilar examples apart in embedding space.

The loss is a single scalar value summarizing "how wrong is the model on this batch of data." The goal of training: adjust the weights to minimize this loss.
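As a concrete illustration, here is a minimal forward pass plus cross-entropy computation in plain Python. The network shape, the weight values, and the helper names (dense, softmax, cross_entropy) are invented for this sketch, not taken from any framework:

```python
import math

def dense(x, W, b, activation=None):
    """One fully connected layer: activation(W @ x + b)."""
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    if activation == "relu":
        return [max(0.0, v) for v in z]
    return z

def softmax(z):
    m = max(z)                                # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, true_idx):
    """L = -log p(true class), the classification loss from the text."""
    return -math.log(probs[true_idx])

# Toy network: 3 inputs -> 2 hidden units (ReLU) -> 2 classes (softmax)
x = [1.0, 2.0, -1.0]
W1 = [[0.1, 0.2, 0.0], [-0.3, 0.4, 0.1]]
b1 = [0.0, 0.1]
W2 = [[0.5, -0.2], [-0.1, 0.3]]
b2 = [0.0, 0.0]

h = dense(x, W1, b1, activation="relu")   # forward pass, layer 1
logits = dense(h, W2, b2)                 # forward pass, layer 2
probs = softmax(logits)                   # predicted class probabilities
loss = cross_entropy(probs, true_idx=0)   # single scalar: "how wrong?"
```

Subtracting the max inside softmax avoids overflow in exp for large logits; the result is mathematically unchanged.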
Backpropagation
Backpropagation computes the gradient of the loss with respect to every weight in the network. This answers the question: for each weight, how much does the loss change if we nudge this weight?

The chain rule decomposes the gradient through layers: dL/dw = dL/dy * dy/dz * dz/dw, where y is the activation output, z is the pre-activation, and w is the weight. Starting from the loss at the output, gradients flow backward through each layer. Each layer computes: (1) the gradient of the loss with respect to its output (received from the layer above); (2) the gradient with respect to its weights (used to update the weights); (3) the gradient with respect to its input (passed to the layer below). This recursive process computes all gradients in a single backward pass, at roughly the same computational cost as the forward pass, so backpropagation is not expensive relative to the forward pass.

Automatic differentiation (PyTorch autograd, TensorFlow GradientTape): modern frameworks build a computational graph during the forward pass and automatically compute gradients during the backward pass. The engineer defines the forward computation; the framework handles backpropagation.

Gradient issues: (1) Vanishing gradients: gradients become tiny in deep networks with sigmoid/tanh activations. Solutions: ReLU activations, residual connections, batch normalization. (2) Exploding gradients: gradients become enormous. Solution: gradient clipping (cap the gradient norm).
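The chain rule can be worked through by hand on a toy single-neuron model (sigmoid activation and squared-error loss; all names and values here are invented for illustration) and then sanity-checked against a finite-difference gradient:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x, t):
    z = w * x + b               # pre-activation
    y = sigmoid(z)              # activation output
    return z, y, (y - t) ** 2   # squared-error loss

def backward(w, b, x, t):
    """Chain rule: dL/dw = dL/dy * dy/dz * dz/dw (and dz/db = 1)."""
    z, y, loss = forward(w, b, x, t)
    dL_dy = 2.0 * (y - t)       # derivative of (y - t)^2 w.r.t. y
    dy_dz = y * (1.0 - y)       # sigmoid'(z), expressed in terms of y
    return dL_dy * dy_dz * x, dL_dy * dy_dz   # (dL/dw, dL/db)

w, b, x, t = 0.5, -0.2, 1.5, 1.0
dw, db = backward(w, b, x, t)

# Sanity check against numerical (finite-difference) gradients.
eps = 1e-6
num_dw = (forward(w + eps, b, x, t)[2] - forward(w - eps, b, x, t)[2]) / (2 * eps)
num_db = (forward(w, b + eps, x, t)[2] - forward(w, b - eps, x, t)[2]) / (2 * eps)
```

Comparing analytic gradients against finite differences is also a standard way to debug a hand-written backward pass.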
Optimization Algorithms
After computing gradients, an optimizer updates the weights.

(1) SGD (stochastic gradient descent): w = w - learning_rate * gradient. Simple, but sensitive to the learning rate, and it can oscillate in ravines (loss surfaces with very different curvature across dimensions). (2) SGD with momentum: maintain a velocity term that accumulates past gradients: v = momentum * v - lr * gradient; w = w + v. This dampens oscillations and accelerates convergence in directions where the gradient is consistent. (3) Adam (Adaptive Moment Estimation): the default optimizer for most deep learning. It maintains a first moment (running mean of gradients, like momentum) and a second moment (running mean of squared gradients, for per-parameter learning-rate scaling): w = w - lr * m_hat / (sqrt(v_hat) + epsilon). Parameters with large gradients get smaller effective learning rates (stabilizing training); parameters with small gradients get larger effective learning rates (encouraging exploration). (4) AdamW: Adam with decoupled weight decay; the standard in modern LLM training.

Learning rate schedule: (1) Warmup: start with a very small learning rate and increase it linearly over the first N steps. This prevents large, poorly directed weight updates early in training, when the loss landscape is steep. (2) Cosine decay: after warmup, decay the learning rate along a cosine curve to near zero. Warmup followed by cosine decay is the standard schedule for LLM pre-training.
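A single-parameter sketch of the Adam update and the warmup-plus-cosine schedule, assuming the common default hyperparameters (beta1=0.9, beta2=0.999); the function names adam_step and lr_schedule, the toy loss, and the step counts are invented for illustration. The m_hat/v_hat terms are the bias-corrected moments (correcting for the zero initialization of m and v):

```python
import math

def lr_schedule(step, warmup_steps=100, total_steps=2000, peak_lr=0.05):
    """Linear warmup for warmup_steps, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter w at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad         # first moment: mean of grads
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment: mean of grad^2
    m_hat = m / (1 - beta1 ** t)               # bias correction for the
    v_hat = v / (1 - beta2 ** t)               # zero-initialized moments
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimize the toy loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=lr_schedule(t))
```

Note that the decaying schedule is doing real work here: with a constant learning rate, Adam's effective step size stays near lr even close to the minimum, so the parameter keeps oscillating instead of settling.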
Batch Normalization
Batch normalization normalizes the activations within a layer across the batch dimension. For each feature: compute the mean and variance across the batch, normalize to zero mean and unit variance, then scale and shift with learned parameters gamma and beta: BN(x) = gamma * (x - mean) / sqrt(var + epsilon) + beta.

Why it helps: (1) Reduces internal covariate shift: as weights in earlier layers change during training, the distribution of inputs to later layers shifts. BN stabilizes these distributions, allowing higher learning rates and faster convergence. (2) Regularization effect: the batch statistics introduce noise (each batch has slightly different mean and variance), acting as a mild regularizer. During inference, batch statistics are replaced by running averages accumulated during training, so there is no dependence on the current batch.

Layer normalization (used in Transformers) normalizes across the feature dimension instead of the batch dimension: LN(x) = gamma * (x - mean_features) / sqrt(var_features + epsilon) + beta. It is independent of batch size, so it works with batch size 1 and with variable-length sequences. This is why Transformers use layer norm, not batch norm.
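Both normalizations can be sketched in a few lines of plain Python; the helper names are mine, and the running averages that real BN layers track for inference are omitted here for brevity:

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature across the batch dimension.
    batch: list of examples, each example a list of feature values."""
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in batch]
    for j in range(n_features):
        col = [row[j] for row in batch]             # one feature, whole batch
        mu = mean(col)
        var = mean([(v - mu) ** 2 for v in col])
        for i, row in enumerate(batch):
            x_hat = (row[j] - mu) / math.sqrt(var + eps)
            out[i][j] = gamma[j] * x_hat + beta[j]  # learned scale and shift
    return out

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize across the feature dimension of a single example,
    so it works even with batch size 1."""
    mu = mean(x)
    var = mean([(v - mu) ** 2 for v in x])
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]
```

Note that batch_norm needs the whole batch to compute its statistics, while layer_norm touches only one example; this is exactly the batch-size independence the text describes.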
Regularization: Preventing Overfitting
Overfitting: the model memorizes the training data but fails to generalize to unseen data. The telltale symptom: training loss keeps decreasing while validation loss increases.

Regularization techniques: (1) Dropout: during training, randomly set a fraction p (typically 0.1-0.5) of activations to zero. This prevents co-adaptation: neurons cannot rely on specific other neurons being active. At inference, all neurons are active but outputs are scaled by (1-p). Dropout is the most widely used regularizer in deep learning. (2) Weight decay (L2 regularization): add lambda * sum(w^2) to the loss. Penalizes large weights, encouraging the model to use smaller, more distributed weights. With Adam, use AdamW (decoupled weight decay) rather than L2 in the loss; the two are not equivalent under adaptive optimizers. (3) Early stopping: monitor validation loss during training, stop when it starts increasing (the model is beginning to overfit), and restore the weights from the epoch with the lowest validation loss. Simple and effective. (4) Data augmentation: artificially expand the training set with random crops, flips, and rotations for images, or random masking and synonym replacement for text. More training data is the best regularizer. (5) Label smoothing: instead of hard labels (0 or 1), use soft labels (e.g., 0.1 and 0.9). Prevents the model from becoming overconfident. Used in Transformer training.
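Two of these techniques are easy to sketch in plain Python. The dropout function below follows the classic formulation described above, scaling by (1-p) at inference (most frameworks instead use the equivalent "inverted" dropout, scaling by 1/(1-p) during training). The early-stopping helper operates on a recorded list of validation losses; all names here are invented for illustration:

```python
import random

def dropout(activations, p, training, rng=random):
    """Classic dropout: zero each activation with probability p during
    training; at inference keep all activations but scale by (1 - p)."""
    if training:
        return [0.0 if rng.random() < p else a for a in activations]
    return [a * (1 - p) for a in activations]

def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch with the lowest validation loss, stopping the scan
    once the loss has failed to improve for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # overfitting detected: restore weights from best_epoch
    return best_epoch
```

In a real training loop you would checkpoint the model weights at each new best epoch so they can be restored when early stopping triggers.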