Principle: Fastai Fastbook Backpropagation
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Calculus, Automatic Differentiation |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Backpropagation is the algorithm for computing the gradient of a loss function with respect to every parameter in a neural network by systematically applying the chain rule of calculus from the output layer backward through each preceding layer.
Description
A neural network is a composition of functions: the output of one layer feeds into the next. To train the network with gradient descent, we need the derivative of the final loss with respect to every weight and bias in every layer. Computing each of these derivatives independently, from scratch, would be prohibitively expensive for large networks.
Backpropagation solves this by exploiting the chain rule: rather than differentiating the entire composed function at once, it computes gradients layer by layer in reverse order. Each layer receives the gradient of the loss with respect to its output (from the layer above), and uses it to compute:
- The gradient of the loss with respect to its inputs (passed to the layer below).
- The gradient of the loss with respect to its parameters (used to update weights).
This two-pass structure (forward pass to compute outputs, backward pass to compute gradients) is the foundation of all modern deep learning training.
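For a single linear layer, the two gradients described above can be sketched as follows. This is a minimal numpy illustration, not fastbook code; the variable names (grad_out, grad_inp, etc.) are chosen here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# One linear layer: out = inp @ W + b
inp = rng.standard_normal((4, 3))   # batch of 4 examples, 3 features
W = rng.standard_normal((3, 2))
b = rng.standard_normal(2)
out = inp @ W + b

# Suppose the layer above hands us d(loss)/d(out):
grad_out = rng.standard_normal(out.shape)

# Gradient of the loss w.r.t. the inputs -- passed to the layer below.
grad_inp = grad_out @ W.T           # same shape as inp

# Gradient of the loss w.r.t. the parameters -- used to update them.
grad_W = inp.T @ grad_out           # same shape as W
grad_b = grad_out.sum(axis=0)       # same shape as b
```

Note that the layer never needs to know what the loss function is; it only needs the gradient flowing in from above.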
Usage
Backpropagation is used whenever:
- Training any neural network with gradient-based optimization.
- Implementing custom layers that need to define their own gradient computation.
- Understanding what loss.backward() does under the hood in PyTorch.
- Debugging gradient flow issues (vanishing or exploding gradients).
Theoretical Basis
The Chain Rule
For composed functions y = g(f(x)), the chain rule states:
dy/dx = dy/du * du/dx where u = f(x)
In Leibniz notation: if loss = L(out) and out = f(inp), then:
d(loss)/d(inp) = d(loss)/d(out) * d(out)/d(inp)
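The chain rule can be checked numerically on a small scalar example. The functions below (inp squared, then sine) are arbitrary choices for illustration, not from the book:

```python
import math

# out = f(inp) = inp**2, loss = L(out) = sin(out)
inp = 1.5
out = inp ** 2
dloss_dout = math.cos(out)       # d(loss)/d(out)
dout_dinp = 2 * inp              # d(out)/d(inp)
dloss_dinp = dloss_dout * dout_dinp   # chain rule

# Compare against a one-sided finite difference.
eps = 1e-7
numeric = (math.sin((inp + eps) ** 2) - math.sin(inp ** 2)) / eps
```

The analytic product and the finite-difference estimate agree to several decimal places.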
Forward Pass
For a two-layer network with ReLU:
l1 = inp @ W1 + b1        (linear layer 1)
l2 = relu(l1)             (activation)
out = l2 @ W2 + b2        (linear layer 2)
loss = mse(out, target)   (loss function)
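The forward pass above translates directly into numpy. The shapes below (4 inputs, 8 hidden units, 1 output) are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
inp = rng.standard_normal((5, 4))       # batch of 5 inputs, 4 features
target = rng.standard_normal((5, 1))

W1, b1 = rng.standard_normal((4, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)), np.zeros(1)

l1 = inp @ W1 + b1                      # linear layer 1
l2 = np.maximum(l1, 0)                  # relu
out = l2 @ W2 + b2                      # linear layer 2
loss = ((out - target) ** 2).mean()     # mse
```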
Backward Pass
Working backward from the loss:
Step 1: MSE gradient
d(loss)/d(out) = 2 * (out - target) / n
where n is the number of elements.
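This first gradient can be verified with a finite difference on a single element of out (a small numpy check, not fastbook code):

```python
import numpy as np

rng = np.random.default_rng(1)
out = rng.standard_normal((6, 1))
target = rng.standard_normal((6, 1))

n = out.size
grad_out = 2 * (out - target) / n       # d(loss)/d(out) for mse

# Nudge one element of out and see how much the loss moves.
eps = 1e-6
out2 = out.copy(); out2[0, 0] += eps
numeric = (((out2 - target) ** 2).mean() - ((out - target) ** 2).mean()) / eps
```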
Step 2: Linear layer 2 gradient
Given out = l2 @ W2 + b2:
d(loss)/d(l2) = d(loss)/d(out) @ W2^T
d(loss)/d(W2) = l2^T @ d(loss)/d(out)
d(loss)/d(b2) = sum(d(loss)/d(out), axis=0)
Step 3: ReLU gradient
d(loss)/d(l1) = d(loss)/d(l2) * relu'(l1)
where relu'(x) = 1 if x > 0, else 0
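Because relu' is either 0 or 1, the ReLU backward step simply masks the incoming gradient. A tiny sketch with hand-picked values:

```python
import numpy as np

l1 = np.array([[-2.0, 0.0, 3.0]])        # pre-activation values
grad_l2 = np.array([[0.5, 0.5, 0.5]])    # gradient flowing in from above

# Gradient passes through only where the forward input was positive.
grad_l1 = grad_l2 * (l1 > 0)
```

Only the third position had a positive input, so only its gradient survives.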
Step 4: Linear layer 1 gradient
Given l1 = inp @ W1 + b1:
d(loss)/d(inp) = d(loss)/d(l1) @ W1^T
d(loss)/d(W1) = inp^T @ d(loss)/d(l1)
d(loss)/d(b1) = sum(d(loss)/d(l1), axis=0)
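Putting the four steps together gives a complete manual backward pass. This is a numpy sketch of the derivation above (shapes and names are illustrative), which can be checked against a finite difference on any single weight:

```python
import numpy as np

rng = np.random.default_rng(0)
inp = rng.standard_normal((5, 4))
target = rng.standard_normal((5, 1))
W1, b1 = rng.standard_normal((4, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)), np.zeros(1)

def forward(W1, b1, W2, b2):
    l1 = inp @ W1 + b1
    l2 = np.maximum(l1, 0)
    out = l2 @ W2 + b2
    return l1, l2, out, ((out - target) ** 2).mean()

l1, l2, out, loss = forward(W1, b1, W2, b2)

# Backward pass: the four steps, in reverse order of the forward pass.
grad_out = 2 * (out - target) / out.size   # step 1: mse
grad_l2 = grad_out @ W2.T                  # step 2: linear layer 2
grad_W2 = l2.T @ grad_out
grad_b2 = grad_out.sum(axis=0)
grad_l1 = grad_l2 * (l1 > 0)               # step 3: relu
grad_inp = grad_l1 @ W1.T                  # step 4: linear layer 1
grad_W1 = inp.T @ grad_l1
grad_b1 = grad_l1.sum(axis=0)
```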
Key Properties
- Locality: Each layer only needs the gradient flowing in from above and its own cached forward-pass values.
- Efficiency: The backward pass has roughly the same computational cost as the forward pass.
- Composability: Any differentiable function can be a "layer" as long as it provides a backward method.
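These properties can be made concrete with small layer classes, in the spirit of the fastbook's manual implementation (class and attribute names here are illustrative, not the book's exact code). Each layer caches its forward input, and backward both stores parameter gradients and returns the gradient for the layer below:

```python
import numpy as np

class Lin:
    def __init__(self, w, b):
        self.w, self.b = w, b
    def forward(self, inp):
        self.inp = inp                       # cache for the backward pass
        return inp @ self.w + self.b
    def backward(self, grad_out):
        self.w_g = self.inp.T @ grad_out     # parameter gradients
        self.b_g = grad_out.sum(axis=0)
        return grad_out @ self.w.T           # gradient for the layer below

class Relu:
    def forward(self, inp):
        self.inp = inp
        return np.maximum(inp, 0)
    def backward(self, grad_out):
        return grad_out * (self.inp > 0)

# Compose layers; run forward in order, backward in reverse.
rng = np.random.default_rng(0)
layers = [Lin(rng.standard_normal((3, 2)), np.zeros(2)), Relu()]
x = rng.standard_normal((4, 3))
for layer in layers:
    x = layer.forward(x)
g = np.ones_like(x)                          # pretend upstream gradient
for layer in reversed(layers):
    g = layer.backward(g)
```

Any object exposing this forward/backward pair composes with the rest, which is the locality and composability point above.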
Relationship to Autograd
PyTorch's autograd engine implements backpropagation automatically. When requires_grad=True is set on a tensor, PyTorch records all operations on it in a computational graph. Calling loss.backward() traverses this graph in reverse, computing and storing gradients in each tensor's .grad attribute. The manual implementation in Chapter 17 of the fastbook replicates this behavior to demonstrate the underlying mechanics.
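A minimal sketch of that behavior, comparing autograd's result for W2 against the hand-derived formula from Step 2 (the network and shapes are illustrative):

```python
import torch

torch.manual_seed(0)
inp = torch.randn(5, 4)
target = torch.randn(5, 1)
W1 = torch.randn(4, 8, requires_grad=True)   # operations on these are recorded
W2 = torch.randn(8, 1, requires_grad=True)

out = torch.relu(inp @ W1) @ W2
loss = ((out - target) ** 2).mean()
loss.backward()                              # traverses the graph in reverse

# Recompute d(loss)/d(W2) by hand: l2^T @ d(loss)/d(out).
with torch.no_grad():
    l2 = torch.relu(inp @ W1)
    grad_out = 2 * (out - target) / out.numel()
    manual_W2 = l2.T @ grad_out
```

After backward(), W2.grad holds the same matrix as the manual chain-rule computation.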