
Principle:Fastai Fastbook Backpropagation

From Leeroopedia


Knowledge Sources
Domains Deep Learning, Calculus, Automatic Differentiation
Last Updated 2026-02-09 17:00 GMT

Overview

Backpropagation is the algorithm for computing the gradient of a loss function with respect to every parameter in a neural network by systematically applying the chain rule of calculus from the output layer backward through each preceding layer.

Description

A neural network is a composition of functions: the output of one layer feeds into the next. To train the network with gradient descent, we need the derivative of the final loss with respect to every weight and bias in every layer. Computing each of these derivatives independently, by expanding the full composed expression separately for every parameter, would be intractable for large networks.

Backpropagation solves this by exploiting the chain rule: rather than differentiating the entire composed function at once, it computes gradients layer by layer in reverse order. Each layer receives the gradient of the loss with respect to its output (from the layer above), and uses it to compute:

  1. The gradient of the loss with respect to its inputs (passed to the layer below).
  2. The gradient of the loss with respect to its parameters (used to update weights).

This two-pass structure (forward pass to compute outputs, backward pass to compute gradients) is the foundation of all modern deep learning training.
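The two-pass structure can be sketched as a layer object that caches its forward-pass values and then consumes the incoming gradient. This is a minimal illustration in NumPy; the class and method names are not from any particular library.

```python
import numpy as np

class Linear:
    """Minimal linear layer illustrating the forward/backward contract."""

    def __init__(self, n_in, n_out, rng):
        self.w = rng.standard_normal((n_in, n_out)) * 0.1
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                          # cache input for the backward pass
        return x @ self.w + self.b

    def backward(self, grad_out):
        # gradient of the loss w.r.t. this layer's parameters
        self.w_grad = self.x.T @ grad_out
        self.b_grad = grad_out.sum(axis=0)
        # gradient of the loss w.r.t. this layer's input, passed downward
        return grad_out @ self.w.T

rng = np.random.default_rng(0)
layer = Linear(3, 2, rng)
x = rng.standard_normal((4, 3))             # batch of 4, 3 features
out = layer.forward(x)
grad_in = layer.backward(np.ones_like(out))
print(out.shape, grad_in.shape)             # (4, 2) (4, 3)
```

Note that `backward` returns a gradient with the same shape as the layer's input, which is exactly what the layer below expects to receive.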

Usage

Backpropagation comes into play whenever you are:

  • Training any neural network with gradient-based optimization.
  • Implementing custom layers that need to define their own gradient computation.
  • Understanding what loss.backward() does under the hood in PyTorch.
  • Debugging gradient flow issues (vanishing or exploding gradients).

Theoretical Basis

The Chain Rule

For composed functions y = g(f(x)), the chain rule states:

dy/dx = dy/du * du/dx    where u = f(x)

In Leibniz notation: if loss = L(out) and out = f(inp), then:

d(loss)/d(inp) = d(loss)/d(out) * d(out)/d(inp)
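The chain rule can be checked numerically on a pair of scalar functions. The functions below are illustrative choices, not from the source text:

```python
# Chain rule check: y = g(f(x)) with f(x) = x^2 and g(u) = 3u.
def f(x):
    return x * x          # u = f(x), du/dx = 2x

def g(u):
    return 3.0 * u        # y = g(u), dy/du = 3

x = 2.0
analytic = 3.0 * (2.0 * x)    # dy/du * du/dx = 6x = 12

# central finite-difference estimate of dy/dx
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)
print(analytic, round(numeric, 6))    # 12.0 12.0
```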

Forward Pass

For a two-layer network with ReLU:

l1  = inp @ W1 + b1        (linear layer 1)
l2  = relu(l1)              (activation)
out = l2 @ W2 + b2          (linear layer 2)
loss = mse(out, target)      (loss function)
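The forward pass above can be written out directly in NumPy. The batch size and layer widths here are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
inp    = rng.standard_normal((16, 10))   # batch of 16, 10 input features
target = rng.standard_normal((16, 1))
W1, b1 = rng.standard_normal((10, 50)) * 0.1, np.zeros(50)
W2, b2 = rng.standard_normal((50, 1)) * 0.1, np.zeros(1)

l1   = inp @ W1 + b1                     # linear layer 1
l2   = np.maximum(l1, 0)                 # ReLU activation
out  = l2 @ W2 + b2                      # linear layer 2
loss = ((out - target) ** 2).mean()      # MSE loss, a scalar
print(loss.shape)                        # ()
```

Each intermediate (`l1`, `l2`, `out`) must be kept around: the backward pass below reads them.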

Backward Pass

Working backward from the loss:

Step 1: MSE gradient

d(loss)/d(out) = 2 * (out - target) / n

where n is the number of elements.
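The MSE gradient formula can be verified against a finite-difference estimate, perturbing one output element at a time (a standard gradient check, shown here as a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
out, target = rng.standard_normal(5), rng.standard_normal(5)
n = out.size

# analytic gradient: d(loss)/d(out) = 2 * (out - target) / n
grad = 2 * (out - target) / n

# numeric gradient via forward differences
eps = 1e-6
base = ((out - target) ** 2).mean()
num = np.zeros(n)
for i in range(n):
    bumped = out.copy()
    bumped[i] += eps
    num[i] = (((bumped - target) ** 2).mean() - base) / eps

print(np.allclose(grad, num, atol=1e-4))   # True
```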

Step 2: Linear layer 2 gradient

Given out = l2 @ W2 + b2:

d(loss)/d(l2) = d(loss)/d(out) @ W2^T
d(loss)/d(W2) = l2^T @ d(loss)/d(out)
d(loss)/d(b2) = sum(d(loss)/d(out), axis=0)

Step 3: ReLU gradient

d(loss)/d(l1) = d(loss)/d(l2) * relu'(l1)
where relu'(x) = 1 if x > 0, else 0

Step 4: Linear layer 1 gradient

Given l1 = inp @ W1 + b1:

d(loss)/d(inp) = d(loss)/d(l1) @ W1^T
d(loss)/d(W1)  = inp^T @ d(loss)/d(l1)
d(loss)/d(b1)  = sum(d(loss)/d(l1), axis=0)
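Steps 1 through 4 can be chained into a complete manual backward pass and checked numerically. The code below is a sketch with illustrative shapes; the finite-difference check perturbs a single entry of W1:

```python
import numpy as np

rng = np.random.default_rng(1)
inp, target = rng.standard_normal((8, 4)), rng.standard_normal((8, 2))
W1, b1 = rng.standard_normal((4, 6)) * 0.1, np.zeros(6)
W2, b2 = rng.standard_normal((6, 2)) * 0.1, np.zeros(2)

def forward(W1):
    l1 = inp @ W1 + b1
    l2 = np.maximum(l1, 0)
    out = l2 @ W2 + b2
    return l1, l2, out, ((out - target) ** 2).mean()

l1, l2, out, loss = forward(W1)

# Step 1: MSE gradient
d_out = 2 * (out - target) / out.size
# Step 2: linear layer 2
d_l2 = d_out @ W2.T
d_W2 = l2.T @ d_out
d_b2 = d_out.sum(axis=0)
# Step 3: ReLU (gradient passes only where the input was positive)
d_l1 = d_l2 * (l1 > 0)
# Step 4: linear layer 1
d_W1 = inp.T @ d_l1
d_b1 = d_l1.sum(axis=0)

# finite-difference check on one parameter, W1[0, 0]
eps = 1e-6
W1_bumped = W1.copy()
W1_bumped[0, 0] += eps
numeric = (forward(W1_bumped)[3] - loss) / eps
print(np.isclose(d_W1[0, 0], numeric, atol=1e-4))   # True
```

Every gradient has the same shape as the quantity it differentiates with respect to, which makes shape errors easy to catch.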

Key Properties

  • Locality: Each layer only needs the gradient flowing in from above and its own cached forward-pass values.
  • Efficiency: The backward pass has roughly the same computational cost as the forward pass.
  • Composability: Any differentiable function can be a "layer" as long as it provides a backward method.

Relationship to Autograd

PyTorch's autograd engine implements backpropagation automatically. When requires_grad=True is set on a tensor, PyTorch records all operations on it in a computational graph. Calling loss.backward() traverses this graph in reverse, computing and storing gradients in each tensor's .grad attribute. The manual implementation in Chapter 17 of the fastbook replicates this behavior to demonstrate the underlying mechanics.
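Assuming PyTorch is installed, the correspondence can be demonstrated directly: the gradient autograd stores in `.grad` matches the manual chain-rule formula for a linear layer.

```python
import torch

torch.manual_seed(0)
inp    = torch.randn(8, 4)
target = torch.randn(8, 2)
W = torch.randn(4, 2, requires_grad=True)   # operations on W are recorded

out  = inp @ W                              # forward pass, graph is built
loss = ((out - target) ** 2).mean()
loss.backward()                             # backward pass fills W.grad

# manual formula: d(loss)/d(W) = inp^T @ d(loss)/d(out)
d_out = 2 * (out.detach() - target) / out.numel()
print(torch.allclose(W.grad, inp.T @ d_out))   # True
```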

Related Pages

Implemented By
