# Implementation: Fastai Fastbook Backpropagation (Manual)
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Calculus, Automatic Differentiation |
| Last Updated | 2026-02-09 17:00 GMT |
## Overview
Concrete pattern for implementing backpropagation from scratch using manual chain-rule gradient functions and class-based layers, as demonstrated in fastbook Chapter 17.
## Description
This implementation builds backpropagation without relying on PyTorch's autograd. It defines standalone gradient functions (mse_grad, relu_grad, lin_grad) and then refactors them into class-based layers (Relu, Lin, Mse) that each implement __call__ (forward) and backward methods. The gradients are stored directly on tensor attributes (.g), mirroring PyTorch's .grad attribute. The implementation is validated by comparing its computed gradients against PyTorch autograd.
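Since the whole scheme hinges on stashing gradients as a plain attribute, here is a minimal sketch of that trick (the tensor values are illustrative, not from the source):

```python
import torch

# torch.Tensor accepts arbitrary Python attributes, so a gradient can be
# stashed directly on the tensor as .g, mirroring PyTorch's built-in .grad.
t = torch.ones(3)
t.g = torch.zeros(3)  # manually attached gradient buffer
```

This works because `torch.Tensor` instances allow attribute assignment like ordinary Python objects; nothing in autograd is involved.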
## Usage
Use this pattern when:
- Learning how backpropagation works at a fundamental level.
- Building custom autograd systems or understanding PyTorch's autograd internals.
- Debugging gradient issues by comparing manual gradients with `loss.backward()` results.
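For the debugging use case, a finite-difference estimate is another independent check on a manual gradient. A sketch (helper names and values are my own, not from the source; double precision keeps the numerical estimate stable):

```python
import torch

def mse(out, targ):
    return (out.squeeze() - targ).pow(2).mean()

torch.manual_seed(0)
out = torch.randn(5, 1, dtype=torch.float64)
targ = torch.randn(5, dtype=torch.float64)

# Manual gradient of MSE with respect to its input (same formula as mse_grad)
g_manual = 2. * (out.squeeze() - targ).unsqueeze(-1) / out.shape[0]

# Central-difference estimate for one element of the input
eps = 1e-6
out_p, out_m = out.clone(), out.clone()
out_p[0, 0] += eps
out_m[0, 0] -= eps
g_num = (mse(out_p, targ) - mse(out_m, targ)) / (2 * eps)

assert torch.allclose(g_manual[0, 0], g_num, atol=1e-4)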
## Code Reference
### Source Location
- Repository: fastbook
- File: 17_foundations.ipynb (Chapter 17), "Backpropagation" section
### Signature
Functional approach:

```python
def mse_grad(inp, targ):
    """Gradient of MSE loss with respect to its input."""
    inp.g = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]

def relu_grad(inp, out):
    """Gradient of ReLU with respect to its input."""
    inp.g = (inp > 0).float() * out.g

def lin_grad(inp, out, w, b):
    """Gradients of a linear layer with respect to input, weights, and bias."""
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)

def forward_and_backward(inp, targ):
    # Forward pass
    l1 = inp @ w1 + b1
    l2 = relu(l1)
    out = l2 @ w2 + b2
    loss = mse(out, targ)
    # Backward pass (reverse order)
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)
```
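As a sanity check on the constants in these functions, the underlying chain-rule algebra (my restatement, consistent with the code above) is: for predictions $o_i$ and targets $y_i$ over a batch of size $N$,

$$L = \frac{1}{N}\sum_{i=1}^{N}(o_i - y_i)^2 \quad\Longrightarrow\quad \frac{\partial L}{\partial o_i} = \frac{2}{N}\,(o_i - y_i)$$

and for a linear layer $\mathrm{out} = \mathrm{inp}\,W + b$,

$$\frac{\partial L}{\partial \mathrm{inp}} = \frac{\partial L}{\partial \mathrm{out}}\,W^{\top},\qquad \frac{\partial L}{\partial W} = \mathrm{inp}^{\top}\,\frac{\partial L}{\partial \mathrm{out}},\qquad \frac{\partial L}{\partial b} = \sum_{\text{batch}} \frac{\partial L}{\partial \mathrm{out}}$$

which matches `inp.g = out.g @ w.t()`, `w.g = inp.t() @ out.g`, and `b.g = out.g.sum(0)` term by term.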
Class-based approach:

```python
class Relu:
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.)
        return self.out

    def backward(self):
        self.inp.g = (self.inp > 0).float() * self.out.g

class Lin:
    def __init__(self, w, b):
        self.w, self.b = w, b

    def __call__(self, inp):
        self.inp = inp
        self.out = inp @ self.w + self.b
        return self.out

    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        self.w.g = self.inp.t() @ self.out.g
        self.b.g = self.out.g.sum(0)

class Mse:
    def __call__(self, inp, targ):
        self.inp = inp
        self.targ = targ
        self.out = (inp.squeeze() - targ).pow(2).mean()
        return self.out

    def backward(self):
        x = (self.inp.squeeze() - self.targ).unsqueeze(-1)
        self.inp.g = 2. * x / self.targ.shape[0]

class Model:
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1, b1), Relu(), Lin(w2, b2)]
        self.loss = Mse()

    def __call__(self, x, targ):
        for l in self.layers:
            x = l(x)
        return self.loss(x, targ)

    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers):
            l.backward()
```
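A single layer can be spot-checked in isolation. A sketch (the test tensor and upstream gradient are my own choices, not from the source) comparing the `Relu` layer's manual backward against autograd:

```python
import torch

class Relu:
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.)
        return self.out

    def backward(self):
        self.inp.g = (self.inp > 0).float() * self.out.g

# Manual backward on a small tensor (avoid 0., where the subgradient is ambiguous)
x = torch.tensor([-1., 0.5, 2.])
relu = Relu()
relu(x)
relu.out.g = torch.ones(3)  # pretend the upstream gradient is all ones
relu.backward()

# Autograd reference: sum() also produces an upstream gradient of ones
x_ag = x.clone().requires_grad_(True)
x_ag.clamp_min(0.).sum().backward()
assert torch.allclose(x.g, x_ag.grad)
```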
### Import

```python
import torch
```
## I/O Contract
### Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| inp | Tensor, shape (batch_size, n_features) | Yes | Input batch tensor (e.g., flattened images) |
| targ | Tensor, shape (batch_size,) | Yes | Target values for the batch |
| w1, b1 | Tensor | Yes | Weights and bias for the first linear layer |
| w2, b2 | Tensor | Yes | Weights and bias for the second linear layer |
### Outputs

| Name | Type | Description |
|---|---|---|
| w1.g | Tensor, same shape as w1 | Gradient of loss with respect to first-layer weights |
| b1.g | Tensor, same shape as b1 | Gradient of loss with respect to first-layer bias |
| w2.g | Tensor, same shape as w2 | Gradient of loss with respect to second-layer weights |
| b2.g | Tensor, same shape as b2 | Gradient of loss with respect to second-layer bias |
| inp.g | Tensor, same shape as inp | Gradient of loss with respect to input (for further backprop if needed) |
## Usage Examples
### Basic Usage: Functional Approach

```python
import torch

# Initialize parameters (simplified Kaiming-style init: scale by 1/sqrt(fan_in))
n_inp = 784
n_hidden = 50
n_out = 1
w1 = torch.randn(n_inp, n_hidden) / n_inp**0.5
b1 = torch.zeros(n_hidden)
w2 = torch.randn(n_hidden, n_out) / n_hidden**0.5
b2 = torch.zeros(n_out)

def relu(x): return x.clamp_min(0.)
def mse(output, targ): return (output.squeeze() - targ).pow(2).mean()

# Run forward and backward
def forward_and_backward(inp, targ):
    l1 = inp @ w1 + b1
    l2 = relu(l1)
    out = l2 @ w2 + b2
    loss = mse(out, targ)
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)

# Call it
forward_and_backward(x_train, y_train)

# Access gradients
print(w1.g.shape)  # torch.Size([784, 50])
print(w2.g.shape)  # torch.Size([50, 1])
```
### Validating Against PyTorch Autograd

```python
# Enable autograd tracking on copies of the parameters
w1_ag = w1.clone().requires_grad_(True)
b1_ag = b1.clone().requires_grad_(True)
w2_ag = w2.clone().requires_grad_(True)
b2_ag = b2.clone().requires_grad_(True)

# Forward pass with autograd
l1_ag = x_train @ w1_ag + b1_ag
l2_ag = l1_ag.clamp_min(0.)
out_ag = l2_ag @ w2_ag + b2_ag
loss_ag = (out_ag.squeeze() - y_train).pow(2).mean()
loss_ag.backward()

# Compare: manual gradients vs autograd
def test_near(a, b):
    assert torch.allclose(a, b, rtol=1e-3, atol=1e-5), "Gradients do not match!"

test_near(w1.g, w1_ag.grad)
test_near(b1.g, b1_ag.grad)
test_near(w2.g, w2_ag.grad)
test_near(b2.g, b2_ag.grad)
print("All gradients match PyTorch autograd!")
```
### Class-Based Model

```python
# Create model
model = Model(w1, b1, w2, b2)

# Forward pass (computes loss)
loss = model(x_train, y_train)

# Backward pass (computes all gradients)
model.backward()

# Gradients are now available
print(w1.g.shape)  # torch.Size([784, 50])
print(b2.g.shape)  # torch.Size([1])
```
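Once the `.g` gradients exist, a parameter update is a plain in-place subtraction. A minimal sketch (the learning rate, shapes, and stand-in gradient are my own illustrative choices, not from the source):

```python
import torch

# Single SGD step using a manually stored .g gradient.
w1 = torch.zeros(2, 2)
w1.g = torch.ones(2, 2)  # stand-in for a gradient computed by backward()
lr = 0.5
w1 -= lr * w1.g          # same pattern applies to b1, w2, b2
```

Because nothing here is tracked by autograd, no `torch.no_grad()` guard is needed around the update.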