# Implementation: Fastai Fastbook Backpropagation (Manual)
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Calculus, Automatic Differentiation |
| Last Updated | 2026-02-09 17:00 GMT |
## Overview
Concrete pattern for implementing backpropagation from scratch using manual chain-rule gradient functions and class-based layers, as demonstrated in fastbook Chapter 17.
## Description
This implementation builds backpropagation without relying on PyTorch's autograd. It defines standalone gradient functions (mse_grad, relu_grad, lin_grad) and then refactors them into class-based layers (Relu, Lin, Mse) that each implement __call__ (forward) and backward methods. The gradients are stored directly on tensor attributes (.g), mirroring PyTorch's .grad attribute. The implementation is validated by comparing its computed gradients against PyTorch autograd.
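Since the whole scheme hinges on stashing gradients as a plain attribute, here is a minimal sketch of that trick (the tensor values are illustrative, not from the source):

```python
import torch

# torch.Tensor accepts arbitrary Python attributes, so a gradient can be
# stashed directly on the tensor as .g, mirroring PyTorch's built-in .grad.
t = torch.ones(3)
t.g = torch.zeros(3)  # manually attached gradient buffer
```

This works because `torch.Tensor` instances allow attribute assignment like ordinary Python objects; nothing in autograd is involved.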
## Usage
Use this pattern when:
- Learning how backpropagation works at a fundamental level.
- Building custom autograd systems or understanding PyTorch's autograd internals.
- Debugging gradient issues by comparing manual gradients with `loss.backward()` results.
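For the debugging use case, a finite-difference estimate is another independent check on a manual gradient. A sketch (helper names and values are my own, not from the source; double precision keeps the numerical estimate stable):

```python
import torch

def mse(out, targ):
    return (out.squeeze() - targ).pow(2).mean()

torch.manual_seed(0)
out = torch.randn(5, 1, dtype=torch.float64)
targ = torch.randn(5, dtype=torch.float64)

# Manual gradient of MSE with respect to its input (same formula as mse_grad)
g_manual = 2. * (out.squeeze() - targ).unsqueeze(-1) / out.shape[0]

# Central-difference estimate for one element of the input
eps = 1e-6
out_p, out_m = out.clone(), out.clone()
out_p[0, 0] += eps
out_m[0, 0] -= eps
g_num = (mse(out_p, targ) - mse(out_m, targ)) / (2 * eps)

assert torch.allclose(g_manual[0, 0], g_num, atol=1e-4)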
## Code Reference
### Source Location
- Repository: fastbook
- File: 17_foundations.ipynb (Chapter 17), "Backpropagation" section
### Signature
Functional approach:

```python
def mse_grad(inp, targ):
    """Gradient of MSE loss with respect to its input."""
    inp.g = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]

def relu_grad(inp, out):
    """Gradient of ReLU with respect to its input."""
    inp.g = (inp > 0).float() * out.g

def lin_grad(inp, out, w, b):
    """Gradients of a linear layer with respect to input, weights, and bias."""
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)

def forward_and_backward(inp, targ):
    # Forward pass
    l1 = inp @ w1 + b1
    l2 = relu(l1)
    out = l2 @ w2 + b2
    loss = mse(out, targ)
    # Backward pass (reverse order)
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)
```
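As a sanity check on the constants in these functions, the underlying chain-rule algebra (my restatement, consistent with the code above) is: for predictions $o_i$ and targets $y_i$ over a batch of size $N$,

$$L = \frac{1}{N}\sum_{i=1}^{N}(o_i - y_i)^2 \quad\Longrightarrow\quad \frac{\partial L}{\partial o_i} = \frac{2}{N}\,(o_i - y_i)$$

and for a linear layer $\mathrm{out} = \mathrm{inp}\,W + b$,

$$\frac{\partial L}{\partial \mathrm{inp}} = \frac{\partial L}{\partial \mathrm{out}}\,W^{\top},\qquad \frac{\partial L}{\partial W} = \mathrm{inp}^{\top}\,\frac{\partial L}{\partial \mathrm{out}},\qquad \frac{\partial L}{\partial b} = \sum_{\text{batch}} \frac{\partial L}{\partial \mathrm{out}}$$

which matches `inp.g = out.g @ w.t()`, `w.g = inp.t() @ out.g`, and `b.g = out.g.sum(0)` term by term.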
Class-based approach:

```python
class Relu:
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.)
        return self.out

    def backward(self):
        self.inp.g = (self.inp > 0).float() * self.out.g

class Lin:
    def __init__(self, w, b):
        self.w, self.b = w, b

    def __call__(self, inp):
        self.inp = inp
        self.out = inp @ self.w + self.b
        return self.out

    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        self.w.g = self.inp.t() @ self.out.g
        self.b.g = self.out.g.sum(0)

class Mse:
    def __call__(self, inp, targ):
        self.inp = inp
        self.targ = targ
        self.out = (inp.squeeze() - targ).pow(2).mean()
        return self.out

    def backward(self):
        x = (self.inp.squeeze() - self.targ).unsqueeze(-1)
        self.inp.g = 2. * x / self.targ.shape[0]

class Model:
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1, b1), Relu(), Lin(w2, b2)]
        self.loss = Mse()

    def __call__(self, x, targ):
        for l in self.layers:
            x = l(x)
        return self.loss(x, targ)

    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers):
            l.backward()
```
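A single layer can be spot-checked in isolation. A sketch (the test tensor and upstream gradient are my own choices, not from the source) comparing the `Relu` layer's manual backward against autograd:

```python
import torch

class Relu:
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.)
        return self.out

    def backward(self):
        self.inp.g = (self.inp > 0).float() * self.out.g

# Manual backward on a small tensor (avoid 0., where the subgradient is ambiguous)
x = torch.tensor([-1., 0.5, 2.])
relu = Relu()
relu(x)
relu.out.g = torch.ones(3)  # pretend the upstream gradient is all ones
relu.backward()

# Autograd reference: sum() also produces an upstream gradient of ones
x_ag = x.clone().requires_grad_(True)
x_ag.clamp_min(0.).sum().backward()
assert torch.allclose(x.g, x_ag.grad)
```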
### Import

```python
import torch
```
## I/O Contract
### Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| inp | Tensor, shape (batch_size, n_features) | Yes | Input batch tensor (e.g., flattened images) |
| targ | Tensor, shape (batch_size,) | Yes | Target values for the batch |
| w1, b1 | Tensor | Yes | Weights and bias for the first linear layer |
| w2, b2 | Tensor | Yes | Weights and bias for the second linear layer |
### Outputs

| Name | Type | Description |
|---|---|---|
| w1.g | Tensor, same shape as w1 | Gradient of loss with respect to first-layer weights |
| b1.g | Tensor, same shape as b1 | Gradient of loss with respect to first-layer bias |
| w2.g | Tensor, same shape as w2 | Gradient of loss with respect to second-layer weights |
| b2.g | Tensor, same shape as b2 | Gradient of loss with respect to second-layer bias |
| inp.g | Tensor, same shape as inp | Gradient of loss with respect to input (for further backprop if needed) |
## Usage Examples
### Basic Usage: Functional Approach

```python
import torch

# Initialize parameters (simplified Kaiming-style init: scale by 1/sqrt(fan_in))
n_inp = 784
n_hidden = 50
n_out = 1
w1 = torch.randn(n_inp, n_hidden) / n_inp**0.5
b1 = torch.zeros(n_hidden)
w2 = torch.randn(n_hidden, n_out) / n_hidden**0.5
b2 = torch.zeros(n_out)

def relu(x): return x.clamp_min(0.)
def mse(output, targ): return (output.squeeze() - targ).pow(2).mean()

# Run forward and backward
def forward_and_backward(inp, targ):
    l1 = inp @ w1 + b1
    l2 = relu(l1)
    out = l2 @ w2 + b2
    loss = mse(out, targ)
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)

# Call it
forward_and_backward(x_train, y_train)

# Access gradients
print(w1.g.shape)  # torch.Size([784, 50])
print(w2.g.shape)  # torch.Size([50, 1])
```
### Validating Against PyTorch Autograd

```python
# Enable autograd tracking on copies of the parameters
w1_ag = w1.clone().requires_grad_(True)
b1_ag = b1.clone().requires_grad_(True)
w2_ag = w2.clone().requires_grad_(True)
b2_ag = b2.clone().requires_grad_(True)

# Forward pass with autograd
l1_ag = x_train @ w1_ag + b1_ag
l2_ag = l1_ag.clamp_min(0.)
out_ag = l2_ag @ w2_ag + b2_ag
loss_ag = (out_ag.squeeze() - y_train).pow(2).mean()
loss_ag.backward()

# Compare: manual gradients vs autograd
def test_near(a, b):
    assert torch.allclose(a, b, rtol=1e-3, atol=1e-5), "Gradients do not match!"

test_near(w1.g, w1_ag.grad)
test_near(b1.g, b1_ag.grad)
test_near(w2.g, w2_ag.grad)
test_near(b2.g, b2_ag.grad)
print("All gradients match PyTorch autograd!")
```
### Class-Based Model

```python
# Create model
model = Model(w1, b1, w2, b2)

# Forward pass (computes loss)
loss = model(x_train, y_train)

# Backward pass (computes all gradients)
model.backward()

# Gradients are now available
print(w1.g.shape)  # torch.Size([784, 50])
print(b2.g.shape)  # torch.Size([1])
```
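Once the `.g` gradients exist, a parameter update is a plain in-place subtraction. A minimal sketch (the learning rate, shapes, and stand-in gradient are my own illustrative choices, not from the source):

```python
import torch

# Single SGD step using a manually stored .g gradient.
w1 = torch.zeros(2, 2)
w1.g = torch.ones(2, 2)  # stand-in for a gradient computed by backward()
lr = 0.5
w1 -= lr * w1.g          # same pattern applies to b1, w2, b2
```

Because nothing here is tracked by autograd, no `torch.no_grad()` guard is needed around the update.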