Principle:Fastai Fastbook Stochastic Gradient Descent
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Optimization, Machine Learning |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Stochastic Gradient Descent (SGD) is an iterative optimization algorithm that adjusts model parameters by computing the gradient of a loss function on small random subsets (mini-batches) of training data and stepping in the direction that reduces the loss.
Description
SGD is the foundational optimization algorithm behind virtually all modern deep learning. Rather than computing the gradient over the entire dataset (which is expensive), SGD approximates it by sampling a small mini-batch at each step. This introduces noise into the gradient estimate, but in practice the noise often acts as a beneficial regularizer and the method converges efficiently.
The algorithm proceeds through a repeating cycle of seven steps:
- Initialize parameters with random values.
- Predict outputs by running the model on the current mini-batch.
- Compute loss by comparing predictions to targets.
- Compute gradients of the loss with respect to all parameters (via backpropagation).
- Update parameters by subtracting the gradient scaled by a learning rate.
- Zero the gradients so they do not accumulate across steps.
- Repeat from step 2 until convergence.
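The cycle above can be sketched as a minimal PyTorch training loop. The synthetic linear data, the model shape, and the hyperparameters below are illustrative choices, not from the source:

```python
import torch

torch.manual_seed(0)

# Synthetic data: y = 3x + 2 plus a little noise (illustrative).
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 3 * x + 2 + 0.1 * torch.randn_like(x)

# Step 1: initialize parameters with random values.
w = torch.randn(1, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

lr = 0.1
for step in range(200):
    # Step 2: predict outputs (full batch here, for simplicity).
    preds = x @ w + b
    # Step 3: compute loss by comparing predictions to targets.
    loss = ((preds - y) ** 2).mean()
    # Step 4: compute gradients via backpropagation.
    loss.backward()
    # Step 5: update parameters (no grad tracking during the update itself).
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
    # Step 6: zero the gradients so they do not accumulate.
    w.grad.zero_()
    b.grad.zero_()
# Step 7: the loop repeats until the loss stops improving.
```

After training, `w` and `b` should sit close to the true slope 3 and intercept 2 used to generate the data.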
Usage
Use SGD (or one of its variants like Adam) whenever you need to:
- Train any parameterized model (linear, neural network, etc.) to minimize a loss function.
- Find optimal weights and biases that fit training data.
- Implement a training loop from scratch to understand the fundamentals before using library abstractions.
Theoretical Basis
The Gradient
For a scalar loss function L(w) that depends on parameter vector w, the gradient is the vector of partial derivatives:
grad_L = [dL/dw_1, dL/dw_2, ..., dL/dw_n]
The gradient points in the direction of steepest increase of the loss. Therefore, moving in the opposite direction (subtracting the gradient) decreases the loss.
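This can be checked with PyTorch's autograd on a small illustrative loss (the function and values below are assumptions for the example, not from the source):

```python
import torch

# For L(w) = sum(w_i^2), the gradient is 2w.
w = torch.tensor([3.0, -2.0], requires_grad=True)
loss = (w ** 2).sum()        # L(w) = 9 + 4 = 13
loss.backward()              # fills w.grad with dL/dw = 2w = [6, -4]

# Stepping against the gradient decreases the loss.
with torch.no_grad():
    w_next = w - 0.1 * w.grad
new_loss = (w_next ** 2).sum()   # smaller than the original 13
```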
Parameter Update Rule
The basic SGD update rule is:
w_new = w_old - lr * grad_L(w_old)
Where lr is the learning rate, a small positive scalar (typically between 0.001 and 0.1) that controls the step size.
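A hypothetical one-dimensional example makes the rule concrete; the loss L(w) = (w - 5)^2 and the rate below are illustrative choices:

```python
# Gradient of the illustrative loss L(w) = (w - 5)^2.
def grad_L(w):
    return 2 * (w - 5)

lr = 0.1
w = 0.0
w = w - lr * grad_L(w)   # one update: 0.0 - 0.1 * (-10.0) = 1.0

# Iterating the same rule drives w toward the minimum at w = 5.
for _ in range(50):
    w = w - lr * grad_L(w)
```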
Learning Rate Selection
The learning rate is the most critical hyperparameter:
- Too small: Convergence is very slow, requiring many iterations.
- Too large: The loss may diverge or oscillate wildly, never converging.
- Just right: The loss decreases steadily toward a minimum.
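A toy quadratic makes the three regimes concrete; the loss L(w) = w^2 and the specific rates are illustrative assumptions:

```python
def run(lr, steps=20, w0=1.0):
    # Minimize L(w) = w^2 (gradient = 2w) starting from w0.
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return abs(w)

too_small  = run(0.01)   # barely moves: (1 - 0.02)^20, about 0.67
just_right = run(0.1)    # converges: (1 - 0.2)^20, about 0.01
too_large  = run(1.1)    # diverges: |1 - 2.2|^20 = 1.2^20, about 38
```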
Mini-batch Stochasticity
In full-batch gradient descent, the gradient is computed over all N training samples. In SGD, the gradient is computed over a mini-batch of size B << N:
grad_approx = (1/B) * sum(grad_L_i for i in mini_batch)
This stochastic approximation has lower computational cost per step and often leads to better generalization because the noise helps escape shallow local minima.
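That the mini-batch gradient is a noisy but unbiased estimate of the full-batch gradient can be checked numerically; the quadratic per-sample loss and the sizes below are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
N, B = 10_000, 64
x = torch.randn(N)

# For per-sample loss L_i = (w - x_i)^2, grad_L_i = 2 * (w - x_i),
# so the full-batch gradient is 2 * (w - mean(x)).
w = torch.tensor(1.0)
full_grad = 2 * (w - x.mean())

# A single mini-batch estimate: same formula over B sampled points.
idx = torch.randint(0, N, (B,))
mb_grad = 2 * (w - x[idx].mean())

# Averaging many mini-batch estimates recovers the full-batch gradient.
avg = torch.stack([2 * (w - x[torch.randint(0, N, (B,))].mean())
                   for _ in range(1000)]).mean()
```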
Gradient Zeroing
After each parameter update, gradients must be reset to zero. Without zeroing, PyTorch accumulates gradients across successive backward() calls, which would corrupt the optimization:
for each training step:
    compute gradients     # loss.backward()
    update parameters     # w -= lr * w.grad (inside torch.no_grad())
    zero gradients        # w.grad.zero_(), or set w.grad = None
The requires_grad Mechanism
Parameters must be flagged with requires_grad=True so that the automatic differentiation engine records operations on them and can compute gradients during the backward pass.
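A small illustrative check (values chosen for the example): only tensors flagged with requires_grad=True receive gradients after the backward pass.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)   # tracked parameter
b = torch.tensor(3.0)                       # plain data, not tracked

loss = (a * b) ** 2    # operations on `a` are recorded by autograd
loss.backward()

# a.grad = d/da (a*b)^2 = 2 * a * b^2 = 2 * 2 * 9 = 36; b.grad stays None.
```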