Principle:Fastai Fastbook Stochastic Gradient Descent
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Optimization, Machine Learning |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Stochastic Gradient Descent (SGD) is an iterative optimization algorithm that adjusts model parameters by computing the gradient of a loss function on small random subsets (mini-batches) of training data and stepping in the direction that reduces the loss.
Description
SGD is the foundational optimization algorithm behind virtually all modern deep learning. Rather than computing the gradient over the entire dataset (which is expensive), SGD approximates it by sampling a small mini-batch at each step. This introduces noise into the gradient estimate, but in practice the noise often acts as a beneficial regularizer and the method converges efficiently.
The algorithm proceeds through a repeating cycle of seven steps:
- Initialize parameters with random values.
- Predict outputs by running the model on the current mini-batch.
- Compute loss by comparing predictions to targets.
- Compute gradients of the loss with respect to all parameters (via backpropagation).
- Update parameters by subtracting the gradient scaled by a learning rate.
- Zero the gradients so they do not accumulate across steps.
- Repeat from step 2 until convergence.
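The cycle above can be sketched as a minimal PyTorch training loop. The synthetic linear data, the model shape, and the hyperparameters below are illustrative choices, not from the source:

```python
import torch

torch.manual_seed(0)

# Synthetic data: y = 3x + 2 plus a little noise (illustrative).
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 3 * x + 2 + 0.1 * torch.randn_like(x)

# Step 1: initialize parameters with random values.
w = torch.randn(1, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

lr = 0.1
for step in range(200):
    # Step 2: predict outputs (full batch here, for simplicity).
    preds = x @ w + b
    # Step 3: compute loss by comparing predictions to targets.
    loss = ((preds - y) ** 2).mean()
    # Step 4: compute gradients via backpropagation.
    loss.backward()
    # Step 5: update parameters (no grad tracking during the update itself).
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
    # Step 6: zero the gradients so they do not accumulate.
    w.grad.zero_()
    b.grad.zero_()
# Step 7: the loop repeats until the loss stops improving.
```

After training, `w` and `b` should sit close to the true slope 3 and intercept 2 used to generate the data.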
Usage
Use SGD (or one of its variants like Adam) whenever you need to:
- Train any parameterized model (linear, neural network, etc.) to minimize a loss function.
- Find optimal weights and biases that fit training data.
- Implement a training loop from scratch to understand the fundamentals before using library abstractions.
Theoretical Basis
The Gradient
For a scalar loss function L(w) that depends on parameter vector w, the gradient is the vector of partial derivatives:
grad_L = [dL/dw_1, dL/dw_2, ..., dL/dw_n]
The gradient points in the direction of steepest increase of the loss. Therefore, moving in the opposite direction (subtracting the gradient) decreases the loss.
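This can be checked with PyTorch's autograd on a small illustrative loss (the function and values below are assumptions for the example, not from the source):

```python
import torch

# For L(w) = sum(w_i^2), the gradient is 2w.
w = torch.tensor([3.0, -2.0], requires_grad=True)
loss = (w ** 2).sum()        # L(w) = 9 + 4 = 13
loss.backward()              # fills w.grad with dL/dw = 2w = [6, -4]

# Stepping against the gradient decreases the loss.
with torch.no_grad():
    w_next = w - 0.1 * w.grad
new_loss = (w_next ** 2).sum()   # smaller than the original 13
```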
Parameter Update Rule
The basic SGD update rule is:
w_new = w_old - lr * grad_L(w_old)
Where lr is the learning rate, a small positive scalar (typically between 0.001 and 0.1) that controls the step size.
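A hypothetical one-dimensional example makes the rule concrete; the loss L(w) = (w - 5)^2 and the rate below are illustrative choices:

```python
# Gradient of the illustrative loss L(w) = (w - 5)^2.
def grad_L(w):
    return 2 * (w - 5)

lr = 0.1
w = 0.0
w = w - lr * grad_L(w)   # one update: 0.0 - 0.1 * (-10.0) = 1.0

# Iterating the same rule drives w toward the minimum at w = 5.
for _ in range(50):
    w = w - lr * grad_L(w)
```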
Learning Rate Selection
The learning rate is the most critical hyperparameter:
- Too small: Convergence is very slow, requiring many iterations.
- Too large: The loss may diverge or oscillate wildly, never converging.
- Just right: The loss decreases steadily toward a minimum.
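A toy quadratic makes the three regimes concrete; the loss L(w) = w^2 and the specific rates are illustrative assumptions:

```python
def run(lr, steps=20, w0=1.0):
    # Minimize L(w) = w^2 (gradient = 2w) starting from w0.
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return abs(w)

too_small  = run(0.01)   # barely moves: (1 - 0.02)^20, about 0.67
just_right = run(0.1)    # converges: (1 - 0.2)^20, about 0.01
too_large  = run(1.1)    # diverges: |1 - 2.2|^20 = 1.2^20, about 38
```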
Mini-batch Stochasticity
In full-batch gradient descent, the gradient is computed over all N training samples. In SGD, the gradient is computed over a mini-batch of size B << N:
grad_approx = (1/B) * sum(grad_L_i for i in mini_batch)
This stochastic approximation has lower computational cost per step and often leads to better generalization because the noise helps escape shallow local minima.
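That the mini-batch gradient is a noisy but unbiased estimate of the full-batch gradient can be checked numerically; the quadratic per-sample loss and the sizes below are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
N, B = 10_000, 64
x = torch.randn(N)

# For per-sample loss L_i = (w - x_i)^2, grad_L_i = 2 * (w - x_i),
# so the full-batch gradient is 2 * (w - mean(x)).
w = torch.tensor(1.0)
full_grad = 2 * (w - x.mean())

# A single mini-batch estimate: same formula over B sampled points.
idx = torch.randint(0, N, (B,))
mb_grad = 2 * (w - x[idx].mean())

# Averaging many mini-batch estimates recovers the full-batch gradient.
avg = torch.stack([2 * (w - x[torch.randint(0, N, (B,))].mean())
                   for _ in range(1000)]).mean()
```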
Gradient Zeroing
After each parameter update, gradients must be reset to zero. Without zeroing, PyTorch accumulates gradients across successive backward() calls, which would corrupt the optimization:
for each training step:
    compute gradients     # loss.backward()
    update parameters     # w -= lr * w.grad (inside torch.no_grad())
    zero gradients        # w.grad.zero_(), or set w.grad = None
The requires_grad Mechanism
Parameters must be flagged with requires_grad=True so that the automatic differentiation engine records operations on them and can compute gradients during the backward pass.
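A small illustrative check (values chosen for the example): only tensors flagged with requires_grad=True receive gradients after the backward pass.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)   # tracked parameter
b = torch.tensor(3.0)                       # plain data, not tracked

loss = (a * b) ** 2    # operations on `a` are recorded by autograd
loss.backward()

# a.grad = d/da (a*b)^2 = 2 * a * b^2 = 2 * 2 * 9 = 36; b.grad stays None.
```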