Principle: LaurentMazare tch-rs SGD Optimization
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Optimization |
| Last Updated | 2026-02-08 14:00 GMT |
Overview
Classical gradient descent optimizer that updates parameters proportionally to the negative gradient, optionally with momentum and weight decay.
Description
Stochastic Gradient Descent (SGD) is the simplest optimization algorithm used in neural network training. At each step, every parameter is updated by subtracting its gradient scaled by the learning rate. Optional momentum accumulates a running combination of past gradients, smoothing the updates. SGD with momentum often generalizes better than adaptive optimizers such as Adam when training data is plentiful. An optional weight decay term adds L2 regularization by penalizing large parameter values.
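The update described above can be sketched in plain Rust, without the tch crate; the function and variable names here are illustrative, not part of the tch-rs API:

```rust
// Minimal sketch of one SGD step with momentum and weight decay.
// `velocity` is the per-parameter momentum buffer, carried across steps.
fn sgd_step(
    params: &mut [f64],
    grads: &[f64],
    velocity: &mut [f64],
    lr: f64,
    momentum: f64,
    weight_decay: f64,
) {
    for i in 0..params.len() {
        // Weight decay contributes an L2 penalty gradient: g += wd * p.
        let g = grads[i] + weight_decay * params[i];
        // Momentum accumulates past gradients for smoother updates.
        velocity[i] = momentum * velocity[i] + g;
        params[i] -= lr * velocity[i];
    }
}

fn main() {
    let mut params = vec![1.0, -2.0];
    let mut velocity = vec![0.0, 0.0];
    let grads = vec![0.5, -0.5];
    sgd_step(&mut params, &grads, &mut velocity, 0.1, 0.9, 0.0);
    // On the first step the velocity buffer is zero, so this reduces
    // to params -= lr * grads.
    println!("{:?}", params); // [0.95, -1.95]
}
```

With `momentum = 0.0` and `weight_decay = 0.0` this degenerates to plain gradient descent; the loop body then does exactly `params[i] -= lr * grads[i]`.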
Usage
Use SGD for transfer learning with small classification heads, or when training large models where generalization is more important than convergence speed. Often preferred over Adam for fine-tuning pretrained models.
Theoretical Basis
Without momentum, each parameter θ follows the negative gradient directly:

θ_{t+1} = θ_t − lr · g_t, where g_t = ∇L(θ_t) (with weight decay enabled, g_t also gains a weight_decay · θ_t term)

With momentum (PyTorch convention, which tch-rs mirrors), a velocity buffer v accumulates gradients and the parameter follows the buffer:

v_{t+1} = momentum · v_t + (1 − dampening) · g_t

θ_{t+1} = θ_t − lr · v_{t+1}

With nesterov=true, the parameter update uses g_t + momentum · v_{t+1} in place of v_{t+1}.
Default hyperparameters (matching PyTorch's SGD): momentum=0, dampening=0, weight_decay=0, nesterov=false. The learning rate has no default and must always be supplied.
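The full update rule, including dampening and Nesterov momentum, can be sketched for a single scalar parameter as below. This is a plain-Rust illustration of the PyTorch-style algorithm, not the tch-rs implementation; all names are assumptions:

```rust
// One PyTorch-style SGD update for a scalar parameter `p` with gradient `g`
// and momentum buffer `v`. Returns the new (parameter, buffer) pair.
fn sgd_update(
    p: f64, g: f64, v: f64,
    lr: f64, momentum: f64, dampening: f64, weight_decay: f64,
    nesterov: bool, first_step: bool,
) -> (f64, f64) {
    let g = g + weight_decay * p; // L2 penalty folded into the gradient
    // PyTorch initializes the momentum buffer with the raw gradient on
    // the first step; afterwards it blends with factor (1 - dampening).
    let v = if first_step { g } else { momentum * v + (1.0 - dampening) * g };
    // Nesterov momentum looks ahead along the updated buffer.
    let d = if nesterov { g + momentum * v } else { v };
    (p - lr * d, v)
}

fn main() {
    // With all defaults (momentum=0, dampening=0, weight_decay=0,
    // nesterov=false) this reduces to p - lr * g.
    let (p, _v) = sgd_update(1.0, 0.5, 0.0, 0.1, 0.0, 0.0, 0.0, false, true);
    println!("{p}"); // 0.95
}
```

Note how each default neutralizes one term: momentum=0 empties the buffer's memory, dampening=0 passes the gradient through at full strength, and nesterov=false skips the look-ahead.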