Principle: LaurentMazare tch-rs SGD Optimization
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Optimization |
| Last Updated | 2026-02-08 14:00 GMT |
Overview
Classical gradient descent optimizer that updates parameters proportionally to the negative gradient, optionally with momentum and weight decay.
Description
Stochastic Gradient Descent (SGD) is the simplest optimization algorithm used in neural network training. At each step, every parameter is updated by subtracting its gradient scaled by the learning rate. Optional momentum accumulates a running combination of past gradients, smoothing the updates. SGD with momentum often generalizes better than adaptive optimizers such as Adam when training data is plentiful. An optional weight decay term adds L2 regularization by penalizing large parameter values.
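The update described above can be sketched in plain Rust, without the tch crate; the function and variable names here are illustrative, not part of the tch-rs API:

```rust
// Minimal sketch of one SGD step with momentum and weight decay.
// `velocity` is the per-parameter momentum buffer, carried across steps.
fn sgd_step(
    params: &mut [f64],
    grads: &[f64],
    velocity: &mut [f64],
    lr: f64,
    momentum: f64,
    weight_decay: f64,
) {
    for i in 0..params.len() {
        // Weight decay contributes an L2 penalty gradient: g += wd * p.
        let g = grads[i] + weight_decay * params[i];
        // Momentum accumulates past gradients for smoother updates.
        velocity[i] = momentum * velocity[i] + g;
        params[i] -= lr * velocity[i];
    }
}

fn main() {
    let mut params = vec![1.0, -2.0];
    let mut velocity = vec![0.0, 0.0];
    let grads = vec![0.5, -0.5];
    sgd_step(&mut params, &grads, &mut velocity, 0.1, 0.9, 0.0);
    // On the first step the velocity buffer is zero, so this reduces
    // to params -= lr * grads.
    println!("{:?}", params); // [0.95, -1.95]
}
```

With `momentum = 0.0` and `weight_decay = 0.0` this degenerates to plain gradient descent; the loop body then does exactly `params[i] -= lr * grads[i]`.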
Usage
Use SGD for transfer learning with small classification heads, or when training large models where generalization is more important than convergence speed. Often preferred over Adam for fine-tuning pretrained models.
Theoretical Basis
Without momentum, each parameter θ follows the negative gradient directly:

θ_{t+1} = θ_t − lr · g_t, where g_t = ∇L(θ_t) (with weight decay enabled, g_t also gains a weight_decay · θ_t term)

With momentum (PyTorch convention, which tch-rs mirrors), a velocity buffer v accumulates gradients and the parameter follows the buffer:

v_{t+1} = momentum · v_t + (1 − dampening) · g_t

θ_{t+1} = θ_t − lr · v_{t+1}

With nesterov=true, the parameter update uses g_t + momentum · v_{t+1} in place of v_{t+1}.
Default hyperparameters (matching PyTorch's SGD): momentum=0, dampening=0, weight_decay=0, nesterov=false. The learning rate has no default and must always be supplied.
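The full update rule, including dampening and Nesterov momentum, can be sketched for a single scalar parameter as below. This is a plain-Rust illustration of the PyTorch-style algorithm, not the tch-rs implementation; all names are assumptions:

```rust
// One PyTorch-style SGD update for a scalar parameter `p` with gradient `g`
// and momentum buffer `v`. Returns the new (parameter, buffer) pair.
fn sgd_update(
    p: f64, g: f64, v: f64,
    lr: f64, momentum: f64, dampening: f64, weight_decay: f64,
    nesterov: bool, first_step: bool,
) -> (f64, f64) {
    let g = g + weight_decay * p; // L2 penalty folded into the gradient
    // PyTorch initializes the momentum buffer with the raw gradient on
    // the first step; afterwards it blends with factor (1 - dampening).
    let v = if first_step { g } else { momentum * v + (1.0 - dampening) * g };
    // Nesterov momentum looks ahead along the updated buffer.
    let d = if nesterov { g + momentum * v } else { v };
    (p - lr * d, v)
}

fn main() {
    // With all defaults (momentum=0, dampening=0, weight_decay=0,
    // nesterov=false) this reduces to p - lr * g.
    let (p, _v) = sgd_update(1.0, 0.5, 0.0, 0.1, 0.0, 0.0, 0.0, false, true);
    println!("{p}"); // 0.95
}
```

Note how each default neutralizes one term: momentum=0 empties the buffer's memory, dampening=0 passes the gradient through at full strength, and nesterov=false skips the look-ahead.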