
Principle:LaurentMazare Tch rs SGD Optimization

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Optimization
Last Updated 2026-02-08 14:00 GMT

Overview

Classical gradient descent optimizer that updates parameters proportionally to the negative gradient, optionally with momentum and weight decay.

Description

Stochastic Gradient Descent (SGD) is the simplest optimization algorithm for neural network training. At each step, parameters are updated by subtracting the gradient scaled by the learning rate. Optional momentum accumulates gradients over time for smoother updates. SGD with momentum often achieves better generalization than adaptive optimizers for tasks with sufficient training data. Weight decay provides L2 regularization.
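The basic update rule can be sketched in a few lines of plain Rust. This is an illustrative sketch of the math only, not the tch-rs API; it assumes the gradient has already been computed elsewhere.

```rust
// One vanilla SGD step: θ ← θ − α·∇L(θ).
// `params` and `grads` are flat parameter/gradient vectors (illustrative names).
fn sgd_step(params: &mut [f64], grads: &[f64], lr: f64) {
    for (p, g) in params.iter_mut().zip(grads) {
        *p -= lr * g;
    }
}

fn main() {
    let mut params = vec![1.0, -2.0];
    let grads = vec![0.5, -0.5];
    sgd_step(&mut params, &grads, 0.1);
    println!("{:?}", params); // [0.95, -1.95]
}
```

With a learning rate of 0.1, the parameter at 1.0 with gradient 0.5 moves to 0.95: each coordinate simply steps against its own gradient.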

Usage

Use SGD for transfer learning with small classification heads, or when training large models where generalization is more important than convergence speed. Often preferred over Adam for fine-tuning pretrained models.

Theoretical Basis

Without momentum: θ_t = θ_{t−1} − α·∇L(θ_{t−1})

With momentum: v_t = μ·v_{t−1} + ∇L(θ_{t−1}), θ_t = θ_{t−1} − α·v_t

Default hyperparameters: momentum=0, dampening=0, weight_decay=0, nesterov=false
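The momentum update rules above can be sketched as a small stateful optimizer in plain Rust. This is a sketch of the math, not the tch-rs implementation; the struct and field names are illustrative, and dampening, weight decay, and Nesterov momentum are omitted (their defaults are 0/false).

```rust
// SGD with momentum, following:
//   v_t = μ·v_{t−1} + ∇L(θ_{t−1})
//   θ_t = θ_{t−1} − α·v_t
struct MomentumSgd {
    lr: f64,           // learning rate α
    momentum: f64,     // μ; the default μ = 0 reduces this to plain SGD
    velocity: Vec<f64>, // v, one entry per parameter, initialized to zero
}

impl MomentumSgd {
    fn new(lr: f64, momentum: f64, n_params: usize) -> Self {
        Self { lr, momentum, velocity: vec![0.0; n_params] }
    }

    fn step(&mut self, params: &mut [f64], grads: &[f64]) {
        for i in 0..params.len() {
            self.velocity[i] = self.momentum * self.velocity[i] + grads[i];
            params[i] -= self.lr * self.velocity[i];
        }
    }
}

fn main() {
    // Two steps with a constant gradient of 1.0 show the velocity building up:
    // v goes 1.0 → 1.9, so the second step is larger than the first.
    let mut opt = MomentumSgd::new(0.1, 0.9, 1);
    let mut params = vec![1.0];
    opt.step(&mut params, &[1.0]); // θ = 1.0 − 0.1·1.0 = 0.9
    opt.step(&mut params, &[1.0]); // θ = 0.9 − 0.1·1.9 = 0.71
    println!("{:?}", params);
}
```

With a constant gradient, the velocity term grows toward g/(1−μ), which is why momentum accelerates progress along consistent gradient directions while averaging out oscillating ones.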

Related Pages

Implemented By
