Principle: LaurentMazare tch-rs Adam Optimization
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Optimization |
| Last Updated | 2026-02-08 14:00 GMT |
Overview
Adaptive gradient descent optimizer that maintains per-parameter learning rates using first and second moment estimates of gradients.
Description
Adam (Adaptive Moment Estimation) combines the benefits of AdaGrad (per-parameter learning rates) and RMSProp (exponential moving average of squared gradients). It maintains two exponential moving averages: the first moment (the mean of the gradients, with decay rate beta1) and the second moment (the mean of the squared gradients, with decay rate beta2). Because both averages are initialized at zero, they are biased toward zero early in training; bias correction rescales them to compensate. Adam is the default optimizer for most deep learning tasks due to its robustness to learning-rate selection and fast convergence.
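The moment updates described above can be sketched in a few lines of plain Rust. This is a hypothetical standalone illustration of the algorithm, not the tch-rs implementation (tch-rs delegates the update to libtorch):

```rust
// Minimal sketch of Adam for a single parameter slice.
// Hypothetical illustration of the update rule, not the tch-rs code.
struct Adam {
    lr: f64,
    beta1: f64,
    beta2: f64,
    eps: f64,
    m: Vec<f64>, // first-moment estimate (EMA of gradients)
    v: Vec<f64>, // second-moment estimate (EMA of squared gradients)
    t: u32,      // step counter, used for bias correction
}

impl Adam {
    fn new(n: usize, lr: f64) -> Self {
        Adam { lr, beta1: 0.9, beta2: 0.999, eps: 1e-8, m: vec![0.0; n], v: vec![0.0; n], t: 0 }
    }

    fn step(&mut self, params: &mut [f64], grads: &[f64]) {
        self.t += 1;
        for i in 0..params.len() {
            // Exponential moving averages of the gradient and its square.
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * grads[i];
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * grads[i] * grads[i];
            // Bias correction: both EMAs start at zero, so early estimates
            // are rescaled by 1 / (1 - beta^t).
            let m_hat = self.m[i] / (1.0 - self.beta1.powi(self.t as i32));
            let v_hat = self.v[i] / (1.0 - self.beta2.powi(self.t as i32));
            params[i] -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}

// Minimize f(x) = x^2 from x = 1.0; the gradient is 2x.
fn run() -> f64 {
    let mut x = [1.0_f64];
    let mut opt = Adam::new(1, 0.1);
    for _ in 0..200 {
        let g = [2.0 * x[0]];
        opt.step(&mut x, &g);
    }
    x[0]
}

fn main() {
    println!("x after 200 steps: {:.4}", run());
}
```

Note that the effective step size is roughly the learning rate regardless of gradient magnitude, since `m_hat / sqrt(v_hat)` is close to the gradient's sign early in training.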
Usage
Use Adam as the default optimizer for most training tasks. It works well with its default hyperparameters (lr=1e-3, beta1=0.9, beta2=0.999) and requires minimal tuning compared to SGD. Prefer AdamW when decoupled weight decay regularization is needed.
Theoretical Basis
Default hyperparameters (as in the original paper and most implementations): lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8
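With gradient $g_t$, parameters $\theta$, and learning rate $\alpha$, the update rule from Kingma & Ba (2015) is:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat m_t &= m_t / (1-\beta_1^t), \qquad \hat v_t = v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat m_t / (\sqrt{\hat v_t} + \epsilon)
\end{aligned}
$$

The $1/(1-\beta^t)$ factors are the bias corrections: at step $t=1$, for example, $m_1 = (1-\beta_1)\,g_1$, and dividing by $1-\beta_1$ recovers an unbiased estimate $g_1$.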