Principle:LaurentMazare Tch rs Sparse Adam Optimization

Knowledge Sources	LaurentMazare_Tch_rs; Adam: A Method for Stochastic Optimization
Domains	Optimization, Deep Learning
Last Updated	2026-02-08 00:00 GMT

Overview

Sparse Adam is a variant of the Adam optimizer that efficiently handles sparse gradients by updating only the moment estimates corresponding to non-zero gradient entries.

Description

The standard Adam optimizer maintains two exponential moving averages for every parameter element: the first moment (mean of gradients) and the second moment (mean of squared gradients). When gradients are sparse (i.e., most entries are zero, as commonly occurs with embedding layers), standard Adam still decays all moment estimates at every step, even for entries with zero gradient. This leads to incorrect moment estimates because the decay falsely treats missing gradient information as a zero gradient signal.
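As a worked illustration of this decay (the step count $k = 10$ is chosen only for this example, and the decay rates $β_{1} = 0.9$ and $β_{2} = 0.999$ are the typical values), consider an element whose gradient is zero for $k$ consecutive steps under standard Adam. Each of those steps multiplies the stored moments by the decay rates, so:

$m_{t + k} = β_{1}^{k} m_{t}, \quad v_{t + k} = β_{2}^{k} v_{t}$

$β_{1}^{10} = 0.9^{10} \approx 0.349, \quad β_{2}^{10} = 0.999^{10} \approx 0.990$

The first moment loses about 65% of its magnitude over those ten steps even though the element received no gradient information.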

Sparse Adam addresses this through the following modifications:

Selective moment updates: Only the moment estimates corresponding to non-zero gradient entries are updated. Entries with zero gradients retain their previous moment values without decay.

Lazy update semantics: For each parameter element, the effective number of updates is tracked. When a gradient entry becomes non-zero after several steps of inactivity, the moment estimates are corrected as if the intermediate zero-gradient steps had not occurred.

Bias correction: Like standard Adam, Sparse Adam applies bias correction to the first and second moment estimates. The correction factors account for the initialization at zero and the geometric decay.

The result is that Sparse Adam applies the standard Adam update rule to non-zero gradient entries while avoiding the pathological decay behavior on inactive entries.

Usage

Sparse Adam is particularly important for training models with embedding layers where each mini-batch only updates a small subset of embedding vectors. It is also relevant in any scenario where gradients are structurally sparse, such as in recommender systems, sparse attention mechanisms, or models with conditional computation paths.

Theoretical Basis

Standard Adam Update:

$m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) g_{t}$

$v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) g_{t}^{2}$

${\hat{m}}_{t} = \frac{m_{t}}{1 - β_{1}^{t}}$

${\hat{v}}_{t} = \frac{v_{t}}{1 - β_{2}^{t}}$

$θ_{t + 1} = θ_{t} - α \frac{{\hat{m}}_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ}$

where $β_{1}, β_{2}$ are decay rates (typically 0.9 and 0.999), $α$ is the learning rate, and $ϵ$ is a small constant for numerical stability.
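The dense update above can be sketched in plain Rust as follows. This is a minimal illustration on `f64` slices using only the standard library; the names `AdamConfig`, `AdamState`, and `adam_step` are hypothetical and are not part of the tch-rs API.

```rust
/// Hyperparameters for Adam. Field names are illustrative, not taken from tch-rs.
#[derive(Clone, Copy)]
pub struct AdamConfig {
    pub lr: f64,    // learning rate (alpha)
    pub beta1: f64, // decay rate of the first moment
    pub beta2: f64, // decay rate of the second moment
    pub eps: f64,   // small constant for numerical stability
}

/// Dense Adam state: one first and one second moment per parameter element.
pub struct AdamState {
    pub m: Vec<f64>,
    pub v: Vec<f64>,
    pub t: i32, // global step count, shared by every element
}

impl AdamState {
    pub fn new(n: usize) -> Self {
        AdamState { m: vec![0.0; n], v: vec![0.0; n], t: 0 }
    }
}

/// Applies one standard (dense) Adam step in place.
pub fn adam_step(params: &mut [f64], grads: &[f64], state: &mut AdamState, cfg: AdamConfig) {
    assert_eq!(params.len(), grads.len());
    assert_eq!(params.len(), state.m.len());
    state.t += 1;
    // Bias-correction denominators depend only on the global step count.
    let bc1 = 1.0 - cfg.beta1.powi(state.t);
    let bc2 = 1.0 - cfg.beta2.powi(state.t);
    for i in 0..params.len() {
        let g = grads[i];
        // Every element is decayed, including those whose gradient is zero.
        state.m[i] = cfg.beta1 * state.m[i] + (1.0 - cfg.beta1) * g;
        state.v[i] = cfg.beta2 * state.v[i] + (1.0 - cfg.beta2) * g * g;
        let m_hat = state.m[i] / bc1;
        let v_hat = state.v[i] / bc2;
        params[i] -= cfg.lr * m_hat / (v_hat.sqrt() + cfg.eps);
    }
}
```

Note that the loop visits every element on every step, so an element with a zero gradient still has its moments multiplied by the decay rates.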

Sparse Adam Modification:

For each parameter element $i$, define the set of active steps $S_{i} = \{ t : g_{t, i} \neq 0 \}$.

The moment updates are applied only when $t \in S_{i}$:

$m_{t, i} = \begin{cases} β_{1} m_{t - 1, i} + (1 - β_{1}) g_{t, i} & \text{if } g_{t, i} \neq 0 \\ m_{t - 1, i} & \text{if } g_{t, i} = 0 \end{cases}$

$v_{t, i} = \begin{cases} β_{2} v_{t - 1, i} + (1 - β_{2}) g_{t, i}^{2} & \text{if } g_{t, i} \neq 0 \\ v_{t - 1, i} & \text{if } g_{t, i} = 0 \end{cases}$
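A minimal Rust sketch of the selective moment update is shown below. It assumes the sparse gradient is supplied as a list of `(index, value)` pairs, which is a representation chosen for illustration and not how tch-rs stores sparse tensors; the function name `update_moments_sparse` is likewise hypothetical.

```rust
/// Updates the first and second moments only at the indices listed in `sparse_grad`.
///
/// `sparse_grad` holds (index, gradient value) pairs. This layout is an assumption
/// made for illustration. Indices must be in bounds, otherwise the function panics.
pub fn update_moments_sparse(
    m: &mut [f64],
    v: &mut [f64],
    sparse_grad: &[(usize, f64)],
    beta1: f64,
    beta2: f64,
) {
    for &(i, g) in sparse_grad {
        if g == 0.0 {
            // An explicitly stored zero is treated like a missing entry.
            continue;
        }
        m[i] = beta1 * m[i] + (1.0 - beta1) * g;
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g;
    }
    // Elements not listed in `sparse_grad` are never visited, so their
    // moments keep their previous values without decay.
}
```

Iterating over the stored entries rather than over all elements also means the cost of a step scales with the number of non-zero gradient entries.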

Bias Correction for Sparse Updates:

The bias correction terms adjust for the actual number of active updates rather than the total step count:

${\hat{m}}_{t, i} = \frac{m_{t, i}}{1 - β_{1}^{| S_{i} \cap \{1, \dots, t\} |}}$

${\hat{v}}_{t, i} = \frac{v_{t, i}}{1 - β_{2}^{| S_{i} \cap \{1, \dots, t\} |}}$

This ensures unbiased estimates even when updates are infrequent for certain parameter elements.
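The per-element bias correction can be sketched in Rust by storing an active-step counter next to each pair of moments. The sketch below is illustrative only: `SparseAdamState` and `sparse_adam_step` are hypothetical names, the `(index, value)` gradient layout is an assumption, and the code further assumes that a parameter is stepped only when its gradient is non-zero, which the text above does not specify.

```rust
/// Sparse Adam state with a per-element count of active (non-zero gradient) steps.
pub struct SparseAdamState {
    pub m: Vec<f64>,
    pub v: Vec<f64>,
    pub active_steps: Vec<i32>, // |S_i ∩ {1, ..., t}| for each element i
}

impl SparseAdamState {
    pub fn new(n: usize) -> Self {
        SparseAdamState { m: vec![0.0; n], v: vec![0.0; n], active_steps: vec![0; n] }
    }
}

/// Applies one sparse Adam step in place.
/// `sparse_grad` holds (index, gradient value) pairs (an assumed layout).
pub fn sparse_adam_step(
    params: &mut [f64],
    sparse_grad: &[(usize, f64)],
    state: &mut SparseAdamState,
    lr: f64,
    beta1: f64,
    beta2: f64,
    eps: f64,
) {
    for &(i, g) in sparse_grad {
        if g == 0.0 {
            continue; // inactive entry: moments, counter, and parameter stay untouched
        }
        state.active_steps[i] += 1;
        let n = state.active_steps[i];
        state.m[i] = beta1 * state.m[i] + (1.0 - beta1) * g;
        state.v[i] = beta2 * state.v[i] + (1.0 - beta2) * g * g;
        // Bias correction uses the element's own active-step count, not a global t.
        let m_hat = state.m[i] / (1.0 - beta1.powi(n));
        let v_hat = state.v[i] / (1.0 - beta2.powi(n));
        // Assumption: the parameter is only stepped on active entries.
        params[i] -= lr * m_hat / (v_hat.sqrt() + eps);
    }
}
```

With this bookkeeping, the first active update of an element has magnitude close to the learning rate regardless of how many global steps have already passed, because the correction exponent is 1 at that point.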

Related Pages

Implementation:LaurentMazare_Tch_rs_Sparse_Adam
