Principle:LaurentMazare Tch rs Sparse Adam Optimization

From Leeroopedia
Revision as of 17:56, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/LaurentMazare_Tch_rs_Sparse_Adam_Optimization.md)


Knowledge Sources
Domains Optimization, Deep Learning
Last Updated 2026-02-08 00:00 GMT

Overview

Sparse Adam is a variant of the Adam optimizer that efficiently handles sparse gradients by updating only the moment estimates corresponding to non-zero gradient entries.

Description

The standard Adam optimizer maintains two exponential moving averages for every parameter element: the first moment (mean of gradients) and the second moment (mean of squared gradients). When gradients are sparse (i.e., most entries are zero, as commonly occurs with embedding layers), standard Adam still decays all moment estimates at every step, even for entries with zero gradient. This leads to incorrect moment estimates because the decay falsely treats missing gradient information as a zero gradient signal.
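The decay effect described above can be illustrated with a small sketch. It tracks a single scalar parameter element through one real gradient followed by ten steps with zero gradient; the function name `adam_moments` and the step counts are illustrative assumptions, not part of any library.

```python
def adam_moments(m, v, g, beta1=0.9, beta2=0.999):
    """One dense Adam moment update for a single scalar parameter element."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    return m, v

# One real gradient, then ten steps where the element is untouched (g = 0).
m, v = adam_moments(0.0, 0.0, g=1.0)
m_after_first, v_after_first = m, v
for _ in range(10):
    m, v = adam_moments(m, v, g=0.0)

# Fraction of each moment that survives the ten inactive steps.
decay_m = m / m_after_first  # approximately beta1 ** 10, about 0.35
decay_v = v / v_after_first  # approximately beta2 ** 10, about 0.99
```

The first moment loses roughly two thirds of its value even though no new gradient information arrived for this element, which is the behavior Sparse Adam is designed to avoid.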

Sparse Adam addresses this with three modifications:

  • Selective moment updates: Only the moment estimates corresponding to non-zero gradient entries are updated. Entries with zero gradients retain their previous moment values without decay.
  • Lazy update semantics: For each parameter element, the effective number of updates is tracked. When a gradient entry becomes non-zero after several steps of inactivity, the moment estimates are updated as if the intermediate zero-gradient steps had not occurred.
  • Bias correction: Like standard Adam, Sparse Adam applies bias correction to the first and second moment estimates. The correction factors account for the initialization at zero and the geometric decay, using each element's count of active updates rather than the global step count.

The result is that Sparse Adam applies the same update rule as standard Adam to entries with non-zero gradients while avoiding the pathological decay behavior on inactive entries.

Usage

Sparse Adam is particularly important for training models with embedding layers where each mini-batch only updates a small subset of embedding vectors. It is also relevant in any scenario where gradients are structurally sparse, such as in recommender systems, sparse attention mechanisms, or models with conditional computation paths.
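To make the embedding case concrete, the following sketch accumulates the gradient of an embedding lookup as a mapping from row index to gradient vector. The helper name `embedding_grad_rows`, the vocabulary size, and the sample values are hypothetical and chosen only for illustration.

```python
def embedding_grad_rows(batch_indices, upstream_grads, dim):
    """Accumulate per-row gradients of an embedding lookup as {row: grad_vector}."""
    rows = {}
    for idx, g in zip(batch_indices, upstream_grads):
        acc = rows.setdefault(idx, [0.0] * dim)
        for k in range(dim):
            acc[k] += g[k]  # repeated indices in a batch sum their gradients
    return rows

vocab_size, dim = 10_000, 2
grads = embedding_grad_rows(
    batch_indices=[7, 42, 7],
    upstream_grads=[[1.0, 0.0], [0.5, 0.5], [1.0, 2.0]],
    dim=dim,
)
# Only the rows that appeared in the batch carry any gradient at all.
active_fraction = len(grads) / vocab_size
```

Here only 2 of 10,000 rows receive a gradient, so a dense optimizer would spend almost all of its moment updates on rows that carry no information for this step.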

Theoretical Basis

Standard Adam Update:

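$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$\theta_{t+1} = \theta_t - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$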

where $\beta_1, \beta_2$ are decay rates (typically 0.9 and 0.999), $\alpha$ is the learning rate, and $\epsilon$ is a small constant for numerical stability.
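The standard update can be sketched in plain Python as follows. This is a minimal illustration over lists of floats, assuming a 1-based step counter `t`; the function name `adam_step` and the default hyperparameters are assumptions for the example rather than any library's API.

```python
import math

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one standard Adam step in place; t is the 1-based global step count."""
    bc1 = 1.0 - beta1 ** t  # bias-correction denominator for the first moment
    bc2 = 1.0 - beta2 ** t  # bias-correction denominator for the second moment
    for i, g in enumerate(grad):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        m_hat = m[i] / bc1
        v_hat = v[i] / bc2
        theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

theta = [1.0, -2.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
adam_step(theta, m, v, grad=[0.5, -3.0], t=1)
```

At the first step the bias-corrected moments reduce to the gradient and its square, so each element moves by almost exactly the learning rate in the direction opposite to its gradient sign.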

Sparse Adam Modification:

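For each parameter element $i$, define the set of active steps $S_i = \{t : g_{t,i} \neq 0\}$.

The moment updates are applied only when $t \in S_i$:

$$m_{t,i} = \begin{cases} \beta_1 m_{t-1,i} + (1 - \beta_1)\, g_{t,i} & \text{if } g_{t,i} \neq 0 \\ m_{t-1,i} & \text{if } g_{t,i} = 0 \end{cases}$$

$$v_{t,i} = \begin{cases} \beta_2 v_{t-1,i} + (1 - \beta_2)\, g_{t,i}^2 & \text{if } g_{t,i} \neq 0 \\ v_{t-1,i} & \text{if } g_{t,i} = 0 \end{cases}$$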
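The case-wise moment update can be sketched as a loop that skips inactive elements. This example tests the gradient value against zero to mirror the equations; a real sparse-tensor implementation would more likely decide activity from which indices are present in the sparse gradient, which is an assumption here and not something the source specifies. The name `sparse_moment_update` is illustrative.

```python
def sparse_moment_update(m, v, grad, beta1=0.9, beta2=0.999):
    """Update moments in place only where the gradient entry is non-zero."""
    for i, g in enumerate(grad):
        if g == 0.0:
            continue  # inactive element: m[i] and v[i] keep their previous values
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g

m = [0.2, 0.2]
v = [0.04, 0.04]
sparse_moment_update(m, v, grad=[0.0, 1.0])
# m[0] and v[0] are untouched; only element 1 moves toward the new gradient.
```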

Bias Correction for Sparse Updates:

The bias correction terms adjust for the actual number of active updates rather than the total step count:

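$$\hat{m}_{t,i} = \frac{m_{t,i}}{1 - \beta_1^{|S_i \cap \{1, \ldots, t\}|}}$$

$$\hat{v}_{t,i} = \frac{v_{t,i}}{1 - \beta_2^{|S_i \cap \{1, \ldots, t\}|}}$$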

This ensures unbiased estimates even when updates are infrequent for certain parameter elements.
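Putting the pieces together, a full step with per-element bias correction might look like the sketch below. The class `SparseAdamSketch` is hypothetical and written in plain Python for clarity. It assumes the sparse gradient arrives as a mapping from element index to non-zero value, and that parameters are updated only for active elements. The source does not state the latter explicitly, but an element with zero active updates would make the bias-correction denominator zero, so inactive elements have to be skipped.

```python
import math

class SparseAdamSketch:
    """Hypothetical pure-Python optimizer following the per-element equations above."""

    def __init__(self, size, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * size
        self.v = [0.0] * size
        self.count = [0] * size  # per-element number of active updates so far

    def step(self, theta, sparse_grad):
        """Update theta in place; sparse_grad maps element index -> non-zero gradient."""
        for i, g in sparse_grad.items():
            self.count[i] += 1
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * g * g
            # Bias correction uses this element's own update count, not the global step.
            m_hat = self.m[i] / (1.0 - self.beta1 ** self.count[i])
            v_hat = self.v[i] / (1.0 - self.beta2 ** self.count[i])
            theta[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

theta = [1.0, 1.0, 1.0]
opt = SparseAdamSketch(size=3)
opt.step(theta, {0: 2.0})   # only element 0 is active
opt.step(theta, {0: 2.0})
opt.step(theta, {2: -4.0})  # first activation of element 2, at global step 3
```

Element 2 is first touched at global step 3, yet its bias correction uses a count of 1, so its first update has the same magnitude as a first step would in standard Adam. Element 1 never appears in a gradient and is left unchanged.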
