
Principle:Pyro ppl Pyro Gradient Clipping

From Leeroopedia


Metadata
Sources Adam: A Method for Stochastic Optimization
Domains Optimization, Deep_Learning
Last Updated 2026-02-09 12:00 GMT

Overview

Gradient clipping is a stabilization technique for stochastic optimization that prevents exploding gradients by capping gradient values or norms. Pyro's ClippedAdam optimizer combines gradient clipping with learning rate decay for robust training of deep generative models.

Description

When training deep generative models or probabilistic models with complex likelihood functions, gradient magnitudes can become extremely large (exploding gradients), causing optimization instability and divergence. Gradient clipping addresses this by constraining gradient values before they are used to update parameters.

Pyro's ClippedAdam optimizer combines three stabilization techniques in a single optimizer:

  • Gradient clipping: Each gradient element is clamped to the range [-clip_norm, clip_norm], preventing any single gradient component from causing an excessively large parameter update. This is an element-wise (value) clipping strategy, as opposed to norm-based clipping which rescales the entire gradient vector.
  • Learning rate decay: The learning rate is multiplied by a decay factor lrd at each optimization step, implementing the schedule α_t = α_0 · lrd^t. A decreasing step size is useful for achieving convergence in stochastic optimization, since it reduces oscillation around the optimum.
  • Adaptive learning rates: The underlying Adam algorithm provides per-parameter adaptive learning rates based on first and second moment estimates of the gradients.
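To make the distinction in the first bullet concrete, the following sketch contrasts element-wise (value) clipping, the strategy ClippedAdam uses, with norm-based clipping, which rescales the whole gradient vector while preserving its direction. This is a plain-Python illustration, not the Pyro/PyTorch implementation.

```python
import math

def clamp_elementwise(grad, c):
    # Element-wise (value) clipping: each component is clamped to [-c, c].
    return [max(-c, min(c, g)) for g in grad]

def clip_by_norm(grad, c):
    # Norm-based clipping: rescale the whole vector only if its L2 norm
    # exceeds c, preserving the gradient's direction.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > c:
        return [g * c / norm for g in grad]
    return list(grad)

grad = [3.0, -40.0, 0.5]
print(clamp_elementwise(grad, 10.0))  # [3.0, -10.0, 0.5]
print(clip_by_norm(grad, 10.0))       # all components rescaled by 10/||grad||
```

Note that element-wise clipping changes the gradient's direction whenever any single component is clamped, whereas norm-based clipping only changes its length.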

This combination is particularly important for:

  • Deep generative models (VAEs, deep Bayesian networks) where gradients flow through many layers
  • Models with score function estimators (REINFORCE) which tend to have high-variance gradients
  • Complex likelihood functions that produce steep loss landscapes

Usage

Use ClippedAdam when standard Adam optimization leads to training instability, NaN losses, or divergence. It is a drop-in replacement for Adam in Pyro's SVI framework. The key additional hyperparameters are:

  • clip_norm (default 10.0): Maximum absolute value for any gradient element
  • lrd (default 1.0): Learning rate decay multiplier per step (set to less than 1.0 to enable decay)

Theoretical Basis

The ClippedAdam update rule modifies standard Adam by first clipping gradients:

g̃_t = clamp(g_t, -c, c)

where c is the clip_norm hyperparameter and clamp is applied element-wise. The clipped gradient is then used in the standard Adam update with a decaying learning rate:

α_t = α_0 · lrd^t

The combination of clipping and decay helps ensure that the optimization trajectory remains bounded and converges, even in the presence of pathological gradient behavior.
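The update rule above can be sketched as a single optimization step in plain Python. This is a hedged illustration of the described mechanics (clip, then Adam moments with bias correction, then a decayed step), not the actual pyro.optim.ClippedAdam code; default values mirror common Adam settings.

```python
def clipped_adam_step(params, grads, state, lr, clip_norm=10.0, lrd=1.0,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    state["t"] += 1
    t = state["t"]
    lr_t = lr * lrd ** t  # decayed learning rate: alpha_t = alpha_0 * lrd^t
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        g = max(-clip_norm, min(clip_norm, g))  # element-wise clipping
        # Standard Adam first/second moment estimates with bias correction.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        new_params.append(p - lr_t * m_hat / (v_hat ** 0.5 + eps))
    return new_params

state = {"t": 0, "m": [0.0], "v": [0.0]}
# A pathological gradient of 100 is first clipped to 10, so the step
# stays bounded by roughly the (decayed) learning rate.
print(clipped_adam_step([0.0], [100.0], state, lr=0.1))
```

Because Adam normalizes each step by the second-moment estimate, the per-step displacement is already bounded by about lr_t; clipping additionally tames the moment estimates themselves when a single gradient spike would otherwise corrupt them for many subsequent steps.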
