# Principle: Pyro PPL Stochastic Optimization
| Metadata | |
|---|---|
| Sources | Adam: A Method for Stochastic Optimization |
| Domains | Optimization, Variational_Inference |
| Last Updated | 2026-02-09 12:00 GMT |
## Overview
Stochastic gradient descent and its adaptive variants optimize the variational parameters in Pyro's stochastic variational inference (SVI) framework, using a dynamic parameter management system that wraps standard PyTorch optimizers.
## Description
In stochastic variational inference, the ELBO objective is maximized (equivalently, the negative ELBO loss is minimized) with respect to the variational parameters of the guide distribution. Since the ELBO is estimated via Monte Carlo sampling, its gradients are noisy, making stochastic optimization methods essential.
Pyro provides the PyroOptim wrapper class that adapts standard PyTorch optimizers for use with Pyro's dynamic parameter creation model. Unlike standard PyTorch training where all parameters are known at model construction time, Pyro models can create new parameters dynamically during execution (e.g., via pyro.param()). PyroOptim handles this by:
- Per-parameter optimizer state management: Each parameter gets its own optimizer instance, created on first encounter
- Dynamic parameter registration: New parameters discovered during SVI steps are automatically registered with fresh optimizer instances
- State persistence: Optimizer state can be saved and loaded, with pending state correctly applied when parameters are first seen
- Gradient clipping support: Optional gradient-norm and gradient-value clipping via clip_args
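The per-parameter bookkeeping can be sketched in a few lines of plain Python. This is an illustrative toy (the class and method names here are hypothetical, not Pyro's internals), showing only the core idea: a dictionary mapping each parameter name to its own optimizer instance, created lazily on first encounter.

```python
class SGD:
    """Toy single-parameter SGD update (stands in for a torch.optim instance)."""

    def __init__(self, lr=0.1):
        self.lr = lr

    def update(self, value, grad):
        return value - self.lr * grad


class PerParamOptim:
    """Minimal sketch of per-parameter optimizer management (hypothetical names).

    Each parameter gets its own optimizer instance, created the first
    time the parameter is seen -- so parameters that appear mid-training
    are registered automatically.
    """

    def __init__(self, optim_factory):
        self.optim_factory = optim_factory  # returns a fresh optimizer
        self.optim_objs = {}                # one optimizer per parameter name

    def step(self, params, grads):
        for name in params:
            if name not in self.optim_objs:
                # dynamic registration on first encounter
                self.optim_objs[name] = self.optim_factory()
            params[name] = self.optim_objs[name].update(params[name], grads[name])


opt = PerParamOptim(lambda: SGD(lr=0.1))
params = {"a": 1.0}
grads = {"a": 2.0}
opt.step(params, grads)          # "a" registered and updated: 1.0 - 0.1*2.0 = 0.8

params["b"] = 5.0                # a new parameter appears mid-training
grads["b"] = 10.0
opt.step(params, grads)          # "b" gets its own fresh optimizer
```

Because every parameter owns a separate optimizer instance, per-parameter state (such as Adam's moment estimates) never needs to be resized when new parameters appear.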
The Adam optimizer is the most commonly used optimizer in Pyro, combining momentum-based gradient averaging with adaptive per-parameter learning rates. It maintains exponential moving averages of both the gradient (first moment) and the squared gradient (second moment), producing updates that are robust to noisy gradients and varying parameter scales.
## Usage
Use a PyroOptim-wrapped optimizer as the optimizer argument to pyro.infer.SVI. The optimizer is responsible for updating all variational parameters discovered during the SVI training loop. The Adam optimizer with a learning rate in the range 0.001 to 0.01 is a good default starting point for most variational inference problems.
Per-parameter learning rates can be specified by passing a callable to optim_args that returns a dictionary of optimizer arguments given the parameter name.
## Theoretical Basis
The Adam update rule for parameter $\theta$ at step $t$:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / \left(\sqrt{\hat{v}_t} + \epsilon\right)
\end{aligned}
$$

where $g_t$ is the stochastic gradient, $\alpha$ is the learning rate, $\beta_1$ and $\beta_2$ are decay rates for the first- and second-moment estimates, and $\epsilon$ is a small constant for numerical stability.
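The update rule translates directly into code. The sketch below implements one scalar Adam step from the standard equations (moment EMAs, bias correction, then the update) and applies it to a toy quadratic objective; the function name and loop are illustrative, not Pyro's implementation.

```python
import math

def adam_step(theta, g, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; state holds (m, v, t)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * g         # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g     # second-moment EMA
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, (m, v, t)

# minimize f(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3)
theta, state = 0.0, (0.0, 0.0, 0)
for _ in range(5000):
    grad = 2 * (theta - 3.0)
    theta, state = adam_step(theta, grad, state, lr=0.05)
```

The division by $\sqrt{\hat{v}_t} + \epsilon$ normalizes the effective step size per parameter, which is why Adam is robust to the varying scales and noisy gradients typical of Monte Carlo ELBO estimates.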