# Principle: Pyro PPL Stochastic Optimization
| Metadata | |
|---|---|
| Sources | Adam: A Method for Stochastic Optimization |
| Domains | Optimization, Variational_Inference |
| Last Updated | 2026-02-09 12:00 GMT |
## Overview
Stochastic gradient descent and its adaptive variants optimize the variational parameters in Pyro's stochastic variational inference (SVI) framework, using a dynamic parameter management system that wraps standard PyTorch optimizers.
## Description
In stochastic variational inference, the ELBO objective is maximized (equivalently, the negative ELBO loss is minimized) with respect to the variational parameters of the guide distribution. Since the ELBO is estimated via Monte Carlo sampling, its gradients are noisy, making stochastic optimization methods essential.
Pyro provides the PyroOptim wrapper class that adapts standard PyTorch optimizers for use with Pyro's dynamic parameter creation model. Unlike standard PyTorch training where all parameters are known at model construction time, Pyro models can create new parameters dynamically during execution (e.g., via pyro.param()). PyroOptim handles this by:
- Per-parameter optimizer state management: Each parameter gets its own optimizer instance, created on first encounter
- Dynamic parameter registration: New parameters discovered during SVI steps are automatically registered with fresh optimizer instances
- State persistence: Optimizer state can be saved and loaded, with pending state correctly applied when parameters are first seen
- Gradient clipping support: Optional gradient-norm and gradient-value clipping via clip_args
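The per-parameter bookkeeping can be sketched in a few lines of plain Python. This is an illustrative toy (the class and method names here are hypothetical, not Pyro's internals), showing only the core idea: a dictionary mapping each parameter name to its own optimizer instance, created lazily on first encounter.

```python
class SGD:
    """Toy single-parameter SGD update (stands in for a torch.optim instance)."""

    def __init__(self, lr=0.1):
        self.lr = lr

    def update(self, value, grad):
        return value - self.lr * grad


class PerParamOptim:
    """Minimal sketch of per-parameter optimizer management (hypothetical names).

    Each parameter gets its own optimizer instance, created the first
    time the parameter is seen -- so parameters that appear mid-training
    are registered automatically.
    """

    def __init__(self, optim_factory):
        self.optim_factory = optim_factory  # returns a fresh optimizer
        self.optim_objs = {}                # one optimizer per parameter name

    def step(self, params, grads):
        for name in params:
            if name not in self.optim_objs:
                # dynamic registration on first encounter
                self.optim_objs[name] = self.optim_factory()
            params[name] = self.optim_objs[name].update(params[name], grads[name])


opt = PerParamOptim(lambda: SGD(lr=0.1))
params = {"a": 1.0}
grads = {"a": 2.0}
opt.step(params, grads)          # "a" registered and updated: 1.0 - 0.1*2.0 = 0.8

params["b"] = 5.0                # a new parameter appears mid-training
grads["b"] = 10.0
opt.step(params, grads)          # "b" gets its own fresh optimizer
```

Because every parameter owns a separate optimizer instance, per-parameter state (such as Adam's moment estimates) never needs to be resized when new parameters appear.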
The Adam optimizer is the most commonly used optimizer in Pyro, combining momentum-based gradient averaging with adaptive per-parameter learning rates. It maintains exponential moving averages of both the gradient (first moment) and the squared gradient (second moment), producing updates that are robust to noisy gradients and varying parameter scales.
## Usage
Use a PyroOptim-wrapped optimizer as the optimizer argument to pyro.infer.SVI. The optimizer is responsible for updating all variational parameters discovered during the SVI training loop. The Adam optimizer with a learning rate in the range 0.001 to 0.01 is a good default starting point for most variational inference problems.
Per-parameter learning rates can be specified by passing a callable to optim_args that returns a dictionary of optimizer arguments given the parameter name.
## Theoretical Basis
The Adam update rule for parameter $\theta$ at step $t$:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / \left(\sqrt{\hat{v}_t} + \epsilon\right)
\end{aligned}
$$

where $g_t$ is the stochastic gradient, $\alpha$ is the learning rate, $\beta_1$ and $\beta_2$ are decay rates for the first- and second-moment estimates, and $\epsilon$ is a small constant for numerical stability.
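The update rule translates directly into code. The sketch below implements one scalar Adam step from the standard equations (moment EMAs, bias correction, then the update) and applies it to a toy quadratic objective; the function name and loop are illustrative, not Pyro's implementation.

```python
import math

def adam_step(theta, g, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; state holds (m, v, t)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * g         # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g     # second-moment EMA
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, (m, v, t)

# minimize f(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3)
theta, state = 0.0, (0.0, 0.0, 0)
for _ in range(5000):
    grad = 2 * (theta - 3.0)
    theta, state = adam_step(theta, grad, state, lr=0.05)
```

The division by $\sqrt{\hat{v}_t} + \epsilon$ normalizes the effective step size per parameter, which is why Adam is robust to the varying scales and noisy gradients typical of Monte Carlo ELBO estimates.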