Principle:Pyro ppl Pyro Advanced Optimizers

Knowledge Sources	Adam: A Method for Stochastic Optimization; Adaptive Subgradient Methods; Multi-Objective Optimization in Variational Inference
Domains	Optimization, Stochastic Gradient Methods, Distributed Computing
Last Updated	2026-02-09 09:00 GMT

Overview

Specialized optimizers for probabilistic programming extend standard gradient descent methods with features such as per-parameter adaptation, frequency-domain updates, distributed training support, learning rate scheduling, and multi-objective optimization.

Description

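Variational inference and other gradient-based inference methods in probabilistic programming have optimization characteristics that differ from standard deep learning, such as noisy Monte Carlo gradient estimates and parameters with very different gradient scales. Pyro provides several specialized optimizers that target these characteristics: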

AdagradRMSProp: A hybrid optimizer combining Adagrad's ability to handle sparse gradients with RMSProp's ability to handle non-stationary objectives. In probabilistic programming, different parameters may have very different gradient scales (e.g., location parameters vs. scale parameters), and some parameters may receive gradients infrequently (sparse models). This hybrid adapts the learning rate per-parameter based on both the historical gradient magnitude and recent gradient statistics.

DCTAdam: An optimizer that operates in the Discrete Cosine Transform (DCT) domain. For time-series models and other problems with temporal or spatial structure, optimizing in the frequency domain can be more efficient because:

Low-frequency components (smooth trends) and high-frequency components (rapid fluctuations) can be given different effective learning rates.
For smooth signals, the DCT tends to decorrelate the gradient components, which can improve the conditioning of the optimization problem.
The frequency-domain representation is a natural fit for models with Fourier or wavelet structure.

HorovodOptimizer: A wrapper that enables distributed training across multiple GPUs or machines using the Horovod framework. It handles gradient averaging across workers, making it possible to scale variational inference to large datasets by distributing the computation.

PyroLRScheduler: Learning rate scheduling adapted for Pyro's training loops. It wraps PyTorch learning rate schedulers and integrates them with Pyro's SVI (Stochastic Variational Inference) training loop, allowing decay schedules, warm restarts, and other scheduling strategies.

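MultiOptimizer: Allows different optimizers (with different learning rates and hyperparameters) for different parameter groups within the same model. This is useful when different parts of the model (e.g., neural network weights vs. variational parameters) benefit from different optimization strategies. In Pyro's pyro.optim.multi module, MultiOptimizer is the base interface whose step() receives the loss and a dictionary of parameters, and the per-group combination is provided by MixedMultiOptimizer, which takes a list of (names, optim) pairs.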
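
The scheduling idea described for PyroLRScheduler can be illustrated without any library. The sketch below is a generic illustration, not Pyro's API: the class `ExponentialDecay` and the function `train` are hypothetical names, and the objective is a toy quadratic. It shows the common pattern of taking many optimizer steps per epoch and advancing the schedule once per epoch.

```python
class ExponentialDecay:
    """Hypothetical scheduler: multiplies the learning rate by gamma on each step() call."""

    def __init__(self, lr, gamma):
        self.lr = lr
        self.gamma = gamma

    def step(self):
        self.lr *= self.gamma


def train(num_epochs, batches_per_epoch, scheduler):
    """Minimize f(theta) = theta^2 with gradient descent, decaying the rate once per epoch."""
    theta = 5.0
    lrs = []
    for _ in range(num_epochs):
        for _ in range(batches_per_epoch):
            grad = 2.0 * theta               # gradient of theta^2
            theta -= scheduler.lr * grad     # optimizer step with the current rate
        lrs.append(scheduler.lr)             # rate used during this epoch
        scheduler.step()                     # advance the schedule once per epoch
    return theta, lrs


theta, lrs = train(num_epochs=3, batches_per_epoch=10,
                   scheduler=ExponentialDecay(lr=0.1, gamma=0.5))
```

Keeping the scheduler step separate from the optimizer step makes it explicit whether the schedule advances per batch or per epoch, which changes the effective decay rate considerably.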

Usage

Use advanced optimizers when:

Assigning different learning rates or optimizer types to different model parameters (MultiOptimizer).
Training on sparse data where some parameters are updated infrequently (AdagradRMSProp).
Working with time-series or spatially structured models (DCTAdam).
Scaling inference to large datasets across multiple GPUs (HorovodOptimizer).
Applying learning rate schedules to stabilize convergence of variational inference (PyroLRScheduler).

Theoretical Basis

Adagrad with RMSProp decay:

# Adagrad: accumulates squared gradients for per-parameter learning rates
# g_t = gradient at step t
# G_t = G_{t-1} + g_t^2           (accumulated squared gradients)
# theta_t = theta_{t-1} - lr * g_t / (sqrt(G_t) + epsilon)

# Problem: G_t grows monotonically -> learning rate decays to zero

# RMSProp fix: use exponential moving average
# G_t = rho * G_{t-1} + (1 - rho) * g_t^2
# theta_t = theta_{t-1} - lr * g_t / (sqrt(G_t) + epsilon)

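# AdagradRMSProp: combines both (step-size sequence from the ADVI paper, Kucukelbir et al.)
# s_t = alpha * g_t^2 + (1 - alpha) * s_{t-1}     (RMSProp-style moving average, s_1 = g_1^2)
# lr_t = eta * t^(-1/2 + delta)                   (Adagrad-style decaying step size)
# theta_t = theta_{t-1} - lr_t * g_t / (tau + sqrt(s_t))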
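
The update rules above can be compared numerically on a single scalar parameter. This is a minimal sketch, not Pyro's implementation: all function names are hypothetical, and `hybrid_step` assumes the hybrid takes the form of an RMSProp-style moving average combined with a step size that decays with the step count, with illustrative default constants.

```python
import math


def adagrad_step(theta, grad, state, lr=0.1, eps=1e-8):
    """Adagrad: the squared-gradient accumulator only ever grows."""
    state["G"] = state.get("G", 0.0) + grad ** 2
    return theta - lr * grad / (math.sqrt(state["G"]) + eps)


def rmsprop_step(theta, grad, state, lr=0.1, rho=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients."""
    state["G"] = rho * state.get("G", 0.0) + (1 - rho) * grad ** 2
    return theta - lr * grad / (math.sqrt(state["G"]) + eps)


def hybrid_step(theta, grad, state, eta=1.0, alpha=0.1, delta=1e-16, tau=1.0):
    """Assumed hybrid: moving average of g^2 plus a step size decaying like 1/sqrt(k)."""
    k = state.get("k", 0) + 1
    state["k"] = k
    if k == 1:
        state["s"] = grad ** 2           # initialize with the first squared gradient
    else:
        state["s"] = alpha * grad ** 2 + (1 - alpha) * state["s"]
    lr_k = eta * k ** (-0.5 + delta)     # decaying base step size
    return theta - lr_k * grad / (tau + math.sqrt(state["s"]))


def step_sizes(step_fn, n_steps=100, grad=1.0):
    """Apply step_fn repeatedly with a constant gradient and record each step length."""
    theta, state, sizes = 0.0, {}, []
    for _ in range(n_steps):
        new_theta = step_fn(theta, grad, state)
        sizes.append(abs(new_theta - theta))
        theta = new_theta
    return sizes


adagrad_sizes = step_sizes(adagrad_step)
rmsprop_sizes = step_sizes(rmsprop_step)
hybrid_sizes = step_sizes(hybrid_step)
```

With a constant gradient, the Adagrad step length shrinks like 1/sqrt(t), while the RMSProp step length settles near `lr`. The assumed hybrid normalizes by recent gradient magnitude but still decays its base step size over time.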

DCT-domain optimization:

# For a parameter vector theta of length T (e.g., time-series):
# Transform to frequency domain:
# theta_freq = DCT(theta)

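# Optimize in frequency domain (lr_k may differ per frequency k):
# theta_freq_t[k] = theta_freq_{t-1}[k] - lr_k * DCT(gradient)[k]
#
# Note: with an orthonormal DCT and one shared lr, this step is identical to
# plain gradient descent on theta. The benefit comes from per-frequency step
# sizes, e.g. Adam's per-coordinate scaling applied to DCT(gradient).

# Transform back:
# theta_t = IDCT(theta_freq_t)

# Potential benefits:
# - Different effective learning rates for different frequencies
# - Approximately decorrelated gradients for smooth signals (better conditioning)
# - Low-frequency components (smooth structure) can be updated faster
# - High-frequency components (often noise) can be damped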
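
One point in the scheme above is worth checking numerically: with an orthonormal DCT, a plain gradient step with a single learning rate gives the same result in either domain, so any benefit has to come from frequency-dependent step sizes. The sketch below uses a small pure-Python DCT written for illustration. The helper names, the parameter values, and the `lr / (1 + k)` frequency weighting are all assumptions, not part of any library.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out


def idct(c):
    """Inverse of dct (the transpose of the orthonormal DCT-II matrix)."""
    n = len(c)
    out = []
    for i in range(n):
        total = c[0] * math.sqrt(1.0 / n)
        total += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                     for k in range(1, n))
        out.append(total)
    return out


theta = [0.0, 1.0, 0.5, -0.5, 2.0, 1.5, 0.0, -1.0]
grad = [0.3, -0.2, 0.1, 0.4, -0.5, 0.2, 0.0, 0.1]
lr = 0.1

# Plain gradient step in the original (time) domain.
plain = [p - lr * g for p, g in zip(theta, grad)]

# Same step taken in the DCT domain with one shared learning rate.
via_dct = idct([p - lr * g for p, g in zip(dct(theta), dct(grad))])

# Frequency-dependent step sizes (illustrative choice): larger for low frequencies.
freq_lrs = [lr / (1 + k) for k in range(len(theta))]
shaped = idct([p - a * g for p, a, g in zip(dct(theta), freq_lrs, dct(grad))])
```

Here `plain` and `via_dct` agree up to floating-point error, while `shaped` differs because high-frequency gradient components are down-weighted before transforming back.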

Distributed gradient averaging (Horovod):

# On K workers, each computing gradients on a data subset:
# g_k = gradient on worker k's data batch

# All-reduce: compute average gradient across all workers
# g_avg = (1/K) * sum_{k=1}^{K} g_k

# Each worker applies the same update:
# theta = theta - lr * g_avg

# With equal-sized worker batches, equivalent to one step on a K-times larger batch
# Linear speedup with the number of workers in the ideal case only;
# communication overhead reduces it in practice
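
The averaging step can be simulated in a single process to confirm the equivalence with a larger batch. This sketch does not use Horovod: `allreduce_mean` is a hypothetical stand-in for the all-reduce operation, and the model is a toy 1-D least-squares problem. It assumes every worker holds a shard of the same size, which is what makes the average of shard gradients equal to the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def allreduce_mean(values):
    """Stand-in for an all-reduce average; every worker would receive this value."""
    return sum(values) / len(values)


xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [2.0 * x for x in xs]   # data generated with true weight 2.0

K = 4                        # number of simulated workers
shards = [(xs[k::K], ys[k::K]) for k in range(K)]   # equal-sized shards

w, lr = 0.0, 0.01
worker_grads = [grad_mse(w, sx, sy) for sx, sy in shards]   # g_k on each worker
g_avg = allreduce_mean(worker_grads)                        # averaged gradient
g_full = grad_mse(w, xs, ys)                                # single large batch

w_new = w - lr * g_avg       # the update every worker applies
```

If shard sizes differ, the plain average weights examples unevenly and no longer matches the full-batch gradient exactly.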

Multi-optimizer pattern:

# Different parameter groups, different optimizers:
# Group 1 (neural network weights): Adam(lr=1e-3)
# Group 2 (variational location params): Adam(lr=1e-2)
# Group 3 (variational scale params): SGD(lr=1e-3)

# At each step:
# for group, optimizer in groups:
#     grads = compute_gradients(loss, group.params)
#     optimizer.step(grads)

# This allows fine-grained control over how each parameter group is updated
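
The per-group pattern above can be sketched in plain Python. This is an illustration under stated assumptions, not Pyro's MultiOptimizer API: the classes `SGD` and `Adam`, the function `multi_step`, the parameter names, and the toy loss (sum of squares) are all hypothetical.

```python
import math


class SGD:
    """Plain gradient descent on named scalar parameters."""

    def __init__(self, lr):
        self.lr = lr

    def update(self, name, value, grad):
        return value - self.lr * grad


class Adam:
    """Scalar Adam with bias correction, keeping separate state per parameter name."""

    def __init__(self, lr, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.state = {}

    def update(self, name, value, grad):
        m, v, k = self.state.get(name, (0.0, 0.0, 0))
        k += 1
        m = self.beta1 * m + (1 - self.beta1) * grad
        v = self.beta2 * v + (1 - self.beta2) * grad ** 2
        self.state[name] = (m, v, k)
        m_hat = m / (1 - self.beta1 ** k)
        v_hat = v / (1 - self.beta2 ** k)
        return value - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def multi_step(params, grads, groups):
    """Update each named parameter with the optimizer assigned to its group."""
    for names, optimizer in groups:
        for name in names:
            params[name] = optimizer.update(name, params[name], grads[name])


params = {"weight": 1.0, "loc": -2.0, "scale": 0.5}
groups = [
    (["weight"], Adam(lr=1e-3)),   # group 1: network weights
    (["loc"], Adam(lr=1e-2)),      # group 2: variational location parameters
    (["scale"], SGD(lr=1e-3)),     # group 3: variational scale parameters
]

# Toy loss: sum of squared parameters, so each gradient is 2 * value.
grads = {name: 2.0 * value for name, value in params.items()}
multi_step(params, grads, groups)
```

Each optimizer keeps its own state, so the moment estimates of one group never influence the updates of another.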
