
Heuristic: Pyro (pyro-ppl) Gradient Variance Reduction

From Leeroopedia



Knowledge Sources
Domains Variational_Inference, Optimization
Last Updated 2026-02-09 09:00 GMT

Overview

Techniques Pyro uses to reduce the variance of score function gradient estimators: neural-network baselines smoothed by an exponential moving average, tensor detachment, gradient clipping via `ClippedAdam`, and the exclusion of LBFGS from SVI.

Description

Pyro's score function gradient estimators (used for non-reparameterizable distributions) produce high-variance gradients. Pyro mitigates this through several mechanisms: neural-network baselines smoothed by an exponential moving average (decay `beta=0.90`), strategic tensor detachment so that only one gradient path contributes to each score function estimate, and the `ClippedAdam` optimizer, which clips gradient norms before each Adam update. Additionally, LBFGS is explicitly excluded from SVI because its history-based Hessian approximation is incompatible with stochastic optimization.

Usage

Apply this heuristic when training models with discrete latent variables using `TraceGraph_ELBO`, when SVI loss is oscillating or diverging, or when choosing an optimizer for variational inference. Understanding these techniques helps diagnose and fix convergence issues.

The Insight (Rule of Thumb)

  • Action 1: Use `TraceGraph_ELBO` with baselines for models containing non-reparameterizable distributions. The baseline exponential moving average decay is `beta=0.90`.
  • Action 2: Use `ClippedAdam` instead of plain `Adam` when score function gradients have high variance. It wraps Adam with gradient norm clipping.
  • Action 3: Never use LBFGS with SVI; it is explicitly excluded because its history-based Hessian approximation breaks down under stochastic updates.
  • Action 4: Detach intermediate tensor values when computing downstream costs to prevent double-counting gradients in score function estimators.
  • Value: Baseline beta = 0.90; ClippedAdam provides optional gradient clipping on top of Adam.
  • Trade-off: Baselines reduce variance but add complexity and computational cost. Gradient clipping prevents divergence but may slow convergence.
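The norm-clipping rule behind Action 2 can be sketched in a few lines of plain Python. This is an illustration of the general technique only; the function name `clip_grad_norm` is invented here and this is not Pyro's `ClippedAdam` code:

```python
import math

def clip_grad_norm(grads, clip_norm):
    # Rescale the gradient vector so its L2 norm never exceeds clip_norm.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > clip_norm:
        scale = clip_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# A spiky gradient (norm 50) is scaled down to the threshold...
clipped = clip_grad_norm([30.0, 40.0], clip_norm=10.0)
# ...while a well-behaved gradient passes through untouched.
unchanged = clip_grad_norm([0.3, 0.4], clip_norm=10.0)
```

Clipping bounds the size of any single update, so one high-variance score function sample cannot throw the parameters far off course.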

Reasoning

Score function (REINFORCE) gradients multiply a downstream cost by the score `grad_log_q`, so their variance grows with the magnitude of that cost and can be very large. Baselines subtract a control variate, which reduces variance without introducing bias, because the score has zero expectation. The EMA with beta=0.90 provides a smooth estimate of the expected cost that adapts over training.
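The bias/variance claim can be checked numerically with a minimal score function estimator for d/dp E[x] with x ~ Bernoulli(p), whose true value is 1 for any p. This is an illustrative plain-Python sketch (function and variable names are invented, not Pyro APIs); a baseline equal to the expected cost leaves the estimator unbiased while shrinking its spread:

```python
import random
import statistics

def reinforce_grad_samples(p, baseline, n=2000, seed=0):
    # Per-sample score function estimates of d/dp E[x] for x ~ Bernoulli(p).
    # Each sample is (cost - baseline) * d/dp log q(x; p), with cost = x.
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        x = 1.0 if rng.random() < p else 0.0
        grad_log_q = x / p - (1.0 - x) / (1.0 - p)  # score of Bernoulli(p)
        out.append((x - baseline) * grad_log_q)
    return out

# True gradient of E[x] = p is exactly 1.
no_base = reinforce_grad_samples(0.3, baseline=0.0)
with_base = reinforce_grad_samples(0.3, baseline=0.3)  # baseline = E[cost]
# Both estimators stay unbiased (mean near 1), but the baselined
# version has markedly lower variance.
```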

Tensor detachment is critical because score function gradients compute `(cost - baseline) * grad_log_q`. If the cost itself depends on guide parameters (through the surrogate loss), failing to detach it lets gradients flow through both the cost term and the score term, double-counting contributions.

LBFGS maintains a history of gradient-parameter pairs to approximate the Hessian. In stochastic settings, gradient noise makes this approximation unreliable, leading to poor search directions.

Code evidence for baseline beta from `pyro/infer/tracegraph_elbo.py:32,49`:

# XXX default for baseline_beta currently set here
baseline_beta = 0.90
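The decaying-average baseline this default controls follows the standard EMA update `b_t = beta * b_{t-1} + (1 - beta) * cost_t`, which can be sketched in plain Python (`ema_baseline` is a hypothetical helper, not Pyro's implementation; a zero initialization is assumed here):

```python
def ema_baseline(costs, beta=0.90):
    # Decaying-average baseline: b_t = beta * b_{t-1} + (1 - beta) * cost_t,
    # starting from zero, as in a standard exponential moving average.
    b = 0.0
    trace = []
    for cost in costs:
        b = beta * b + (1.0 - beta) * cost
        trace.append(b)
    return trace

# With a constant cost of 1.0 the baseline climbs smoothly toward 1.0,
# warming up from its zero start: 0.1, 0.19, 0.271, ...
trace = ema_baseline([1.0] * 5)
```

With beta = 0.90, roughly the last ten costs dominate the average, which is what lets the baseline track a cost whose scale drifts during training.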

Detachment pattern from `pyro/infer/tracegraph_elbo.py:84,90`:

baseline += nn_baseline(detach(nn_baseline_input))
baseline_loss += torch.pow(downstream_cost.detach() - baseline, 2.0).sum()

LBFGS exclusion from `pyro/optim/pytorch_optimizers.py:18-20`:

if _Optim is torch.optim.LBFGS:
    # XXX LBFGS is not supported for SVI yet
    continue

Downstream cost computation note from `pyro/infer/tracegraph_elbo.py:139`:

# nodes_included_in_sum logic could be more fine-grained, possibly
# leading to speed-ups in case there are many duplicates
