Principle:Pyro ppl Pyro Advanced ELBO Estimators

Knowledge Sources	Variational Inference with Normalizing Flows; Renyi Divergence Variational Inference; Tensor Monte Carlo; Yes, but Did It Work? Evaluating Variational Inference
Domains	Variational Inference, ELBO Estimation, Monte Carlo Methods
Last Updated	2026-02-09 09:00 GMT

Overview

Advanced ELBO estimators extend the standard Evidence Lower Bound with graph-based variance reduction, alternative divergence measures, tail-adaptive strategies, and tensor-based enumeration to improve the accuracy and efficiency of variational inference.

Description

The standard ELBO (Evidence Lower Bound) is the workhorse objective for variational inference:

ELBO = E_q[log p(x, z) - log q(z)]

However, the standard single-sample ELBO estimator can have high variance, may not tightly bound the evidence, and does not exploit the graphical structure of the model. Several advanced estimators address these limitations:
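
As a minimal illustration of the variance and tightness issues, the sketch below evaluates single-sample ELBO terms log p(x, z) - log q(z) in a toy conjugate Gaussian model where the evidence is known in closed form. The model (z ~ N(0, 1), x | z ~ N(z, 1)), the observation X_OBS, and all function names are assumptions made for this example and do not come from Pyro.

```python
import math

def log_normal(value, mean, var):
    """Log density of a univariate normal with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)

X_OBS = 1.3  # hypothetical observation

def log_joint(z):
    # Toy model: z ~ N(0, 1), x | z ~ N(z, 1)
    return log_normal(z, 0.0, 1.0) + log_normal(X_OBS, z, 1.0)

def log_weight(z, q_mean, q_var):
    # Single-sample ELBO term: log p(x, z) - log q(z)
    return log_joint(z) - log_normal(z, q_mean, q_var)

# Closed-form evidence for this model: x ~ N(0, 2)
LOG_EVIDENCE = log_normal(X_OBS, 0.0, 2.0)

Z_VALUES = (-1.0, 0.2, 2.5)  # arbitrary latent values standing in for samples

# q equal to the exact posterior N(x/2, 1/2): every term equals log p(x)
exact_terms = [log_weight(z, X_OBS / 2.0, 0.5) for z in Z_VALUES]

# q equal to the prior N(0, 1): the terms vary strongly from sample to sample
poor_terms = [log_weight(z, 0.0, 1.0) for z in Z_VALUES]
```

When q matches the posterior, each term is constant and equal to the log evidence, so the estimator has zero variance and the bound is tight. With the mismatched q the terms spread over a wide range, which is the behavior the estimators below try to mitigate.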

TraceGraph ELBO: Exploits the dependency graph of the probabilistic program to perform Rao-Blackwellization. For each non-reparameterizable sample site, the score function gradient only needs to account for the cost downstream in the computation graph. By identifying which terms in the ELBO actually depend on each latent variable, the estimator eliminates irrelevant terms from the gradient, substantially reducing variance.

Renyi ELBO: Replaces the KL divergence implicit in the standard ELBO with the Renyi alpha-divergence. For alpha < 1, this gives a tighter bound than the standard ELBO and is more mass-covering, penalizing regions where q underfits p. For alpha > 1, the bound is looser and more mode-seeking (zero-forcing). The limit alpha -> 1 recovers the standard ELBO.

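TraceTailAdaptive ELBO: Addresses the problem that gradient estimates built from importance weights can be dominated by a few extreme samples when the weights are heavy-tailed. This estimator replaces each sample's raw importance weight with a bounded weight derived from its rank in the empirical weight distribution, so that tail samples cannot destabilize the gradient. In Pyro it assumes reparameterized latent variables and provides gradient estimates only, not a value for the objective.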

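TraceTMC ELBO (Tensor Monte Carlo): Draws several samples at each latent site and averages the importance weight over all combinations of those samples, using tensor contractions that exploit the model's conditional independence structure to keep the combinatorial sum tractable. Sites can also be enumerated exhaustively, so in a model with both continuous and discrete latent variables the discrete ones can be summed exactly while the continuous ones are sampled, removing the sampling variance contributed by the discrete variables.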

Trace MMD and Energy Distance: Replace the KL divergence with the Maximum Mean Discrepancy or Energy Distance as the variational objective. These kernel-based or distance-based discrepancies are estimated from samples rather than from density ratios, can be more robust, and avoid the mode-seeking behavior of the reverse KL divergence KL(q || p).
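
The rank-based reweighting behind the tail-adaptive estimator described above can be sketched as follows. This is a simplified reading of the tail-adaptive f-divergence scheme, not Pyro's implementation: the function names are hypothetical, and the choice beta = -1 is an assumption taken as a typical setting.

```python
import math

def self_normalized_weights(log_weights):
    """Ordinary self-normalized importance weights (softmax of the log weights)."""
    m = max(log_weights)
    unnormalized = [math.exp(lw - m) for lw in log_weights]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def tail_adaptive_weights(log_weights, beta=-1.0):
    """Rank-based weights: empirical survival function raised to the power beta."""
    n = len(log_weights)
    # Fraction of samples whose weight is at least as large as sample i's weight
    survival = [
        sum(1 for other in log_weights if other >= lw) / n
        for lw in log_weights
    ]
    gamma = [s ** beta for s in survival]
    total = sum(gamma)
    return [g / total for g in gamma]

# Hypothetical log importance weights with one extreme tail sample
log_w = [0.0, 1.0, 50.0]
raw = self_normalized_weights(log_w)
adapted = tail_adaptive_weights(log_w)
```

With raw importance weights the extreme sample receives essentially all of the mass. The rank-based weights keep the same ordering but depend only on ranks, so no single sample can dominate regardless of how heavy the tail is.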

Usage

Use advanced ELBO estimators when:

Standard ELBO has high gradient variance, especially with discrete latent variables (use TraceGraph, TMC).
You need tighter bounds on the evidence for model comparison (use Renyi ELBO with alpha < 1).
Training is unstable due to extreme importance weights (use TraceTailAdaptive).
The model mixes discrete and continuous variables (use TMC for exact discrete enumeration).
You want mode-covering rather than mode-seeking inference (use MMD or Energy Distance).

Theoretical Basis

TraceGraph ELBO (Rao-Blackwellized gradient):

# Standard score function gradient:
# grad_phi ELBO = E_q[(log p(x,z) - log q(z)) * grad_phi log q(z)]

# Graph-based gradient (for non-reparameterizable site z_i):
# grad_phi_i ELBO = E_q[downstream_cost(z_i) * grad_phi_i log q(z_i|phi)]

# where downstream_cost(z_i) = sum over sites j downstream of i (including i itself):
#   log p_j - log q_j  (only costs that depend on z_i)

# The removed terms have zero expectation, so the gradient stays unbiased
# while its variance is reduced in typical cases:
# Var(graph_gradient) <= Var(standard_gradient)
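
The effect of dropping irrelevant cost terms can be checked exactly on a tiny example. The sketch below assumes two independent Bernoulli latent sites with made-up per-site costs, and enumerates all four outcomes to compare the mean and variance of the standard and graph-based score-function estimators for the first site's parameter. All names and numbers are hypothetical.

```python
from itertools import product

# Guide probabilities q(z_i = 1) = sigmoid(phi_i), chosen arbitrarily
THETA = {"z1": 0.3, "z2": 0.6}

# Hypothetical per-site cost terms (stand-ins for log p_j - log q_j)
COST = {"z1": {0: 1.0, 1: 3.0}, "z2": {0: -2.0, 1: 5.0}}

def prob(site, value):
    return THETA[site] if value == 1 else 1.0 - THETA[site]

def score(site, value):
    # d/dphi log Bernoulli(value; sigmoid(phi)) = value - sigmoid(phi)
    return value - THETA[site]

def standard_estimator(z1, z2):
    # Multiplies the score of z1 by the full cost, including the z2 term
    return (COST["z1"][z1] + COST["z2"][z2]) * score("z1", z1)

def graph_estimator(z1, z2):
    # Keeps only the cost that actually depends on z1
    return COST["z1"][z1] * score("z1", z1)

def exact_mean_and_variance(estimator):
    outcomes = list(product((0, 1), repeat=2))
    weights = [prob("z1", a) * prob("z2", b) for a, b in outcomes]
    values = [estimator(a, b) for a, b in outcomes]
    mean = sum(w * v for w, v in zip(weights, values))
    second = sum(w * v * v for w, v in zip(weights, values))
    return mean, second - mean ** 2

full_mean, full_var = exact_mean_and_variance(standard_estimator)
graph_mean, graph_var = exact_mean_and_variance(graph_estimator)
```

Both estimators have the same expectation because the z2 cost multiplied by the z1 score averages to zero, but in this example the graph-based version has a much smaller variance.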

Renyi ELBO:

# Renyi alpha-divergence between q(z) and the posterior p(z|x):
# D_alpha(q || p) = 1/(alpha-1) * log E_q[(q(z)/p(z|x))^{alpha-1}]

# Renyi ELBO (for alpha >= 0, alpha != 1):
# L_alpha = 1/(1-alpha) * log E_q[(p(x,z)/q(z))^{1-alpha}]
#         = log p(x) - D_alpha(q || p)

# Estimated with K samples:
# L_alpha_hat = 1/(1-alpha) * log(1/K * sum_k (p(x,z_k)/q(z_k))^{1-alpha})

# Properties (L_alpha is non-increasing in alpha):
# alpha -> 1: recovers standard ELBO
# alpha < 1: tighter bound than the standard ELBO, more mass-covering
# alpha > 1: looser bound, more zero-forcing (mode-seeking)
# alpha = 0: L_0 = log p(x); the K-sample estimate is the importance-weighted
#            bound, which can have high variance
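
The K-sample estimator above can be sketched in a few lines of plain Python. The function names and the example log weights are assumptions for illustration. The computation works in log space with a log-mean-exp for numerical stability, and handles alpha = 1 separately as the ordinary average of log weights.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def renyi_bound(log_weights, alpha):
    """K-sample Renyi bound from log importance weights log p(x, z_k) - log q(z_k)."""
    if abs(alpha - 1.0) < 1e-12:
        # Limit alpha -> 1: the standard ELBO estimate
        return sum(log_weights) / len(log_weights)
    scaled = [(1.0 - alpha) * lw for lw in log_weights]
    return log_mean_exp(scaled) / (1.0 - alpha)

# Hypothetical log importance weights from K = 4 samples
log_w = [-1.0, -2.0, -0.5, -3.0]
bounds = {alpha: renyi_bound(log_w, alpha) for alpha in (0.0, 0.5, 1.0, 2.0)}
```

For a fixed set of samples the estimate is the log of a power mean of the weights with exponent 1 - alpha, so it can only decrease as alpha grows. This matches the ordering of the bounds: alpha = 0 gives the largest value and alpha > 1 falls below the standard ELBO estimate.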

Tensor Monte Carlo (TMC):

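# For a model with discrete z_d and continuous z_c:
# Standard: sample both z_d and z_c (high variance from z_d)
# TMC: draw K samples at each sampled site and average over all combinations
#      of them; sites marked for enumeration are summed exactly

# With z_d enumerated in the model and z_c^1, ..., z_c^K ~ q(z_c):
# L_TMC = E_q[log(1/K * sum_k sum_{z_d} p(x, z_d, z_c^k) / q(z_c^k))] <= log p(x)

# The exact discrete sum eliminates sampling variance from discrete variables
# Tensor operations exploit conditional independence to handle the combinatorial sum efficiently
# Cost for one enumerated and one sampled site: O(|support(z_d)| * K);
# n sampled sites give K^n combinations, which the contraction never materializes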
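
To show why tensor operations make the combinatorial sum affordable, the sketch below assumes a chain of three latent sites with K samples each and compares a brute-force average over all K^3 sample combinations with a left-to-right contraction. The factor tables are random positive numbers standing in for density ratios evaluated at sampled values. They and all names are hypothetical.

```python
import math
import random

random.seed(0)
K = 4  # samples per latent site

def random_vector():
    return [random.uniform(0.1, 1.0) for _ in range(K)]

def random_matrix():
    return [random_vector() for _ in range(K)]

# Stand-ins for the factors of a chain z1 -> z2 -> z3 -> x:
f1 = random_vector()    # p(z1^a) / q(z1^a)
f12 = random_matrix()   # p(z2^b | z1^a) / q(z2^b)
f23 = random_matrix()   # p(z3^c | z2^b) / q(z3^c)
f3x = random_vector()   # p(x | z3^c)

def brute_force_average():
    # Visits every one of the K^3 combinations of samples
    total = 0.0
    for a in range(K):
        for b in range(K):
            for c in range(K):
                total += f1[a] * f12[a][b] * f23[b][c] * f3x[c]
    return total / K ** 3

def contracted_average():
    # Eliminates one site at a time; each step costs O(K^2)
    msg2 = [sum(f1[a] * f12[a][b] for a in range(K)) for b in range(K)]
    msg3 = [sum(msg2[b] * f23[b][c] for b in range(K)) for c in range(K)]
    return sum(msg3[c] * f3x[c] for c in range(K)) / K ** 3

log_evidence_estimate = math.log(contracted_average())
```

Both functions compute the same average, but the contraction scales linearly in the number of sites for a chain, whereas the brute-force loop scales exponentially.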

Maximum Mean Discrepancy (MMD):

# MMD between distributions p and q using kernel k:
# MMD^2(p, q) = E_{p,p'}[k(z, z')] - 2*E_{p,q}[k(z, z')] + E_{q,q'}[k(z, z')]

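# Variational objective (the MMD term replaces the KL term of the ELBO):
# min_phi  -E_q[log p(x|z)] + lambda * MMD^2(p(z), q_phi(z))
# where p(z) is the prior, q_phi(z) the marginal variational posterior,
# and lambda a scale factor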

# Estimated with samples:
# MMD^2_hat = 1/n^2 sum_{ij} k(z_p^i, z_p^j)
#           - 2/nm sum_{ij} k(z_p^i, z_q^j)
#           + 1/m^2 sum_{ij} k(z_q^i, z_q^j)
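
The sample estimate above translates directly into code. The sketch below assumes one-dimensional samples and a Gaussian (RBF) kernel with a fixed lengthscale. Both are choices made for this example, and the function names are hypothetical.

```python
import math

def rbf_kernel(a, b, lengthscale=1.0):
    """Gaussian kernel on scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * lengthscale ** 2))

def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs (x, y)."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd_squared(p_samples, q_samples, kernel=rbf_kernel):
    """Biased (V-statistic) estimate of MMD^2 between two sample sets."""
    return (
        mean_kernel(p_samples, p_samples, kernel)
        - 2.0 * mean_kernel(p_samples, q_samples, kernel)
        + mean_kernel(q_samples, q_samples, kernel)
    )

# Hypothetical sample sets
p_samples = [-1.0, 0.0, 1.0]
q_near = [-1.0, 0.0, 1.0]
q_far = [4.0, 5.0, 6.0]

mmd_same = mmd_squared(p_samples, q_near)
mmd_far = mmd_squared(p_samples, q_far)
```

Identical sample sets give an estimate of zero, and the value grows as the two sets separate. The estimate needs only samples and kernel evaluations, never a density ratio.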

Energy Distance:

# Energy distance between p and q:
# D_E(p, q) = 2*E[|X - Y|] - E[|X - X'|] - E[|Y - Y'|]
# where X, X' ~ p and Y, Y' ~ q

# D_E is non-negative and zero only when p = q (its square root is a metric);
# it needs only samples, so it can be used as a likelihood-free training objective
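
A sample-based version of the energy distance can be sketched as follows for one-dimensional samples. The function names and the example data are assumptions for illustration. The estimate averages absolute differences over all pairs, including each point paired with itself.

```python
def mean_abs_difference(xs, ys):
    """Average |x - y| over all pairs (x, y)."""
    return sum(abs(x - y) for x in xs for y in ys) / (len(xs) * len(ys))

def energy_distance(p_samples, q_samples):
    """V-statistic estimate of 2 E|X - Y| - E|X - X'| - E|Y - Y'|."""
    return (
        2.0 * mean_abs_difference(p_samples, q_samples)
        - mean_abs_difference(p_samples, p_samples)
        - mean_abs_difference(q_samples, q_samples)
    )

# Hypothetical sample sets
p_samples = [0.0, 1.0, 2.0]
q_samples = [5.0, 6.0, 7.0]

distance_same = energy_distance(p_samples, p_samples)
distance_far = energy_distance(p_samples, q_samples)
```

The estimate is zero when the two sample sets coincide and positive when they differ, using nothing but pairwise distances between samples.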
