Principle: Pyro Mean-Field Variational Inference
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Paper (Stochastic Variational Inference, Hoffman et al. 2013), Repo (Pyro) |
| Domains | Bayesian_Inference, Variational_Inference |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Mean-field variational inference approximates the posterior distribution over latent variables with a fully factorized product of independent distributions, typically one Normal distribution per latent variable. This sacrifices the ability to capture posterior correlations in exchange for computational tractability and scalable optimization.
Description
In Bayesian inference, the true posterior distribution p(z|x) over latent variables z given observed data x is generally intractable. Variational inference reformulates posterior computation as an optimization problem: find an approximate distribution q(z) within a tractable family that minimizes the Kullback-Leibler divergence to the true posterior.
The Mean-Field Assumption
The mean-field assumption imposes the strongest factorization: the variational distribution is a product of independent marginals, one per latent variable:
q(z) = product_i q_i(z_i)
Each factor q_i(z_i) is typically chosen to be a Normal (Gaussian) distribution with its own location and scale parameters:
q_i(z_i) = Normal(loc_i, scale_i)
This yields a total of 2d variational parameters for d latent variables (one location and one scale per variable). The independence assumption means that the resulting covariance matrix is diagonal -- all off-diagonal entries are exactly zero.
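As a minimal sketch in plain Python (the site count and parameter values are hypothetical, not from the source), the 2d-parameter structure and the diagonal-covariance log density look like:

```python
import math

# Hypothetical mean-field parameters for d = 3 latent variables:
# one location and one scale per site -> 2 * d = 6 parameters total.
locs = [0.0, 1.5, -2.0]
scales = [1.0, 0.5, 2.0]

def normal_logpdf(x, loc, scale):
    """Log density of a univariate Normal(loc, scale)."""
    return -0.5 * math.log(2 * math.pi * scale**2) - (x - loc)**2 / (2 * scale**2)

def mean_field_logq(z, locs, scales):
    """log q(z) = sum_i log q_i(z_i): independence makes the joint
    log density a sum of per-site terms (diagonal covariance)."""
    return sum(normal_logpdf(zi, l, s) for zi, l, s in zip(z, locs, scales))

print(mean_field_logq([0.3, 1.0, -1.0], locs, scales))
```

Because the factors are independent, evaluating the joint log density never touches cross-terms between sites, which is exactly the diagonal-covariance restriction described above.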
ELBO Optimization
The parameters {loc_i, scale_i} are optimized by maximizing the Evidence Lower BOund (ELBO):
ELBO(q) = E_q[log p(x, z)] - E_q[log q(z)]
Maximizing the ELBO is equivalent to minimizing KL(q(z) || p(z|x)). Under the mean-field assumption, the ELBO decomposes into per-site terms, making gradient computation particularly efficient.
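The ELBO above can be estimated by Monte Carlo. The sketch below (plain Python, with a hypothetical one-observation conjugate model chosen so the answer is checkable) draws reparameterized samples z = loc + scale * eps and averages log p(x, z) - log q(z); when q equals the exact posterior, the KL term vanishes and the ELBO equals the log evidence:

```python
import math, random

random.seed(0)
x_obs = 1.0  # single observed data point (hypothetical toy model)

def log_joint(z, x):
    """log p(x, z) for the toy model: z ~ Normal(0, 1), x | z ~ Normal(z, 1)."""
    lp_z = -0.5 * math.log(2 * math.pi) - z**2 / 2
    lp_x = -0.5 * math.log(2 * math.pi) - (x - z)**2 / 2
    return lp_z + lp_x

def elbo(loc, scale, n_samples=10_000):
    """Monte Carlo ELBO: E_q[log p(x, z) - log q(z)], sampling z via the
    reparameterization z = loc + scale * eps with eps ~ Normal(0, 1)."""
    total = 0.0
    for _ in range(n_samples):
        eps = random.gauss(0.0, 1.0)
        z = loc + scale * eps
        log_q = -0.5 * math.log(2 * math.pi * scale**2) - (z - loc)**2 / (2 * scale**2)
        total += log_joint(z, x_obs) - log_q
    return total / n_samples

# The exact posterior here is Normal(x/2, sqrt(1/2)); setting q to it makes
# KL(q || p(z|x)) = 0, so the ELBO equals the log evidence log p(x).
log_evidence = -0.5 * math.log(2 * math.pi * 2) - x_obs**2 / 4
print(elbo(0.5, math.sqrt(0.5)), log_evidence)
```

Plugging in a suboptimal (loc, scale) gives a strictly smaller estimate, illustrating that the ELBO is a lower bound maximized at the true posterior.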
Tradeoffs
- Computational tractability: The number of variational parameters scales linearly as O(d) rather than quadratically as O(d^2) for a full-covariance approximation.
- Posterior correlations are lost: Because each q_i is independent, the approximation cannot capture correlations between latent variables. This can lead to underestimation of posterior variance and overconfident credible intervals.
- Scalability: Mean-field VI combines naturally with stochastic gradient methods and data subsampling, enabling application to large datasets and models with many parameters.
- Mode-seeking behavior: Minimizing KL(q || p) tends to place mass on modes of the posterior rather than covering the full posterior mass, which can miss multimodality.
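The variance-underestimation tradeoff can be made concrete with a standard worked example (not from the source): for a correlated 2-D Gaussian target, the KL(q || p)-optimal mean-field Gaussian matches each coordinate's conditional precision, 1/Lambda_ii with Lambda = Sigma^{-1}, rather than its marginal variance Sigma_ii:

```python
# Hypothetical 2-D Gaussian posterior with correlation rho = 0.9.
rho = 0.9
Sigma = [[1.0, rho], [rho, 1.0]]

# Precision matrix diagonal via the closed-form 2x2 inverse.
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
Lambda_00 = Sigma[1][1] / det

# KL(q || p)-optimal mean-field factor variance = 1 / Lambda_ii
# (the conditional variance), not the marginal variance Sigma_ii.
mean_field_var = 1.0 / Lambda_00  # = 1 - rho^2, about 0.19 here
print(Sigma[0][0], mean_field_var)
```

With rho = 0.9 the true marginal variance is 1.0 but the mean-field factor shrinks to about 0.19, which is the source of the overconfident credible intervals noted above.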
Usage
Mean-field variational inference is applied when:
- Full posterior inference is intractable: Models with many latent variables where MCMC sampling would be prohibitively slow.
- Approximate uncertainty is sufficient: Applications where marginal uncertainty per variable is more important than joint posterior correlations.
- Scalability to large data: When the dataset is too large for batch MCMC methods, stochastic variational inference with mean-field guides enables minibatch training.
- Quick prototyping: As a fast baseline approximation before investing in more expressive variational families (e.g., full-rank or normalizing flows).
In Pyro, the mean-field assumption is implemented by the AutoNormal guide, which automatically constructs an independent Normal variational distribution for each latent site discovered in the model.
Theoretical Basis
Variational Inference Framework
Given a model with joint distribution p(x, z) = p(x|z) p(z), the log marginal likelihood satisfies:
log p(x) = ELBO(q) + KL(q(z) || p(z|x))
Since the KL divergence is non-negative, the ELBO is a lower bound on the log evidence. Maximizing the ELBO tightens this bound and drives q(z) closer to the true posterior.
Mean-Field Factorization
The mean-field family restricts q to fully factorized distributions:
Q_MF = { q : q(z) = product_{i=1}^{d} q_i(z_i) }
When each factor is Gaussian, this yields:
q(z) = product_{i=1}^{d} Normal(z_i; loc_i, scale_i)
The entropy of this factorized distribution is the sum of individual entropies:
H[q] = sum_{i=1}^{d} H[q_i] = sum_{i=1}^{d} (1/2) log(2 * pi * e * scale_i^2)
This additive decomposition makes gradient computation straightforward and enables per-site parameter updates.
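The entropy decomposition can be verified numerically; this plain-Python sketch (hypothetical locations and scales) compares the analytic per-site sum against a Monte Carlo estimate of -E_q[log q(z)]:

```python
import math, random

random.seed(1)
locs = [0.0, 1.5, -2.0]
scales = [1.0, 0.5, 2.0]  # hypothetical per-site scales; entropy ignores locs

# Analytic entropy of the factorized Gaussian: sum of per-site entropies.
analytic_H = sum(0.5 * math.log(2 * math.pi * math.e * s**2) for s in scales)

def logq(z):
    """log q(z) of the factorized Normal, as a sum of per-site terms."""
    return sum(-0.5 * math.log(2 * math.pi * s**2) - (zi - l)**2 / (2 * s**2)
               for zi, l, s in zip(z, locs, scales))

# Monte Carlo check of H[q] = -E_q[log q(z)].
n = 100_000
mc_H = -sum(logq([random.gauss(l, s) for l, s in zip(locs, scales)])
            for _ in range(n)) / n
print(analytic_H, mc_H)
```

The two numbers agree to Monte Carlo error, confirming that the joint entropy is exactly the sum of the independent factors' entropies.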
Stochastic Variational Inference
Hoffman et al. (2013) showed that the mean-field ELBO can be optimized using stochastic gradient ascent with noisy gradient estimates obtained from minibatches of data. The key insight is that the ELBO gradient with respect to variational parameters can be estimated using a small number of Monte Carlo samples from q, combined with the reparameterization trick for continuous latent variables.
This enables mean-field VI to scale to datasets with millions of observations, in contrast to classical coordinate-ascent variational inference which requires full passes over the data.
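The subsampling step can be illustrated in isolation. In this plain-Python sketch (hypothetical dataset and a fixed latent value standing in for a sample from q), the minibatch log-likelihood rescaled by N/B is an unbiased estimate of the full-data log-likelihood term inside the ELBO:

```python
import math, random

random.seed(2)
N, B = 1000, 50
data = [random.gauss(3.0, 1.0) for _ in range(N)]  # hypothetical observations
z = 2.8  # a fixed latent value, standing in for one sample from q

def loglik(x, z):
    """log p(x | z) for a unit-variance Gaussian likelihood."""
    return -0.5 * math.log(2 * math.pi) - (x - z)**2 / 2

# Full-data log-likelihood (what classical coordinate ascent would need).
full = sum(loglik(x, z) for x in data)

def minibatch_estimate():
    """Rescaled minibatch sum: (N / B) * sum over B subsampled points."""
    batch = random.sample(data, B)
    return (N / B) * sum(loglik(x, z) for x in batch)

# Averaging many noisy minibatch estimates recovers the full-data term.
avg = sum(minibatch_estimate() for _ in range(5000)) / 5000
print(full, avg)
```

Each noisy estimate touches only B of the N points, which is why one stochastic gradient step is cheap even when N is in the millions.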