Principle:Pyro ppl Pyro Stein Variational Inference
| Knowledge Sources | |
|---|---|
| Domains | Variational Inference, Kernel Methods, Particle Methods |
| Last Updated | 2026-02-09 09:00 GMT |
Overview
Stein Variational Gradient Descent (SVGD) is a non-parametric variational inference algorithm that approximates the posterior using a set of interacting particles, combining the flexibility of particle methods with the efficiency of gradient-based optimization.
Description
Traditional variational inference restricts the approximate posterior to a parametric family (e.g., Gaussian) and optimizes its parameters. This limits expressiveness -- the true posterior may be multimodal, skewed, or otherwise poorly approximated by the chosen family.
SVGD takes a different approach: it represents the approximate posterior as a set of particles (points in parameter space) and iteratively moves them to approximate the target distribution. The particles interact through a kernel function that balances two objectives:
- Driving particles toward high-probability regions of the target (using the gradient of the log-posterior).
- Repelling particles from each other to maintain diversity and cover the full posterior (using the gradient of the kernel).
This balance comes from the Stein operator: among all perturbation directions in the unit ball of a reproducing kernel Hilbert space (RKHS), the SVGD update is the one that most rapidly decreases the KL divergence between the particle distribution and the target.
Key properties of SVGD:
- Non-parametric: The particle approximation can capture arbitrary posterior shapes including multimodality.
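- Deterministic updates: Unlike MCMC, SVGD injects no random noise after initialization. Particles follow deterministic gradient-based updates, so the iteration can be driven by a standard gradient optimizer. The step size and kernel bandwidth still need to be chosen.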
- Parallelizable: All particles can be updated simultaneously.
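- Converges to the target: In the limit of infinitely many particles and vanishing step size, and under suitable regularity conditions, the particle distribution converges to the true posterior. With finitely many particles the result is an approximation.
- Reduces to MAP with one particle: With a single particle the repulsive term vanishes (for kernels such as the RBF kernel, whose gradient is zero at x = x'), so the update becomes plain gradient ascent on the log-posterior, i.e. maximum a posteriori estimation.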
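The single-particle property can be illustrated with a small sketch. It assumes a 1D target with unnormalized log-density log p(x) = 2 log x - 2x (mode at x = 1) and an RBF kernel with a fixed bandwidth; the function name `svgd_direction` and all numbers are hypothetical choices for illustration, not part of any library.

```python
import math

def svgd_direction(xs, score, h=1.0):
    """Empirical SVGD direction phi*(x_i) for 1D particles with an RBF kernel."""
    n = len(xs)
    out = []
    for xi in xs:
        total = 0.0
        for xj in xs:
            k = math.exp(-(xj - xi) ** 2 / (2.0 * h * h))
            drive = k * score(xj)                # kernel-weighted score
            repel = -(xj - xi) / (h * h) * k     # d/dx_j k(x_j, x_i)
            total += drive + repel
        out.append(total / n)
    return out

# Assumed target: log p(x) = 2*log(x) - 2*x + const, so the mode is at x = 1.
score = lambda x: 2.0 / x - 2.0

# With one particle, k(x, x) = 1 and the kernel gradient is 0,
# so the SVGD direction is exactly the score.
x = [0.5]
for _ in range(500):
    x = [x[0] + 0.05 * svgd_direction(x, score)[0]]
map_estimate = x[0]
print(map_estimate)
```

The loop is therefore ordinary gradient ascent on log p, and the particle settles at the mode.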
Usage
Use SVGD when:
- The posterior is expected to be multimodal and parametric variational families are inadequate.
- You want a non-parametric approximation with gradient-based efficiency.
- You need an ensemble of posterior samples for uncertainty quantification.
- The target distribution has a tractable gradient of the log-density (score function).
- You want better-calibrated uncertainty estimates than mean-field VI provides.
Theoretical Basis
Stein's identity and the Stein operator:
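# Stein's identity: for a smooth function phi (with p(x) * phi(x) vanishing
# at the boundary) and target density p:
#   E_p[A_p phi(x)] = 0
# where A_p is the Stein operator:
#   A_p phi(x) = grad_x log p(x) * phi(x) + grad_x phi(x)
# (for vector-valued phi, take the trace: grad_x log p(x)^T phi(x) + div phi(x))
# The Stein discrepancy measures how far q is from p:
#   S(q, p) = max_{phi in F} (E_q[A_p phi(x)])^2
# When F is the unit ball of an RKHS with kernel k, the maximizer phi* has a
# closed form, proportional to
#   E_{x ~ q}[k(x, .) * grad_x log p(x) + grad_x k(x, .)]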
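Stein's identity can be checked by Monte Carlo in a simple case. The sketch below assumes a standard normal target p = N(0, 1) (score -x) and the test function phi(x) = sin(x); it also evaluates the same expectation under a shifted distribution q = N(1, 1) to show that it is no longer zero. The helper names are illustrative.

```python
import math
import random

def stein_operator(phi, dphi, score, x):
    """Scalar Stein operator: A_p phi(x) = score(x) * phi(x) + phi'(x)."""
    return score(x) * phi(x) + dphi(x)

def mc_stein_expectation(sampler, n=200_000, seed=0):
    """Monte Carlo estimate of E[A_p sin(x)] for p = N(0, 1), x drawn by `sampler`."""
    rng = random.Random(seed)
    score = lambda x: -x  # score of the standard normal target
    total = 0.0
    for _ in range(n):
        x = sampler(rng)
        total += stein_operator(math.sin, math.cos, score, x)
    return total / n

# Samples from the target itself: the expectation should be close to zero.
under_p = mc_stein_expectation(lambda rng: rng.gauss(0.0, 1.0))
# Samples from a shifted distribution q = N(1, 1): clearly non-zero.
under_q = mc_stein_expectation(lambda rng: rng.gauss(1.0, 1.0))
print(under_p, under_q)
```

The size of the second value is what the Stein discrepancy turns into a measure of mismatch between q and p.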
SVGD update rule:
# Given n particles {x_1, ..., x_n} and kernel k(x, x'):
# Update each particle:
# phi*(x_i) = (1/n) * sum_{j=1}^{n} [
#     k(x_j, x_i) * grad_{x_j} log p(x_j)   # driving force
#     + grad_{x_j} k(x_j, x_i)              # repulsive force
# ]
# x_i <- x_i + epsilon * phi*(x_i)
# The first term: moves x_i toward high-probability regions
#   (weighted average of score functions at all particles,
#   with higher weight for nearby particles)
# The second term: pushes x_i away from other particles
#   (prevents collapse to a single mode)
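The update rule above can be sketched in a few lines of pure Python. This is a minimal 1D illustration under stated assumptions: a Gaussian target N(2, 1), an RBF kernel with fixed bandwidth h = 1, a plain fixed step size, and 30 particles started far from the target. The function name `svgd_step` and all numeric settings are hypothetical; a practical implementation would vectorize the loops and typically use an adaptive optimizer.

```python
import math
import random
import statistics

def svgd_step(xs, score, h, eps):
    """One SVGD update for 1D particles xs with an RBF kernel of bandwidth h."""
    n = len(xs)
    new_xs = []
    for xi in xs:
        phi = 0.0
        for xj in xs:
            k = math.exp(-(xj - xi) ** 2 / (2.0 * h ** 2))
            drive = k * score(xj)               # driving force
            repel = -(xj - xi) / h ** 2 * k     # repulsive force: d/dx_j k(x_j, x_i)
            phi += drive + repel
        new_xs.append(xi + eps * phi / n)       # all particles use the old positions
    return new_xs

# Assumed target: N(2, 1), whose score is -(x - 2).
score = lambda x: -(x - 2.0)

rng = random.Random(0)
particles = [rng.gauss(-3.0, 0.5) for _ in range(30)]  # start far from the target
for _ in range(1000):
    particles = svgd_step(particles, score, h=1.0, eps=0.1)

print(statistics.mean(particles), statistics.pstdev(particles))
```

The driving term carries the whole cloud toward x = 2, while the repulsive term keeps the particles spread out instead of letting them pile up at the mode.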
Kernel choice:
# Common choice: RBF (Gaussian) kernel
# k(x, x') = exp(-|x - x'|^2 / (2 * h^2))
# Bandwidth h is typically set via the median heuristic:
# h = median({|x_i - x_j| : i != j}) / sqrt(2 * log(n))
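# Gradient with respect to the first argument (the one used in the update rule):
# grad_x k(x, x') = -(x - x') / h^2 * k(x, x')
# (by symmetry, grad_{x'} k(x, x') = +(x - x') / h^2 * k(x, x'))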
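A small sketch of the kernel pieces for 1D particles follows: the median-heuristic bandwidth, the RBF kernel, and its gradient with respect to the first argument, checked against a central finite difference. Function names and the example particle positions are illustrative assumptions.

```python
import math
from statistics import median

def median_bandwidth(particles):
    """Median heuristic: h = median pairwise distance / sqrt(2 * log(n)); needs n >= 2."""
    n = len(particles)
    dists = [abs(particles[i] - particles[j])
             for i in range(n) for j in range(n) if i != j]
    return median(dists) / math.sqrt(2.0 * math.log(n))

def rbf(x, y, h):
    """RBF kernel k(x, y) = exp(-(x - y)^2 / (2 h^2))."""
    return math.exp(-(x - y) ** 2 / (2.0 * h ** 2))

def grad_rbf_first(x, y, h):
    """Derivative of k(x, y) with respect to its first argument x."""
    return -(x - y) / h ** 2 * rbf(x, y, h)

particles = [-1.0, 0.0, 0.5, 2.0]
h = median_bandwidth(particles)

# Compare the analytic gradient with a central finite difference.
x, y, delta = 0.3, -0.8, 1e-6
analytic = grad_rbf_first(x, y, h)
numeric = (rbf(x + delta, y, h) - rbf(x - delta, y, h)) / (2.0 * delta)
print(h, analytic, numeric)
```

Because the bandwidth is recomputed from the current particles, the kernel widens or narrows with the spread of the particle cloud.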
Theoretical guarantee:
# SVGD performs steepest descent in the KL divergence:
# The update phi* minimizes, over the unit ball of the RKHS H:
#   d/dt KL(q_t || p) at t=0
# where q_t is the distribution of particles after perturbation x -> x + t*phi(x)
# Specifically:
#   d/dt KL(q_t || p)|_{t=0} = -E_q[trace(A_p phi(x))]
#   phi* = argmax_{||phi||_H <= 1} {-d/dt KL(q_t || p)|_{t=0}}
#        = E_q[k(., x) grad_x log p(x) + grad_x k(., x)] / ||...||_H
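The derivative identity can be verified numerically in a case where every quantity has a closed form. The sketch assumes q = N(m, s^2), p = N(0, 1) and a linear perturbation phi(x) = a + b*x, for which the perturbed distribution stays Gaussian; the function names and the specific numbers are illustrative only.

```python
import math

def kl_gauss_to_std_normal(m, s):
    """Closed-form KL( N(m, s^2) || N(0, 1) )."""
    return 0.5 * (s * s + m * m - 1.0) - math.log(s)

def kl_after_perturbation(t, m, s, a, b):
    """KL(q_t || p) where x -> x + t*(a + b*x) maps N(m, s^2) to another Gaussian."""
    new_mean = m + t * (a + b * m)
    new_std = s * (1.0 + t * b)   # requires 1 + t*b > 0
    return kl_gauss_to_std_normal(new_mean, new_std)

def neg_stein_expectation(m, s, a, b):
    """-E_q[score(x)*phi(x) + phi'(x)] with score(x) = -x and phi(x) = a + b*x."""
    # E_q[x * (a + b*x)] = a*m + b*(s^2 + m^2), and phi'(x) = b.
    return a * m + b * (s * s + m * m) - b

m, s, a, b = 1.5, 0.7, -0.4, 0.3
t = 1e-6
# Central finite difference of KL(q_t || p) at t = 0.
finite_diff = (kl_after_perturbation(t, m, s, a, b)
               - kl_after_perturbation(-t, m, s, a, b)) / (2.0 * t)
closed_form = neg_stein_expectation(m, s, a, b)
print(finite_diff, closed_form)
```

Both numbers agree up to finite-difference error, which is the scalar case of the identity that SVGD exploits when it picks the direction maximizing the Stein expectation.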