Principle:Pyro ppl Pyro Stein Variational Inference

From Leeroopedia


Knowledge Sources
Domains Variational Inference, Kernel Methods, Particle Methods
Last Updated 2026-02-09 09:00 GMT

Overview

Stein Variational Gradient Descent (SVGD) is a non-parametric variational inference algorithm that approximates the posterior using a set of interacting particles, combining the flexibility of particle methods with the efficiency of gradient-based optimization.

Description

Traditional variational inference restricts the approximate posterior to a parametric family (e.g., Gaussian) and optimizes its parameters. This limits expressiveness -- the true posterior may be multimodal, skewed, or otherwise poorly approximated by the chosen family.

SVGD takes a different approach: it represents the approximate posterior as a set of particles (points in parameter space) and iteratively moves them to approximate the target distribution. The particles interact through a kernel function that balances two objectives:

  1. Driving particles toward high-probability regions of the target (using the gradient of the log-posterior).
  2. Repelling particles from each other to maintain diversity and cover the full posterior (using the gradient of the kernel).

This balance is achieved through the Stein operator, which provides the direction of steepest descent in the KL divergence between the particle distribution and the target, within a reproducing kernel Hilbert space (RKHS).

Key properties of SVGD:

  • Non-parametric: The particle approximation can capture arbitrary posterior shapes including multimodality.
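  • Deterministic updates: Given the current particle positions and exact gradients, each SVGD step is a deterministic gradient-based update. Unlike MCMC, there is no accept/reject step and no injected noise; randomness enters only through particle initialization or minibatch gradient estimates.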
  • Parallelizable: All particles can be updated simultaneously.
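  • Converges to the target: In the limit of infinitely many particles, and under suitable regularity conditions, the particle distribution converges to the true posterior. With a finite number of particles the result is an approximation.
  • Reduces to MAP with one particle: With a single particle and a kernel whose gradient vanishes at zero separation (e.g., RBF), the repulsive term is zero and SVGD reduces to gradient ascent on the log-posterior, i.e. maximum a posteriori estimation.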
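The single-particle property can be checked directly. The following is a minimal sketch, assuming a scalar parameter, an RBF kernel with unit bandwidth, and an illustrative Gaussian target N(2, 1); the function names and the step size are hypothetical choices for this example, not part of any library.

```python
import math

def rbf(x, y, h=1.0):
    """RBF kernel k(x, y) = exp(-(x - y)^2 / (2 h^2)) for scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * h * h))

def grad_rbf_first(x, y, h=1.0):
    """Derivative of the RBF kernel with respect to its first argument."""
    return -(x - y) / (h * h) * rbf(x, y, h)

def score(x, mu=2.0, sigma=1.0):
    """Score d/dx log p(x) of the illustrative target N(mu, sigma^2)."""
    return -(x - mu) / (sigma * sigma)

def single_particle_phi(x):
    # With n = 1 the sum over particles has one term (j = i):
    # k(x, x) = 1 and the kernel gradient at zero separation is 0,
    # so the SVGD direction is just the score of the target.
    return rbf(x, x) * score(x) + grad_rbf_first(x, x)

x = -3.0                      # arbitrary starting point
for _ in range(200):
    x += 0.1 * single_particle_phi(x)   # plain gradient ascent on log p
```

After the loop, `x` sits at the mode of the target (here 2.0, which is the MAP estimate for this toy density), and no spread around the mode is represented.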

Usage

Use SVGD when:

  • The posterior is expected to be multimodal and parametric variational families are inadequate.
  • You want a non-parametric approximation with gradient-based efficiency.
  • You need an ensemble of posterior samples for uncertainty quantification.
  • The target distribution has a tractable gradient of the log-density (score function).
  • You want uncertainty estimates that are not constrained by the independence assumptions of mean-field VI.

Theoretical Basis

Stein's identity and the Stein operator:

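# Stein's identity: for a smooth function phi (decaying fast enough at the
# boundary of the support) and target density p:
# E_p[A_p phi(x)] = 0
# where A_p phi(x) = grad_x log p(x) * phi(x) + grad_x phi(x)
# (for vector-valued phi, read the product as an outer product and
#  take the trace of A_p phi to obtain a scalar)

# The Stein discrepancy measures how far q is from p:
# S(q, p) = max_{phi in F} (E_q[trace(A_p phi(x))])^2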

# When F is a unit ball in an RKHS with kernel k:
# The optimal phi* has a closed-form solution
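Stein's identity can be verified numerically in one dimension. The sketch below assumes p = N(0, 1) and the test function phi(x) = sin(x), and evaluates the expectation with a trapezoid rule on a fixed grid. The grid limits, grid size, and function names are illustrative assumptions.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def stein_expectation(q_mu, n_grid=20001, lo=-12.0, hi=12.0):
    """E_q[A_p phi(x)] for q = N(q_mu, 1), p = N(0, 1), phi(x) = sin(x)."""
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        x = lo + i * step
        score_p = -x                                   # d/dx log p(x) for N(0, 1)
        stein = score_p * math.sin(x) + math.cos(x)    # A_p phi(x)
        weight = 0.5 if i in (0, n_grid - 1) else 1.0  # trapezoid end weights
        total += weight * stein * normal_pdf(x, q_mu) * step
    return total

under_p = stein_expectation(0.0)   # expectation taken under the target itself
under_q = stein_expectation(1.0)   # expectation under a shifted distribution
```

`under_p` vanishes up to quadrature error, as the identity states, while `under_q` is clearly nonzero. That nonzero value is the signal the Stein discrepancy uses to measure the mismatch between q and p.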

SVGD update rule:

# Given n particles {x_1, ..., x_n} and kernel k(x, x'):
# Update each particle:

# phi*(x_i) = (1/n) * sum_{j=1}^{n} [
#     k(x_j, x_i) * grad_{x_j} log p(x_j)    # driving force
#   + grad_{x_j} k(x_j, x_i)                   # repulsive force
# ]

# x_i <- x_i + epsilon * phi*(x_i)

# The first term: moves x_i toward high-probability regions
# (weighted average of score functions at all particles,
#  with higher weight for nearby particles)

# The second term: pushes x_i away from other particles
# (prevents collapse to a single mode)
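The update rule translates almost line by line into code. What follows is a minimal pure-Python sketch for scalar particles, assuming an RBF kernel with a fixed bandwidth `h`, a fixed step size `epsilon`, and an illustrative Gaussian target N(3, 1). It is meant to make the two forces concrete, not to serve as a reference implementation.

```python
import math

def svgd_step(particles, score, h, epsilon):
    """One SVGD update for scalar particles with an RBF kernel of bandwidth h."""
    n = len(particles)
    updated = []
    for x_i in particles:
        phi = 0.0
        for x_j in particles:
            k = math.exp(-((x_j - x_i) ** 2) / (2.0 * h * h))
            drive = k * score(x_j)               # k(x_j, x_i) * grad log p(x_j)
            repel = -(x_j - x_i) / (h * h) * k   # grad_{x_j} k(x_j, x_i)
            phi += drive + repel
        updated.append(x_i + epsilon * phi / n)
    return updated

def gaussian_score(x, mu=3.0, sigma=1.0):
    """Score of the illustrative target N(mu, sigma^2)."""
    return -(x - mu) / (sigma * sigma)

# Deterministic initialization, far from the target mean.
particles = [-1.0 + 0.1 * i for i in range(20)]
for _ in range(2000):
    particles = svgd_step(particles, gaussian_score, h=0.5, epsilon=0.1)

mean = sum(particles) / len(particles)
var = sum((x - mean) ** 2 for x in particles) / len(particles)
```

The particle mean moves to the target mean while the repulsive term keeps the particles spread out instead of collapsing onto the mode. In practice the bandwidth is usually re-estimated at every step and an adaptive optimizer replaces the fixed step size.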

Kernel choice:

# Common choice: RBF (Gaussian) kernel
# k(x, x') = exp(-|x - x'|^2 / (2 * h^2))

# Bandwidth h is typically set via the median heuristic:
# h = median({|x_i - x_j| : i != j}) / sqrt(2 * log(n))

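# Gradient in the first argument (the form used in the update rule):
# grad_x k(x, x') = -(x - x') / h^2 * k(x, x')
# By symmetry, grad_{x'} k(x, x') = +(x - x') / h^2 * k(x, x')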
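A short sketch of the median heuristic and a finite-difference check of the kernel gradient, assuming scalar particles and at least two of them (log(1) = 0 would make the bandwidth undefined). The particle values and helper names are illustrative.

```python
import math
from statistics import median

def median_bandwidth(particles):
    """Median heuristic: h = median pairwise distance / sqrt(2 log n)."""
    n = len(particles)
    dists = [abs(particles[i] - particles[j])
             for i in range(n) for j in range(n) if i != j]
    return median(dists) / math.sqrt(2.0 * math.log(n))

def rbf(x, y, h):
    """RBF kernel k(x, y) = exp(-(x - y)^2 / (2 h^2))."""
    return math.exp(-((x - y) ** 2) / (2.0 * h * h))

def grad_rbf_first(x, y, h):
    """Analytic derivative of k(x, y) with respect to the first argument x."""
    return -(x - y) / (h * h) * rbf(x, y, h)

particles = [0.0, 0.5, 1.5, 3.0, 4.0]
h = median_bandwidth(particles)

# Compare the analytic gradient with a central finite difference.
x, y, delta = 1.5, 0.5, 1e-6
finite_diff = (rbf(x + delta, y, h) - rbf(x - delta, y, h)) / (2.0 * delta)
analytic = grad_rbf_first(x, y, h)
```

The gradient with respect to the first argument is negative when x > y, so moving x further from y lowers the kernel value. The sign flips for the gradient with respect to the second argument.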

Theoretical guarantee:

# SVGD performs steepest descent in the KL divergence:
# The update phi* minimizes:
# d/dt KL(q_t || p) at t=0
# where q_t is the distribution of particles after perturbation x -> x + t*phi(x)

# Specifically:
# d/dt KL(q_t || p)|_{t=0} = -E_q[trace(A_p phi(x))]
# (in one dimension the trace is just the scalar A_p phi(x))
# phi* = argmax_{||phi||_H <= 1} {-d/dt KL(q_t || p)}
#      = E_q[k(., x) grad_x log p(x) + grad_x k(., x)] / ||...||_H
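The derivative formula can be checked in closed form for a simple case. The sketch below assumes q = N(m, s^2), p = N(0, 1), and a linear perturbation direction phi(x) = a + b * x, for which q_t stays Gaussian and the KL divergence is available analytically. The specific numbers are arbitrary illustrations.

```python
import math

def kl_gauss_to_std_normal(m, s):
    """KL(N(m, s^2) || N(0, 1)) in closed form."""
    return 0.5 * (s * s + m * m - 1.0 - math.log(s * s))

def kl_after_perturbation(t, m, s, a, b):
    """KL(q_t || p), where q = N(m, s^2) is pushed through x -> x + t * (a + b * x)."""
    # The map is affine, so q_t remains Gaussian (valid while 1 + t * b > 0).
    m_t = (1.0 + t * b) * m + t * a
    s_t = (1.0 + t * b) * s
    return kl_gauss_to_std_normal(m_t, s_t)

def expected_stein(m, s, a, b):
    """E_q[A_p phi] for p = N(0, 1): E_q[-x * (a + b * x) + b]."""
    return -a * m - b * (m * m + s * s) + b

m, s, a, b = 1.5, 0.7, -0.4, 0.3
t = 1e-6
# Central finite difference of KL(q_t || p) at t = 0.
finite_diff = (kl_after_perturbation(t, m, s, a, b)
               - kl_after_perturbation(-t, m, s, a, b)) / (2.0 * t)
predicted = -expected_stein(m, s, a, b)
```

The finite-difference slope agrees with `-E_q[A_p phi]`. Choosing phi to maximize the Stein expectation therefore gives the direction that decreases the KL divergence fastest, which is what the SVGD update exploits.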
