Principle:Pyro ppl Pyro MCMC Convergence Diagnostics
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Repo (Pyro) |
| Domains | MCMC, Bayesian_Inference, Statistics |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Diagnosing MCMC convergence using statistical measures including effective sample size (ESS), split Gelman-Rubin R-hat, and divergence counts to assess whether the Markov chain has adequately explored the posterior distribution.
Description
MCMC convergence diagnostics are statistical tools that help determine whether an MCMC sampler has run long enough and is functioning correctly. Because MCMC produces correlated samples, and because chains may fail to converge for various reasons (poor initialization, difficult geometry, insufficient warmup), diagnostics are essential for validating inference results.
Effective Sample Size (ESS)
The effective sample size measures how many independent samples an MCMC chain is equivalent to, accounting for autocorrelation between successive samples. For a chain of length N with autocorrelation function rho(k) at lag k:
ESS = N / (1 + 2 * sum_{k=1}^{inf} rho(k))
In practice, the autocorrelation sum is truncated using the initial positive sequence estimator or a similar method. Key properties of ESS:
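- ESS is usually below N, because MCMC samples are typically positively correlated. It can exceed N when successive samples are negatively correlated, which sometimes happens with HMC and NUTS.
- Low ESS (relative to N) indicates high autocorrelation and poor mixing: the chain is moving slowly through the parameter space.
- ESS / N close to 1 indicates nearly independent samples, suggesting efficient sampling.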
- Rule of thumb: ESS > 100 per chain is often considered a minimum for reliable posterior summaries.
ESS should be computed independently for each parameter, as different parameters may mix at different rates.
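As a rough illustration of the ESS formula above, the following sketch estimates autocorrelations from a single chain and truncates the sum at the first non-positive estimate. That truncation is a simplification of the initial positive sequence estimator, chosen here for brevity, so this version never reports an ESS above N. The helper names (`autocorrelation`, `effective_sample_size`, `ar1_chain`) are hypothetical and are not Pyro's API.

```python
import random


def autocorrelation(chain, lag):
    """Sample autocorrelation of `chain` at `lag` (biased, 1/N-normalised estimator)."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag)) / n
    return cov / var


def effective_sample_size(chain):
    """ESS = N / (1 + 2 * sum_k rho(k)), summing until the first non-positive rho(k).

    Assumption: this truncation rule is a simplification used for illustration.
    """
    n = len(chain)
    rho_sum = 0.0
    for lag in range(1, n):
        rho = autocorrelation(chain, lag)
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)


def ar1_chain(n, phi, seed):
    """Synthetic AR(1) chain x_t = phi * x_{t-1} + noise; phi = 0 gives independent draws."""
    rng = random.Random(seed)
    x = 0.0
    chain = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


independent = ar1_chain(2000, phi=0.0, seed=1)
correlated = ar1_chain(2000, phi=0.9, seed=1)
ess_iid = effective_sample_size(independent)
ess_ar1 = effective_sample_size(correlated)
print(f"ESS (phi=0.0): {ess_iid:.0f} of {len(independent)}")
print(f"ESS (phi=0.9): {ess_ar1:.0f} of {len(correlated)}")
```

For an AR(1) chain with phi = 0.9 the theoretical integrated autocorrelation time is (1 + 0.9) / (1 - 0.9) = 19, so the estimate for the correlated chain should land far below N, while the phi = 0 chain should stay close to N.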
Split Gelman-Rubin R-hat
The R-hat statistic (also called the potential scale reduction factor) compares within-chain and between-chain variance to assess convergence. The split variant improves the original Gelman-Rubin diagnostic by splitting each chain in half before computing the statistic, which helps detect non-stationarity within individual chains.
The computation proceeds as follows:
- Split each of M chains of length N into two halves, yielding 2M chains of length N/2.
- Compute the within-chain variance W (average of the variances within each half-chain).
- Compute the between-chain variance B (variance of the half-chain means, scaled by N/2).
- Estimate the marginal posterior variance as V_hat = ((N/2 - 1) / (N/2)) * W + (1 / (N/2)) * B.
- Compute R-hat = sqrt(V_hat / W).
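The steps above can be sketched in plain Python as follows. This is a minimal illustration that assumes equal-length chains stored as lists of floats. The function name `split_rhat` and the synthetic test chains are hypothetical, and the code does not call Pyro's own implementation.

```python
import math
import random
from statistics import mean, variance


def split_rhat(chains):
    """Split R-hat for a list of equal-length chains (each a list of floats)."""
    n = len(chains[0]) // 2
    # Step 1: split every chain into a first and a second half
    # (the middle draw is dropped when the chain length is odd).
    halves = []
    for chain in chains:
        halves.append(chain[:n])
        halves.append(chain[-n:])
    # Step 2: W is the average of the sample variances within each half-chain.
    w = mean(variance(h) for h in halves)
    # Step 3: B is the sample variance of the half-chain means, scaled by n = N/2.
    b = n * variance([mean(h) for h in halves])
    # Step 4: pooled estimate of the marginal posterior variance.
    v_hat = (n - 1) / n * w + b / n
    # Step 5: potential scale reduction factor.
    return math.sqrt(v_hat / w)


rng = random.Random(0)
# Four chains drawn from the same distribution.
good = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# Four chains where every second chain is centred at 3 instead of 0.
shifted = [[rng.gauss(3 * (i % 2), 1) for _ in range(1000)] for i in range(4)]
# One chain with a slow upward drift (non-stationary within the chain).
drift = [[5 * t / 1000 + rng.gauss(0, 1) for t in range(1000)]]

rhat_good = split_rhat(good)
rhat_shifted = split_rhat(shifted)
rhat_drift = split_rhat(drift)
print(f"same target: {rhat_good:.3f}")
print(f"shifted chains: {rhat_shifted:.3f}")
print(f"drifting single chain: {rhat_drift:.3f}")
```

The drifting single chain shows why the split matters: its two halves have different means, so the statistic rises well above 1 even though only one chain was run.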
Interpretation:
- R-hat close to 1.0 (e.g., < 1.01) is consistent with all chains having converged to the same distribution. It is a necessary check, not proof of convergence.
- R-hat > 1.1 suggests that the chains have not yet converged and that longer chains or a different parameterization are needed.
- R-hat >> 1 indicates severe convergence problems (chains exploring different regions of the parameter space).
Divergences
Divergent transitions are a diagnostic specific to HMC-based samplers (HMC and NUTS). A divergence occurs when the leapfrog integrator encounters a region of the parameter space where the potential energy surface is too steep for the current step size, causing the numerical integration to become unstable.
Divergences indicate:
- Problematic posterior geometry: Funnels, sharp ridges, or other pathological features that the sampler cannot navigate.
- Insufficient step size adaptation: The step size may be too large for certain regions of the posterior.
- Model specification issues: The model may benefit from reparameterization (e.g., non-centered parameterization for hierarchical models).
Even a small number of divergences after warmup (any count above zero) is a cause for concern, as they indicate that the sampler may be failing to explore certain regions of the posterior.
Usage
MCMC convergence diagnostics should be checked after every MCMC run:
- Routine validation: Always check ESS, R-hat, and divergence counts before using MCMC results for downstream analysis.
- Multi-chain comparison: Running several chains (commonly 2 to 4, or more) enables meaningful R-hat computation. Single-chain R-hat (using split halves) provides a weaker but still useful check.
- Identifying problematic parameters: Parameters with low ESS or high R-hat may require model reparameterization or longer chains.
- Detecting sampler failures: Divergences indicate that the sampler is not reliably exploring the posterior and the results should not be trusted without further investigation.
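A routine check along these lines might look like the sketch below. The dictionary layout (`"ess"` and `"r_hat"` per parameter), the function name `flag_problems`, and the example numbers are all hypothetical and chosen only for illustration. The default thresholds follow the rules of thumb quoted on this page. Pyro ships its own diagnostics (for example in `pyro.ops.stats`), and the sketch does not call them.

```python
def flag_problems(diagnostics, num_divergences, min_ess=100.0, max_rhat=1.01):
    """Return a list of warning strings; an empty list means no check was tripped.

    `diagnostics` maps parameter name -> {"ess": float, "r_hat": float}
    (a hypothetical layout, not a Pyro return type).
    """
    messages = []
    for name, stats in diagnostics.items():
        if stats["ess"] < min_ess:
            messages.append(f"{name}: ESS {stats['ess']:.0f} is below {min_ess:.0f}")
        if stats["r_hat"] > max_rhat:
            messages.append(f"{name}: R-hat {stats['r_hat']:.3f} is above {max_rhat}")
    if num_divergences > 0:
        messages.append(f"{num_divergences} divergent transition(s) after warmup")
    return messages


# Made-up numbers for two parameters of an imaginary hierarchical model.
example = {
    "mu": {"ess": 850.0, "r_hat": 1.002},
    "tau": {"ess": 42.0, "r_hat": 1.08},
}
messages = flag_problems(example, num_divergences=3)
for line in messages:
    print(line)
```

Here `mu` passes both checks, while `tau` is flagged twice and the divergences add a third message, which points to `tau` as the first candidate for reparameterization.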
Theoretical Basis
Autocorrelation and Mixing
The autocorrelation function rho(k) = Corr(theta_t, theta_{t+k}) characterizes how quickly a chain "forgets" its current state. For a well-mixing chain, autocorrelation decays rapidly with lag k. The integrated autocorrelation time tau = 1 + 2 * sum rho(k) determines the ESS via ESS = N / tau.
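As a worked example, consider a stationary AR(1) chain with coefficient phi between 0 and 1, a standard toy model that is assumed here for illustration and does not come from Pyro. Its autocorrelation is rho(k) = phi^k, so the integrated autocorrelation time has a closed form:

```latex
\tau \;=\; 1 + 2\sum_{k=1}^{\infty} \rho(k)
     \;=\; 1 + 2\sum_{k=1}^{\infty} \phi^{k}
     \;=\; 1 + \frac{2\phi}{1-\phi}
     \;=\; \frac{1+\phi}{1-\phi},
\qquad
\mathrm{ESS} \;=\; \frac{N}{\tau} \;=\; N\,\frac{1-\phi}{1+\phi}.
```

With phi = 0.9 this gives tau = 19, so a chain of N = 1900 draws carries roughly the information of 100 independent samples.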
Variance Decomposition
The R-hat statistic is motivated by the decomposition of variance into within-chain and between-chain components. If all chains have converged to the stationary distribution:
- Within-chain variance W should be an approximately unbiased estimate of the posterior variance.
- The variance of the chain means, B / (N/2) for half-chains of length N/2, should be small relative to W (all chain means should be approximately equal).
The ratio V_hat / W converges to 1 as the chains converge, giving R-hat approaching 1.
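Writing n = N/2 for the half-chain length used in the split computation, the ratio can be expanded to make this limit explicit:

```latex
\hat{R}
\;=\; \sqrt{\frac{\hat{V}}{W}}
\;=\; \sqrt{\frac{\frac{n-1}{n}\,W + \frac{1}{n}\,B}{W}}
\;=\; \sqrt{\frac{n-1}{n} + \frac{B/n}{W}} .
```

The first term tends to 1 as n grows, and the second term is the variance of the half-chain means relative to the within-chain variance. When the chains agree, that second term is small and R-hat is close to 1. When the chain means disagree, it stays large no matter how long each chain runs.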
Numerical Divergence Theory
In HMC, the leapfrog integrator preserves the Hamiltonian only approximately: the local error is of order O(epsilon^3) per step, which accumulates to O(epsilon^2) over a trajectory of fixed length. When the potential energy has high local curvature (large second derivatives) relative to the step size, the integrator becomes unstable, the error grows rapidly, and the trajectory can diverge to regions of extremely high energy. The sampler detects this when the energy change exceeds a threshold, and these transitions are flagged as divergences.
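The effect can be reproduced on a one-dimensional quadratic potential U(q) = curvature * q^2 / 2 with unit mass, a toy setting assumed here for illustration. The sketch runs a plain leapfrog integrator and records the largest energy error along the trajectory. The function name `leapfrog_energy_error` and the threshold value of 1000 are hypothetical choices for this example, not values read from Pyro.

```python
def leapfrog_energy_error(curvature, step_size, num_steps, q=1.0, p=0.0):
    """Largest |H - H0| along a leapfrog trajectory for U(q) = curvature * q**2 / 2."""

    def energy(q, p):
        return 0.5 * curvature * q * q + 0.5 * p * p

    h0 = energy(q, p)
    worst = 0.0
    for _ in range(num_steps):
        p -= 0.5 * step_size * curvature * q  # half step for momentum (grad U = curvature * q)
        q += step_size * p                    # full step for position
        p -= 0.5 * step_size * curvature * q  # second half step for momentum
        worst = max(worst, abs(energy(q, p) - h0))
    return worst


DIVERGENCE_THRESHOLD = 1000.0  # assumed illustrative threshold on the energy error

# Gentle curvature: the energy error stays small and shrinks with the step size.
err_full = leapfrog_energy_error(curvature=1.0, step_size=0.1, num_steps=30)
err_half = leapfrog_energy_error(curvature=1.0, step_size=0.05, num_steps=60)
# Same step size on a much steeper potential: the integration becomes unstable.
err_stiff = leapfrog_energy_error(curvature=1000.0, step_size=0.1, num_steps=30)

print(f"curvature 1, step 0.10: {err_full:.2e}")
print(f"curvature 1, step 0.05: {err_half:.2e}")
print(f"curvature 1000, step 0.10: {err_stiff:.2e}")
print("flagged as divergent:", err_stiff > DIVERGENCE_THRESHOLD)
```

For this potential the leapfrog update is stable only while step_size * sqrt(curvature) stays below 2. The third run violates that bound, which is why the same step size that works on the gentle potential produces an exploding energy error on the steep one.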