Principle:Pyro ppl Pyro Online Statistics
| Knowledge Sources | |
|---|---|
| Domains | Online Algorithms, Statistical Computing, Streaming Data |
| Last Updated | 2026-02-09 09:00 GMT |
Overview
Online (streaming) algorithms compute running statistics such as mean, variance, and covariance incrementally as data arrives, using constant memory and avoiding the need to store all past observations.
Description
In many settings -- particularly during MCMC sampling and online learning -- statistics must be computed incrementally as new data points arrive. Storing all observations to recompute statistics from scratch is impractical due to memory constraints and computational cost.
Welford's algorithm provides a numerically stable online algorithm for computing the running mean and variance. Unlike the naive algorithm (which computes mean and sum-of-squares separately), Welford's algorithm maintains a running corrected sum of squares that avoids catastrophic cancellation.
The algorithm extends naturally to:
Online covariance estimation (Welford covariance): Maintains a running estimate of the full covariance matrix, updated with each new observation in O(d^2) time for d-dimensional data. This is essential for:
- Adapting the mass matrix in HMC during warmup.
- Estimating posterior covariance for diagnostics.
- Building proposal distributions in adaptive MCMC.
Streaming statistics: More general online algorithms that maintain running estimates of various summary statistics:
- Mean and variance (Welford).
- Quantiles (approximate, using t-digest or similar).
- Higher moments (skewness, kurtosis).
- Exponentially weighted moving averages for non-stationary data.
The key properties of these algorithms are:
- Constant memory: O(d^2) for d-dimensional covariance, independent of the number of observations.
- Single pass: Each observation is processed once and discarded.
- Numerical stability: Avoid the catastrophic cancellation that plagues naive implementations.
- Exact equivalence: The running statistics are exactly equal to what batch computation would produce (for mean and variance).
Usage
Use online statistics when:
- Computing running mean and covariance during MCMC warmup for mass matrix adaptation.
- Monitoring convergence of MCMC chains in real-time.
- Processing streaming data where storing all observations is impractical.
- Implementing adaptive algorithms that adjust based on accumulated statistics.
Theoretical Basis
Welford's online mean and variance:
# Initialize:
# n = 0, mean = 0, M2 = 0
# Update with new observation x:
# n = n + 1
# delta = x - mean
# mean = mean + delta / n
# delta2 = x - mean # note: using updated mean
# M2 = M2 + delta * delta2
# Retrieve:
# variance = M2 / (n - 1) # sample variance (unbiased)
# population_variance = M2 / n # population variance
# Why stable: avoids computing sum(x_i^2) - n*mean^2
# which suffers from catastrophic cancellation when mean >> std
Welford's online covariance (multivariate):
# Initialize:
# n = 0, mean = zeros(d), C = zeros(d, d)
# Update with new observation x (d-dimensional vector):
# n = n + 1
# delta = x - mean
# mean = mean + delta / n
# delta2 = x - mean # using updated mean
# C = C + outer(delta, delta2)
# Retrieve:
# covariance = C / (n - 1) # sample covariance matrix
# Properties:
# - O(d^2) time per update
# - O(d^2) memory total
# - Numerically stable
# - Exactly equals batch covariance for any n
Combining running statistics (parallel/distributed):
# Given two sets of running statistics:
# Set A: n_a, mean_a, M2_a
# Set B: n_b, mean_b, M2_b
# Combined statistics:
# n = n_a + n_b
# delta = mean_b - mean_a
# mean = mean_a + delta * n_b / n
# M2 = M2_a + M2_b + delta^2 * n_a * n_b / n
# This enables parallel computation and chain combination
Application to MCMC mass matrix adaptation:
# During HMC warmup:
# 1. Run initial samples with identity mass matrix
# 2. Use Welford covariance to estimate posterior covariance C
# 3. Set mass matrix M = C (or M = diag(C) for diagonal adaptation)
# 4. Continue sampling with adapted mass matrix
# 5. Repeat adaptation in windows of increasing length
# The mass matrix preconditions the Hamiltonian dynamics:
# K(p) = 0.5 * p^T M^{-1} p
# Good M approx posterior covariance => isotropic energy landscape