Principle:Online ml River Online Probability Distributions
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Probability_Theory |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Online probability distributions are incrementally estimated parametric distributions that update their parameters as each new observation arrives. Rather than fitting a distribution to a complete dataset, these objects consume data one point at a time, maintaining sufficient statistics that allow exact or approximate parameter estimation at any point in the stream.
They serve as building blocks for many online learning algorithms, including Naive Bayes classifiers, Bayesian changepoint detectors, and probabilistic anomaly detectors.
Theoretical Basis
Sufficient Statistics
A key property enabling online estimation is that many common distributions have fixed-dimensional sufficient statistics: a fixed-size summary that captures all information in the data relevant to the distribution's parameters, no matter how many observations have been seen. For example:
- The Gaussian distribution requires only the count n, the running sum, and the running sum of squares.
- The multinomial distribution requires only the count for each category.
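As a minimal sketch of the Gaussian case, the snippet below stores only the three summaries (count, sum, sum of squares) and recovers the mean and sample variance from them. The class name `GaussianSuffStats` and its methods are hypothetical and not taken from any library. The direct formula is shown only to illustrate the idea; it can lose precision when the mean is large relative to the spread.

```python
class GaussianSuffStats:
    """Hypothetical helper: fixed-size summary of a stream for a Gaussian."""

    def __init__(self):
        self.n = 0      # number of observations seen so far
        self.s1 = 0.0   # running sum of x
        self.s2 = 0.0   # running sum of x squared

    def update(self, x):
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def mean(self):
        return self.s1 / self.n if self.n else 0.0

    def variance(self):
        # Unbiased sample variance; returns 0.0 until two points are seen.
        if self.n < 2:
            return 0.0
        return (self.s2 - self.s1 * self.s1 / self.n) / (self.n - 1)


stats = GaussianSuffStats()
for x in [2.0, 4.0, 4.0, 6.0]:
    stats.update(x)
print(stats.mean(), stats.variance())
```

Memory use stays constant however long the stream is, because each raw observation is discarded after it updates the three numbers.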
Gaussian Distribution
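The online Gaussian maintains running estimates of the mean mu and variance sigma^2 using Welford's algorithm, which tracks the count n, the running mean mu, and the sum of squared deviations from the mean M2. Starting from n = 0, mu = 0, and M2 = 0, each new observation x triggers:
n <- n + 1
delta <- x - mu
mu <- mu + delta / n
M2 <- M2 + delta * (x - mu)
sigma^2 <- M2 / (n - 1) (defined for n >= 2)
The update is numerically stable because it never subtracts two large, nearly equal sums, and it requires O(1) memory and O(1) time per observation. Dividing by n - 1 yields the unbiased sample variance.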
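The update rules above can be sketched in Python as follows. This is an illustrative implementation under simple assumptions, not the code of any particular library: the class name `OnlineGaussian` is hypothetical, and returning 0.0 for the variance before two observations have arrived is a choice made here for convenience.

```python
class OnlineGaussian:
    """Hypothetical sketch of a Gaussian estimated with Welford's algorithm."""

    def __init__(self):
        self.n = 0      # observation count
        self.mu = 0.0   # running mean
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mu             # deviation from the old mean
        self.mu += delta / self.n
        self.m2 += delta * (x - self.mu)  # uses the updated mean

    @property
    def variance(self):
        # Sample variance; assumed to be 0.0 until n >= 2.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


gaussian = OnlineGaussian()
for x in [2.0, 4.0, 4.0, 6.0]:
    gaussian.update(x)
print(gaussian.mu, gaussian.variance)
```

Note that `delta` is computed against the mean before the update, while the second factor uses the mean after it. Mixing the two is what keeps M2 exact.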
Beta Distribution
The Beta distribution Beta(alpha, beta) models probabilities on [0, 1]. It is the conjugate prior for the Bernoulli likelihood: if the prior over the success probability is Beta, the posterior after each observation is also Beta, and online updates simply increment the parameters:
alpha <- alpha + x
beta <- beta + (1 - x)
where x = 1 for a success and x = 0 for a failure.
This conjugate prior relationship makes the Beta distribution ideal for online estimation of success probabilities.
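A minimal sketch of this Beta-Bernoulli update is shown below. The class name `OnlineBeta` is hypothetical, and the default Beta(1, 1) prior (uniform on [0, 1]) is an assumption made for the example rather than something the text prescribes.

```python
class OnlineBeta:
    """Hypothetical sketch of a Beta posterior over a success probability."""

    def __init__(self, alpha=1.0, beta=1.0):
        # Beta(1, 1) is the uniform prior; assumed here as a default.
        self.alpha = alpha
        self.beta = beta

    def update(self, x):
        # x is expected to be 1 (success) or 0 (failure).
        self.alpha += x
        self.beta += 1 - x

    def mean(self):
        # Posterior mean estimate of the success probability.
        return self.alpha / (self.alpha + self.beta)


posterior = OnlineBeta()
for outcome in [1, 0, 1, 1]:
    posterior.update(outcome)
print(posterior.alpha, posterior.beta, posterior.mean())
```

After three successes and one failure the posterior is Beta(4, 2), so the estimated success probability is 4 / 6.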
Multinomial Distribution
The multinomial distribution models counts over K categories. Online estimation updates the count vector:
counts[k] <- counts[k] + observed_count[k]
P(k) = counts[k] / sum(counts)
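Additive smoothing (e.g., Laplace) can be applied to prevent zero probabilities for categories that have not yet been observed: P(k) = (counts[k] + a) / (sum(counts) + a * K), with a = 1 for Laplace smoothing.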
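The count update and a smoothed probability estimate can be sketched as follows. The class name `OnlineMultinomial`, the `smoothing` parameter, and the requirement that the category set be known up front are all assumptions made for this illustration.

```python
from collections import Counter


class OnlineMultinomial:
    """Hypothetical sketch of category counts with additive smoothing."""

    def __init__(self, categories, smoothing=1.0):
        # Assumes a fixed, known set of K categories.
        self.categories = list(categories)
        self.smoothing = smoothing   # 1.0 corresponds to Laplace smoothing
        self.counts = Counter()
        self.total = 0

    def update(self, k, count=1):
        self.counts[k] += count
        self.total += count

    def prob(self, k):
        # (counts[k] + a) / (sum(counts) + a * K)
        num = self.counts[k] + self.smoothing
        den = self.total + self.smoothing * len(self.categories)
        return num / den


dist = OnlineMultinomial(["a", "b", "c"])
for k in ["a", "a", "b"]:
    dist.update(k)
print({k: dist.prob(k) for k in dist.categories})
```

Category "c" was never observed, yet it receives probability 1 / 6 rather than zero, and the three probabilities still sum to one.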
Applications
- Naive Bayes classifiers: Maintain one distribution per feature and class, and combine their likelihoods with the class prior to score each class.
- Anomaly detection: Flag observations with low probability under the estimated distribution.
- Bayesian updating: Maintain posterior beliefs about parameters that evolve with new evidence.
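To illustrate the anomaly detection use case, the sketch below scores each observation against a running Gaussian before folding it into the estimate. For a Gaussian, a large absolute z-score is equivalent to a low density, so the z-score stands in for "low probability" here. The function name `gaussian_anomaly_flags`, the threshold of 3 standard deviations, and the warm-up length are all hypothetical choices, not recommendations from the text.

```python
import math


def gaussian_anomaly_flags(stream, z_threshold=3.0, warmup=5):
    """Hypothetical sketch: flag points far from a running Gaussian estimate."""
    n, mu, m2 = 0, 0.0, 0.0
    flags = []
    for x in stream:
        is_anomaly = False
        # Score only after a warm-up period and once the variance is non-zero.
        if n >= warmup and m2 > 0:
            std = math.sqrt(m2 / (n - 1))
            is_anomaly = abs(x - mu) / std > z_threshold
        flags.append(is_anomaly)
        # Welford update; this sketch also learns from flagged points.
        n += 1
        delta = x - mu
        mu += delta / n
        m2 += delta * (x - mu)
    return flags


flags = gaussian_anomaly_flags([10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 25.0, 10.1])
print(flags)
```

Scoring before updating matters: if the point were absorbed first, an extreme value would inflate the variance and make itself look less unusual.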