Principle:Online ml River Online Probability Distributions

Knowledge Sources	Probability and Statistics, Bayesian Data Analysis
Domains	Online_Learning, Probability_Theory
Last Updated	2026-02-08 18:00 GMT

Overview

Online probability distributions are incrementally estimated parametric distributions that update their parameters as each new observation arrives. Rather than fitting a distribution to a complete dataset, these objects consume data one point at a time, maintaining sufficient statistics that allow exact or approximate parameter estimation at any point in the stream.

They serve as building blocks for many online learning algorithms, including Naive Bayes classifiers, Bayesian changepoint detectors, and probabilistic anomaly detectors.

Theoretical Basis

Sufficient Statistics

A key property enabling online estimation is that many common distributions have finite sufficient statistics -- a fixed-size summary that captures all information in the data relevant to the distribution's parameters. For example:

The Gaussian distribution requires only the count n, the running sum, and the running sum of squares.
The multinomial distribution requires only the count for each category.

Gaussian Distribution

The online Gaussian maintains running estimates of mean mu and variance sigma^2 using Welford's algorithm. Starting from n = 0, mu = 0, and M2 = 0, each new observation x triggers the following update:

n <- n + 1
delta <- x - mu
mu <- mu + delta / n
M2 <- M2 + delta * (x - mu)
sigma^2 <- M2 / (n - 1)    (sample variance; defined for n >= 2)

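This update avoids the catastrophic cancellation that can affect the naive sum-of-squares formula, and it requires O(1) memory and O(1) time per observation.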
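
The Welford update can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the code of any particular library: the class name `OnlineGaussian`, the choice to report a variance of 0.0 before two observations have arrived, and the `pdf` helper are all assumptions made for this example.

```python
import math


class OnlineGaussian:
    """Hypothetical online Gaussian estimator using Welford's update."""

    def __init__(self) -> None:
        self.n = 0      # number of observations
        self.mu = 0.0   # running mean
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mu              # deviation from the old mean
        self.mu += delta / self.n
        self.m2 += delta * (x - self.mu)  # uses the updated mean

    @property
    def variance(self) -> float:
        # Sample variance; assumed to be 0.0 until n >= 2.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def pdf(self, x: float) -> float:
        # Density under the current estimate; 0.0 if variance is degenerate.
        var = self.variance
        if var == 0.0:
            return 0.0
        return math.exp(-((x - self.mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


g = OnlineGaussian()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    g.update(x)
print(g.mu, g.variance)
```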

Beta Distribution

The Beta distribution Beta(alpha, beta) models probabilities on [0, 1]. In a Bayesian setting with a Bernoulli likelihood and a Beta prior, the posterior is also Beta, and each online update simply increments one of the two parameters, where x in {0, 1} is the observed outcome:

alpha <- alpha + x       (incremented when x = 1, a success)
beta  <- beta + (1 - x)  (incremented when x = 0, a failure)

This conjugate prior relationship makes the Beta distribution ideal for online estimation of success probabilities.
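
The Beta-Bernoulli update can be sketched as below. The class name `OnlineBeta` and the uniform prior Beta(1, 1) used as the default are assumptions made for illustration; any positive prior parameters would work the same way.

```python
class OnlineBeta:
    """Hypothetical Beta posterior over a Bernoulli success probability."""

    def __init__(self, alpha: float = 1.0, beta: float = 1.0) -> None:
        # Prior pseudo-counts; Beta(1, 1) is the uniform prior (an assumption).
        self.alpha = alpha
        self.beta = beta

    def update(self, x: int) -> None:
        # x must be 0 (failure) or 1 (success).
        if x not in (0, 1):
            raise ValueError("x must be 0 or 1")
        self.alpha += x
        self.beta += 1 - x

    @property
    def mean(self) -> float:
        # Posterior mean estimate of the success probability.
        return self.alpha / (self.alpha + self.beta)


posterior = OnlineBeta()
for outcome in [1, 1, 0, 1]:
    posterior.update(outcome)
print(posterior.alpha, posterior.beta, posterior.mean)
```

After three successes and one failure the posterior in this sketch is Beta(4, 2), so the prior pseudo-counts still pull the estimate slightly toward 0.5 relative to the raw success rate of 0.75.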

Multinomial Distribution

The multinomial distribution models counts over K categories. Online estimation updates the count vector:

counts[k] <- counts[k] + observed_count[k]
P(k) = counts[k] / sum(counts)

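Additive smoothing (e.g., Laplace smoothing, which adds a pseudocount to every category) can be applied so that categories not yet observed do not receive zero probability.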
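
A minimal sketch of the count update with optional additive smoothing follows. The class name `OnlineMultinomial`, the requirement that the category set be known up front, and the `smoothing` parameter are assumptions for this example. With smoothing value a and K categories, the estimate becomes (counts[k] + a) / (sum(counts) + a * K).

```python
from collections import Counter


class OnlineMultinomial:
    """Hypothetical online categorical estimator with additive smoothing."""

    def __init__(self, categories, smoothing: float = 0.0) -> None:
        # Assumes the K categories are known in advance.
        self.categories = list(categories)
        self.smoothing = smoothing
        self.counts = Counter()
        self.total = 0

    def update(self, k, count: int = 1) -> None:
        self.counts[k] += count
        self.total += count

    def prob(self, k) -> float:
        denom = self.total + self.smoothing * len(self.categories)
        if denom == 0:
            return 0.0  # nothing observed and no smoothing
        return (self.counts[k] + self.smoothing) / denom


dist = OnlineMultinomial(["a", "b", "c"], smoothing=1.0)
for k in ["a", "a", "b"]:
    dist.update(k)
print({k: dist.prob(k) for k in dist.categories})
```

With smoothing set to 1.0 (Laplace), the unseen category "c" in this example gets probability 1/6 rather than 0.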

Applications

Naive Bayes classifiers: Use per-class distributions for each feature.
Anomaly detection: Flag observations with low probability under the estimated distribution.
Bayesian updating: Maintain posterior beliefs about parameters that evolve with new evidence.
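
To make the anomaly detection use case concrete, the sketch below scores each observation against a running Gaussian estimate before folding it into that estimate. Flagging by a z-score threshold is one simple way of expressing "low probability under the estimated distribution" for a Gaussian. The function name `flag_anomalies` and the `threshold` and `warmup` parameters are hypothetical choices for this example.

```python
import math


def flag_anomalies(stream, threshold: float = 3.0, warmup: int = 5):
    """Return one boolean per observation: True if it looks anomalous.

    Each point is scored with the estimate built from earlier points only,
    then used to update the running mean and M2 (Welford's algorithm).
    """
    n, mu, m2 = 0, 0.0, 0.0
    flags = []
    for x in stream:
        is_anomaly = False
        # Only score once enough data has arrived and the variance is nonzero.
        if n >= warmup and m2 > 0.0:
            std = math.sqrt(m2 / (n - 1))
            is_anomaly = abs(x - mu) / std > threshold
        flags.append(is_anomaly)
        # Welford update (in this sketch, anomalies are still absorbed).
        n += 1
        delta = x - mu
        mu += delta / n
        m2 += delta * (x - mu)
    return flags


readings = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 25.0, 10.1]
print(flag_anomalies(readings))
```

One design choice worth noting: this sketch updates the estimate even with flagged points, which inflates the variance after an outlier. Skipping the update for flagged points is a possible alternative, at the cost of adapting more slowly to genuine shifts in the data.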
