Principle:Online ml River Streaming Statistics
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Descriptive_Statistics |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
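Streaming statistics are incrementally computed summary measures that process each data point exactly once and maintain a bounded state that does not grow with stream length. They cover the same descriptive measures as batch statistics (mean, variance, quantiles, correlations) but compute them in a single pass. Moment-based statistics such as the mean and variance need only O(1) memory and match their batch counterparts up to floating-point rounding; quantile estimators return approximations, and statistics built on frequency counts or lag buffers need memory proportional to the number of distinct values or to the window length.
These statistics are fundamental building blocks for online learning systems, serving as feature preprocessors, drift detectors, monitoring tools, and components within larger models.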
Theoretical Basis
First-Order Statistics
- Count: Tracks the number of observations seen: n <- n + 1.
- Sum: Running total: S <- S + x.
- Mean: Computed as S/n or via Welford's incremental update: mu <- mu + (x - mu) / n.
- Exponentially weighted mean (EWMean): mu_t = alpha * x_t + (1 - alpha) * mu_{t-1}, giving more weight to recent values.
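The update rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not River's implementation: the class names `RunningMean` and `ExpWeightedMean` are hypothetical, and seeding the exponentially weighted mean with the first observation is an assumption.

```python
class RunningMean:
    """Incremental mean: mu <- mu + (x - mu) / n."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self


class ExpWeightedMean:
    """Exponentially weighted mean: mu_t = alpha * x_t + (1 - alpha) * mu_{t-1}."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.mean = None  # no estimate until the first value arrives

    def update(self, x):
        if self.mean is None:
            # Assumption: seed the estimate with the first observation.
            self.mean = float(x)
        else:
            self.mean = self.alpha * x + (1 - self.alpha) * self.mean
        return self
```

A larger alpha makes the exponentially weighted mean react faster to recent values, at the cost of a noisier estimate.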
Second-Order Statistics
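- Variance: Uses Welford's algorithm for numerical stability. For each new value x:
n <- n + 1
delta = x - mu
mu <- mu + delta / n
M2 <- M2 + delta * (x - mu)
variance = M2 / (n - 1)
Here M2 is the running sum of squared deviations from the mean, and dividing by n - 1 gives the sample variance, which is defined for n >= 2.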
- Exponentially weighted variance (EWVar): Applies exponential weighting to track non-stationary variance.
- Covariance: Incrementally updated between two variables with a single-pass rule analogous to Welford's, which tracks both running means and a running sum of cross-deviations.
- Pearson correlation: Computed from online covariance and standard deviations: r = Cov(X,Y) / (sigma_X * sigma_Y).
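As a rough sketch of how variance, covariance, and Pearson correlation can share one single-pass state, the hypothetical class `RunningCov` below applies Welford's update to two variables at once. The names and the choice to return 0.0 for undefined cases are assumptions made for illustration, not River's API.

```python
import math


class RunningCov:
    """Single-pass variance, covariance, and Pearson correlation for pairs (x, y)."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0  # running sum of squared deviations of x
        self.m2_y = 0.0  # running sum of squared deviations of y
        self.c = 0.0     # running sum of cross-deviations

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x          # deviation from the old mean of x
        dy = y - self.mean_y          # deviation from the old mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c += dx * (y - self.mean_y)  # old-mean x times new-mean y deviation
        return self

    def var_x(self):
        # Sample variance; 0.0 for fewer than two points is an assumption.
        return self.m2_x / (self.n - 1) if self.n > 1 else 0.0

    def cov(self):
        return self.c / (self.n - 1) if self.n > 1 else 0.0

    def pearson(self):
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.c / denom if denom > 0 else 0.0
```

The n - 1 factors cancel in the correlation, so `pearson` can work directly on the unnormalized sums.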
Higher-Order Statistics
- Skewness: Measures asymmetry of the distribution. Updated via running third central moment.
- Kurtosis: Measures tail heaviness. Updated via running fourth central moment.
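One common way to maintain the third and fourth central moments in a single pass is the standard one-pass moment recurrence sketched below. The class name `RunningMoments` is hypothetical, and the code returns the population (biased) skewness and excess kurtosis; whether a given library applies a bias correction is not stated here and should be checked.

```python
import math


class RunningMoments:
    """Single-pass central moment sums M2, M3, M4 for skewness and kurtosis."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.m3 = 0.0
        self.m4 = 0.0

    def update(self, x):
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Higher moments must be updated before the lower ones they depend on.
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2
                    - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1
        return self

    def skewness(self):
        # Population skewness; 0.0 when the variance is zero is an assumption.
        if self.m2 == 0:
            return 0.0
        return math.sqrt(self.n) * self.m3 / self.m2 ** 1.5

    def kurtosis(self):
        # Population excess kurtosis (a normal distribution gives 0).
        if self.m2 == 0:
            return 0.0
        return self.n * self.m4 / (self.m2 * self.m2) - 3.0
```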
Order Statistics and Quantiles
- Minimum / Maximum: Trivially maintained by comparison with each new value.
- Peak-to-peak: max - min, the range of observed values.
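- Quantile: Approximated using algorithms such as P-squared (deterministic, fixed state of five markers) or t-digest (merge-friendly). These maintain a compact representation that answers quantile queries approximately; the accuracy depends on the algorithm and the data distribution, and P-squared in particular is a heuristic without a formal error guarantee.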
- IQR (Interquartile range): Q3 - Q1, computed from streaming quantile estimators.
- MAD (Median absolute deviation): A robust measure of spread, approximated via streaming median estimation.
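To make the five-marker idea behind P-squared concrete, here is a compact sketch written from the published description of the algorithm. It is an illustration under assumptions, not River's code: the class name `P2Quantile`, the nearest-rank fallback for fewer than five observations, and the demo data are all hypothetical choices.

```python
import random


class P2Quantile:
    """Sketch of the P-squared estimator for a single quantile p."""

    def __init__(self, p):
        self.p = p
        self.q = []                      # marker heights (sorted)
        self.pos = [1, 2, 3, 4, 5]       # actual marker positions
        self.want = [1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5]  # desired positions
        self.step = [0, p / 2, p, (1 + p) / 2, 1]            # desired increments

    def update(self, x):
        q, pos = self.q, self.pos
        if len(q) < 5:                   # warm-up: store the first five values
            q.append(x)
            q.sort()
            return self
        if x < q[0]:
            q[0] = x
            k = 0
        elif x >= q[4]:
            q[4] = x
            k = 3
        else:                            # cell k such that q[k] <= x < q[k+1]
            k = max(i for i in range(4) if q[i] <= x)
        for i in range(k + 1, 5):
            pos[i] += 1
        for i in range(5):
            self.want[i] += self.step[i]
        for i in (1, 2, 3):              # nudge the three interior markers
            d = self.want[i] - pos[i]
            if (d >= 1 and pos[i + 1] - pos[i] > 1) or (d <= -1 and pos[i - 1] - pos[i] < -1):
                d = 1 if d > 0 else -1
                # Piecewise-parabolic prediction of the new height.
                cand = q[i] + d / (pos[i + 1] - pos[i - 1]) * (
                    (pos[i] - pos[i - 1] + d) * (q[i + 1] - q[i]) / (pos[i + 1] - pos[i])
                    + (pos[i + 1] - pos[i] - d) * (q[i] - q[i - 1]) / (pos[i] - pos[i - 1])
                )
                if not (q[i - 1] < cand < q[i + 1]):
                    # Fall back to linear interpolation to keep heights ordered.
                    cand = q[i] + d * (q[i + d] - q[i]) / (pos[i + d] - pos[i])
                q[i] = cand
                pos[i] += d
        return self

    def get(self):
        if not self.q:
            return None
        if len(self.q) < 5:              # assumption: nearest rank on the tiny buffer
            return self.q[round(self.p * (len(self.q) - 1))]
        return self.q[2]


# Demo on hypothetical data: the integers 1..1001 in a fixed shuffled order.
data = list(range(1, 1002))
random.Random(0).shuffle(data)
median = P2Quantile(0.5)
for value in data:
    median.update(value)
```

The outer markers track the exact minimum and maximum, so peak-to-peak comes for free, and an IQR estimate can be formed by running two such estimators with p = 0.25 and p = 0.75.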
Information-Theoretic Statistics
- Entropy: Estimated from streaming frequency counts: H = -sum p_k * log(p_k).
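- Kolmogorov-Smirnov statistic: Measures the maximum absolute difference between two empirical CDFs. It is a distributional distance rather than an information-theoretic quantity; the streaming variant updates the statistic as observations are added to either sample.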
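The entropy formula above can be sketched with a plain frequency counter. The class name `StreamingEntropy` is hypothetical, the natural logarithm is an assumed choice of base, and the estimate is recomputed from the counts on each query rather than updated incrementally, so memory grows with the number of distinct values.

```python
import math
from collections import Counter


class StreamingEntropy:
    """Entropy H = -sum p_k * log(p_k) estimated from running frequency counts."""

    def __init__(self):
        self.counts = Counter()
        self.n = 0

    def update(self, x):
        self.counts[x] += 1
        self.n += 1
        return self

    def get(self):
        if self.n == 0:
            return 0.0  # assumption: an empty stream has zero entropy
        total = 0.0
        for c in self.counts.values():
            p = c / self.n
            total -= p * math.log(p)  # natural log, so the result is in nats
        return total
```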
Other Statistics
- Mode: The most frequently observed value, tracked via a frequency counter.
- NUnique: The number of distinct values, which can be exact (with a set) or approximate (with HyperLogLog).
- AutoCorrelation: Correlation between a signal and its lagged version, requiring a buffer of the most recent values that is at least as long as the lag.
- Shift: Provides access to lagged values from the stream.
- SEM (Standard error of the mean): sigma / sqrt(n), computed from online variance.
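Several of the statistics in this group reduce to small standard-library data structures. The sketch below uses hypothetical class names (`ExactMode`, `ExactNUnique`, `LagShift`) and assumes one particular meaning for the lag, namely "the value observed `lag` steps before the most recent one"; an actual library may define the shift relative to the next update instead.

```python
from collections import Counter, deque


class ExactMode:
    """Most frequent value so far, tracked with a frequency counter."""

    def __init__(self):
        self.counts = Counter()

    def update(self, x):
        self.counts[x] += 1
        return self

    def get(self):
        if not self.counts:
            return None
        return max(self.counts, key=self.counts.get)


class ExactNUnique:
    """Exact distinct count using a set (memory grows with cardinality)."""

    def __init__(self):
        self.seen = set()

    def update(self, x):
        self.seen.add(x)
        return self

    def get(self):
        return len(self.seen)


class LagShift:
    """Returns the value seen `lag` steps before the latest observation."""

    def __init__(self, lag):
        # Assumption: keep lag + 1 values so the oldest is exactly `lag` steps back.
        self.buffer = deque(maxlen=lag + 1)

    def update(self, x):
        self.buffer.append(x)
        return self

    def get(self):
        if len(self.buffer) < self.buffer.maxlen:
            return None  # not enough history yet
        return self.buffer[0]
```

Replacing the set in `ExactNUnique` with a probabilistic sketch such as HyperLogLog trades exactness for bounded memory.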
Related Pages
- Implementation:Online_ml_River_Stats_AutoCorr
- Implementation:Online_ml_River_Stats_Count
- Implementation:Online_ml_River_Stats_Cov
- Implementation:Online_ml_River_Stats_EWMean
- Implementation:Online_ml_River_Stats_EWVar
- Implementation:Online_ml_River_Stats_Entropy
- Implementation:Online_ml_River_Stats_IQR
- Implementation:Online_ml_River_Stats_KolmogorovSmirnov
- Implementation:Online_ml_River_Stats_Kurtosis
- Implementation:Online_ml_River_Stats_MAD
- Implementation:Online_ml_River_Stats_Maximum
- Implementation:Online_ml_River_Stats_Mean
- Implementation:Online_ml_River_Stats_Minimum
- Implementation:Online_ml_River_Stats_Mode
- Implementation:Online_ml_River_Stats_NUnique
- Implementation:Online_ml_River_Stats_PeakToPeak
- Implementation:Online_ml_River_Stats_PearsonCorr
- Implementation:Online_ml_River_Stats_Quantile
- Implementation:Online_ml_River_Stats_SEM
- Implementation:Online_ml_River_Stats_Shift
- Implementation:Online_ml_River_Stats_Skew
- Implementation:Online_ml_River_Stats_Sum
- Implementation:Online_ml_River_Stats_Var