Implementation:Online ml River Stats KolmogorovSmirnov
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Statistics, Concept_Drift |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
KolmogorovSmirnov computes incremental two-sample Kolmogorov-Smirnov statistics for comparing distributions in streaming data.
Description
The two-sample Kolmogorov-Smirnov statistic is the largest absolute difference between the empirical cumulative distribution functions (ECDFs) of two samples. The implementation stores observations in a randomized tree structure called a Treap (Cartesian Tree) with bulk operations and lazy propagation, so each insertion or removal takes O(log N) time, whereas a batch implementation that re-sorts the data costs O(N log N) every time the statistic is recomputed. The implementation also supports Kuiper's test, which is more sensitive to differences in the tails of distributions.
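To make the definition concrete, the following is a minimal batch sketch of the two statistics in pure Python. It is not the Treap-based algorithm and not River's code; the helper names `ecdf_gaps`, `ks_statistic`, and `kuiper_statistic` are hypothetical, both samples are assumed to be non-empty, and the Kuiper statistic is assumed to follow the textbook definition (the sum of the largest positive and largest negative ECDF gaps).

```python
from bisect import bisect_right


def ecdf_gaps(sample_a, sample_b):
    """Return (d_plus, d_minus): the largest gaps F_a - F_b and F_b - F_a."""
    a, b = sorted(sample_a), sorted(sample_b)
    d_plus = d_minus = 0.0
    # Both ECDFs are step functions that only change at observed values,
    # so it is enough to evaluate the gap at each distinct observation.
    for v in set(a) | set(b):
        f_a = bisect_right(a, v) / len(a)  # fraction of a that is <= v
        f_b = bisect_right(b, v) / len(b)  # fraction of b that is <= v
        d_plus = max(d_plus, f_a - f_b)
        d_minus = max(d_minus, f_b - f_a)
    return d_plus, d_minus


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest absolute ECDF gap."""
    return max(ecdf_gaps(sample_a, sample_b))


def kuiper_statistic(sample_a, sample_b):
    """Two-sample Kuiper statistic: largest positive gap plus largest negative gap."""
    return sum(ecdf_gaps(sample_a, sample_b))
```

Each call re-sorts both samples, which is the O(N log N) batch cost that the incremental structure avoids when observations arrive one at a time.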
Usage
Use KolmogorovSmirnov for drift detection in streaming data, comparing two distributions to determine if they are significantly different. Common applications include monitoring machine learning model performance, detecting concept drift, quality control, and A/B testing where you need to compare distributions over time.
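Deciding whether two distributions are "significantly different" requires comparing the statistic to a threshold. One common choice, sketched below, is the asymptotic two-sample KS critical value; this is a standard textbook approximation, not something provided by the class itself, and the helper names `ks_critical_value` and `is_drift` are hypothetical. It assumes `n` and `m` are the sizes of the two samples, which are equal when values are fed in pairs.

```python
import math


def ks_critical_value(n, m, alpha=0.05):
    """Asymptotic rejection threshold for the two-sample KS statistic."""
    # c(alpha) = sqrt(-ln(alpha / 2) / 2), about 1.358 for alpha = 0.05
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


def is_drift(ks_value, n, m, alpha=0.05):
    """Flag drift when the observed statistic exceeds the threshold."""
    return ks_value > ks_critical_value(n, m, alpha)
```

With 100 observations per sample the threshold at `alpha=0.05` is roughly 0.192. The approximation is unreliable for small samples: with only 8 observations per sample the threshold is about 0.68, so a statistic of 0.5 would not be flagged.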
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/stats/kolmogorov_smirnov.py
Signature
class KolmogorovSmirnov(stats.base.Bivariate):
    def __init__(self, statistic="ks"):
        self.treap = None
        self.n_samples = 0
        self.statistic = statistic
Import
from river import stats
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| x | numbers.Number | Yes (update) | Value from the first distribution |
| y | numbers.Number | Yes (update) | Value from the second distribution |
| statistic | str | No (init) | Statistic to compute: "ks" or "kuiper" (default: "ks") |
Outputs
| Name | Type | Description |
|---|---|---|
| get() | float | Current KS statistic or Kuiper statistic (0 to 1) |
Usage Examples
from river import stats
# Create two streams with different distributions
stream_a = [1, 1, 2, 2, 3, 3, 4, 4]
stream_b = [1, 1, 1, 1, 2, 2, 2, 2]
# Compute incremental Kolmogorov-Smirnov statistic
incremental_ks = stats.KolmogorovSmirnov(statistic="ks")
for a, b in zip(stream_a, stream_b):
    incremental_ks.update(a, b)
print(f"KS statistic: {incremental_ks.get()}")
# Output: 0.5
print(f"Number of samples: {incremental_ks.n_samples}")
# Output: 8
# Using Kuiper's test for tail sensitivity
kuiper_stat = stats.KolmogorovSmirnov(statistic="kuiper")
for a, b in zip(stream_a, stream_b):
    kuiper_stat.update(a, b)
print(f"Kuiper statistic: {kuiper_stat.get():.3f}")
# Monitoring for drift
import numpy as np
ks_drift = stats.KolmogorovSmirnov()
# Reference distribution (normal)
for _ in range(100):
    x = np.random.normal(0, 1)
    y = np.random.normal(0, 1)
    ks_drift.update(x, y)
print(f"Same distribution KS: {ks_drift.get():.4f}")
# Different distribution (shifted mean)
ks_drift2 = stats.KolmogorovSmirnov()
for _ in range(100):
    x = np.random.normal(0, 1)
    y = np.random.normal(2, 1)  # Different mean
    ks_drift2.update(x, y)
print(f"Different distribution KS: {ks_drift2.get():.4f}")