Implementation:Online ml River Stats KolmogorovSmirnov

From Leeroopedia


Knowledge Sources
Domains: Online_Learning, Statistics, Concept_Drift
Last Updated: 2026-02-08 16:00 GMT

Overview

KolmogorovSmirnov incrementally computes the two-sample Kolmogorov-Smirnov (KS) statistic, which compares the distributions of two samples in streaming data.

Description

The statistic measures the largest absolute difference between the empirical cumulative distribution functions (ECDFs) of two samples. The implementation stores observations in a Treap (a randomized Cartesian tree) with bulk operations and lazy propagation, so each insertion or removal takes O(log N) time, whereas recomputing the statistic from scratch with a batch implementation costs O(N log N) per update. Setting statistic="kuiper" switches to Kuiper's statistic, which adds the largest positive and the largest negative ECDF deviations and is therefore more sensitive than KS to differences in the tails of the distributions.
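As a reference point for what the incremental structure maintains, the following is a minimal batch sketch in plain Python. It is not River's Treap-based code: the helper names (ecdf_gaps, ks_statistic, kuiper_statistic) are hypothetical and chosen for illustration, and each call re-sorts both samples.

```python
from bisect import bisect_right


def ecdf_gaps(sample_a, sample_b):
    """Return (d_plus, d_minus): the largest gaps F_a - F_b and F_b - F_a."""
    a, b = sorted(sample_a), sorted(sample_b)
    d_plus = d_minus = 0.0
    # Both ECDFs are step functions, so checking the observed values suffices.
    for v in set(a) | set(b):
        f_a = bisect_right(a, v) / len(a)  # fraction of sample_a <= v
        f_b = bisect_right(b, v) / len(b)  # fraction of sample_b <= v
        d_plus = max(d_plus, f_a - f_b)
        d_minus = max(d_minus, f_b - f_a)
    return d_plus, d_minus


def ks_statistic(sample_a, sample_b):
    # KS: the larger of the two one-sided gaps.
    return max(ecdf_gaps(sample_a, sample_b))


def kuiper_statistic(sample_a, sample_b):
    # Kuiper: the sum of the two one-sided gaps.
    return sum(ecdf_gaps(sample_a, sample_b))


stream_a = [1, 1, 2, 2, 3, 3, 4, 4]
stream_b = [1, 1, 1, 1, 2, 2, 2, 2]
print(ks_statistic(stream_a, stream_b))      # 0.5
print(kuiper_statistic(stream_a, stream_b))  # 0.5
```

On these two samples one ECDF never exceeds the other, so one of the gaps is zero and both statistics coincide. They differ only when the ECDFs cross.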

Usage

Use KolmogorovSmirnov to detect drift in streaming data by comparing two samples and checking whether their distributions differ. get() returns only the statistic, so judging significance requires comparing it against a threshold chosen for the sample sizes. Common applications include monitoring machine learning model performance, detecting concept drift, quality control, and A/B tests that compare distributions over time.
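To turn the statistic into a drift decision, one common approach, assumed here and not part of River's API, is to compare it with the large-sample critical value of the two-sample KS test, c(alpha) * sqrt((n + m) / (n * m)) with c(alpha) = sqrt(-ln(alpha / 2) / 2). The function names below are hypothetical, and the threshold presumes independent observations and a single test rather than repeated checks on a stream.

```python
import math


def ks_critical_value(n, m, alpha=0.05):
    """Approximate large-sample threshold for the two-sample KS statistic.

    n and m are the sizes of the two samples; alpha is the significance level.
    """
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


def looks_drifted(ks_value, n, m, alpha=0.05):
    """Flag drift when the observed statistic exceeds the threshold."""
    return ks_value > ks_critical_value(n, m, alpha)


# With 100 observations per sample the 5% threshold is roughly 0.192.
print(round(ks_critical_value(100, 100), 3))
print(looks_drifted(0.5, 100, 100))  # statistic well above the threshold
print(looks_drifted(0.1, 100, 100))  # statistic below the threshold
```

Because a stream is usually checked many times, a fixed alpha per check inflates the overall false-alarm rate, so a stricter alpha or a confirmation rule may be appropriate in practice.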

Code Reference

Source Location

Signature

class KolmogorovSmirnov(stats.base.Bivariate):
    def __init__(self, statistic="ks"):
        self.treap = None
        self.n_samples = 0
        self.statistic = statistic

Import

from river import stats

I/O Contract

Inputs

Name       Type            Required  Description
x          numbers.Number  Yes       Value from the first sample, passed to update(x, y)
y          numbers.Number  Yes       Value from the second sample, passed to update(x, y)
statistic  str             No        Constructor argument selecting the statistic: "ks" or "kuiper" (default: "ks")

Outputs

Name   Type   Description
get()  float  Current KS or Kuiper statistic (0 to 1)

Usage Examples

from river import stats

# Create two streams with different distributions
stream_a = [1, 1, 2, 2, 3, 3, 4, 4]
stream_b = [1, 1, 1, 1, 2, 2, 2, 2]

# Compute incremental Kolmogorov-Smirnov statistic
incremental_ks = stats.KolmogorovSmirnov(statistic="ks")

for a, b in zip(stream_a, stream_b):
    incremental_ks.update(a, b)

print(f"KS statistic: {incremental_ks.get()}")
# Output: 0.5

print(f"Number of samples: {incremental_ks.n_samples}")
# Output: 8

# Using Kuiper's test for tail sensitivity
kuiper_stat = stats.KolmogorovSmirnov(statistic="kuiper")

for a, b in zip(stream_a, stream_b):
    kuiper_stat.update(a, b)

print(f"Kuiper statistic: {kuiper_stat.get():.3f}")

# Monitoring for drift
import numpy as np

rng = np.random.default_rng(42)  # Seeded so the run is reproducible

# Both streams come from the same normal distribution: expect a small statistic
ks_drift = stats.KolmogorovSmirnov()
for _ in range(100):
    x = rng.normal(0, 1)
    y = rng.normal(0, 1)
    ks_drift.update(x, y)

print(f"Same distribution KS: {ks_drift.get():.4f}")

# Second stream has a shifted mean: expect a much larger statistic
ks_drift2 = stats.KolmogorovSmirnov()
for _ in range(100):
    x = rng.normal(0, 1)
    y = rng.normal(2, 1)  # Mean shifted from 0 to 2
    ks_drift2.update(x, y)

print(f"Different distribution KS: {ks_drift2.get():.4f}")
