Implementation:Online ml River Stats NUnique

Knowledge Sources	Online_ml_River
Domains	Online_Learning, Statistics
Last Updated	2026-02-08 16:00 GMT

Overview

NUnique provides an approximate count of unique values in a data stream using the HyperLogLog algorithm.

Description

This statistic estimates the number of distinct values observed in streaming data without storing the values themselves. It uses the HyperLogLog probabilistic algorithm, which keeps a fixed-size array of small counters (buckets) instead of a set, so memory use does not grow with the stream. Accuracy is controlled by the error_rate parameter: a lower error rate allocates more buckets, which costs more memory but gives better estimates. With the default error_rate=0.01, the constructor allocates 2**14 = 16,384 buckets. This implementation is adapted from the hypy library.
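To make the idea concrete, here is a minimal HyperLogLog sketch in pure Python. It is an illustration under stated assumptions, not River's implementation: the class name TinyHLL, the 64-bit blake2s hash, and the bias constant (the usual approximation for 128 or more buckets) are choices made for this example only.

```python
import hashlib
import math


class TinyHLL:
    """Illustrative HyperLogLog; hypothetical class, not part of River."""

    def __init__(self, n_bits: int = 10):
        self.n_bits = n_bits
        self.m = 1 << n_bits                 # number of buckets
        self.buckets = [0] * self.m
        # Bias-correction constant; this approximation assumes m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def update(self, x) -> None:
        # Assumption: hash the string form of x to a 64-bit integer.
        digest = hashlib.blake2s(str(x).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        idx = h & (self.m - 1)               # low n_bits select the bucket
        rest = h >> self.n_bits              # remaining 64 - n_bits bits
        width = 64 - self.n_bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        self.buckets[idx] = max(self.buckets[idx], rank)

    def estimate(self) -> int:
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -b for b in self.buckets)
        zeros = self.buckets.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * self.m and zeros:
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


if __name__ == "__main__":
    sketch = TinyHLL(n_bits=10)
    for i in range(10_000):
        sketch.update(i % 1000)              # 1000 distinct values, repeated
    print("estimate:", sketch.estimate(), "(exact: 1000)")
```

Each bucket only remembers the largest rank it has seen, so repeated values never change the state, and memory stays at m small integers regardless of stream length.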

Usage

Use NUnique when you need to count distinct values in large data streams where storing all unique values would be impractical. Common applications include counting unique users, unique IP addresses, distinct categories, vocabulary size in text streams, and cardinality estimation in database query optimization.

Code Reference

Source Location

Repository: Online_ml_River
File: river/stats/n_unique.py

Signature

class NUnique(stats.base.Univariate):
    P32 = 2**32

    def __init__(self, error_rate=0.01, seed: int | None = None):
        self.error_rate = error_rate
        self.seed = seed
        self.n_bits = int(math.ceil(math.log((1.04 / error_rate) ** 2, 2)))
        self.n_buckets = 1 << self.n_bits
        self.buckets = [0] * self.n_buckets
        self._salt = np.random.RandomState(seed).bytes(hashlib.blake2s.SALT_SIZE)
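The n_bits and n_buckets lines in the signature determine how much memory a given error_rate costs. The following sketch reuses that same formula in a standalone helper; the function name hll_size is hypothetical and exists only for this illustration.

```python
import math


def hll_size(error_rate: float) -> tuple[int, int]:
    """Return (n_bits, n_buckets) using the formula from the signature."""
    n_bits = int(math.ceil(math.log((1.04 / error_rate) ** 2, 2)))
    return n_bits, 1 << n_bits


for rate in (0.2, 0.05, 0.01):
    n_bits, n_buckets = hll_size(rate)
    print(f"error_rate={rate}: n_bits={n_bits}, n_buckets={n_buckets}")
# error_rate=0.2: n_bits=5, n_buckets=32
# error_rate=0.05: n_bits=9, n_buckets=512
# error_rate=0.01: n_bits=14, n_buckets=16384
```

Because the bucket count is rounded up to a power of two, the requested error rate acts as an upper target: the allocated sketch is at least as large as the formula asks for.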

Import

from river import stats

I/O Contract

Inputs

Name	Type	Required	Description
x	str (hashable)	Yes	Value to update the statistic with (converted to string)
error_rate	float	No (init)	Desired error rate; lower values improve accuracy at the cost of more memory (default: 0.01)
seed	int	No (init)	Random seed for reproducibility (default: None)

Outputs

Name	Type	Description
get()	int	Estimated count of unique values

Usage Examples

from river import stats
import string

# Count unique letters with a loose error rate (error_rate=0.2)
alphabet = string.ascii_lowercase
n_unique = stats.NUnique(error_rate=0.2, seed=42)

n_unique.update('a')
print(f"After 'a': {n_unique.get()}")
# Output: 1

n_unique.update('b')
print(f"After 'b': {n_unique.get()}")
# Output: 2

# Process all letters
for letter in alphabet:
    n_unique.update(letter)

print(f"Estimated unique letters: {n_unique.get()}")
# Output: 31 (actual: 26, some error due to error_rate=0.2)

# Higher precision with lower error rate
n_unique_precise = stats.NUnique(error_rate=0.01, seed=42)
for letter in alphabet:
    n_unique_precise.update(letter)

print(f"Precise estimate: {n_unique_precise.get()}")
# Output: 26 (closer to actual count)

# Counting unique users in a stream
user_counter = stats.NUnique(error_rate=0.01)
user_stream = ['user1', 'user2', 'user1', 'user3', 'user2', 'user4'] * 100

for user in user_stream:
    user_counter.update(user)

print(f"Estimated unique users: {user_counter.get()}")
# Output: ~4 (actual: 4 unique users)

# Large-scale unique counting
large_stream = stats.NUnique(error_rate=0.05)
for i in range(10000):
    large_stream.update(str(i % 1000))  # 1000 unique values

print(f"Unique values estimate: {large_stream.get()}")
# Output: close to 1000

Related Pages

Environment:Online_ml_River_Python_Runtime_Environment
