Implementation:Online ml River Stats NUnique

From Leeroopedia


Knowledge Sources
Domains: Online_Learning, Statistics
Last Updated: 2026-02-08 16:00 GMT

Overview

NUnique provides an approximate count of unique values in a data stream using the HyperLogLog algorithm.

Description

This statistic estimates the number of distinct values observed in streaming data without storing every unique value. It uses the HyperLogLog probabilistic algorithm, which keeps a fixed-size array of buckets instead of the values themselves. Accuracy is controlled by the error_rate parameter: the constructor allocates 2^ceil(log2((1.04 / error_rate)^2)) buckets, so a lower error rate uses more memory and gives tighter estimates. This implementation is adapted from the hypy library.
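To make the accuracy/memory trade-off concrete, the sketch below evaluates the bucket-count formula from the constructor in the Signature section for a few error rates. The helper name `hll_size` is hypothetical and is not part of river; only the formula itself comes from the source.

```python
import math


def hll_size(error_rate: float) -> tuple[int, int]:
    """Return (n_bits, n_buckets) using the formula from NUnique.__init__.

    Hypothetical helper for illustration; river does not expose this function.
    """
    n_bits = int(math.ceil(math.log((1.04 / error_rate) ** 2, 2)))
    return n_bits, 1 << n_bits


for rate in (0.2, 0.05, 0.01):
    n_bits, n_buckets = hll_size(rate)
    print(f"error_rate={rate}: n_bits={n_bits}, n_buckets={n_buckets}")
# error_rate=0.2: n_bits=5, n_buckets=32
# error_rate=0.05: n_bits=9, n_buckets=512
# error_rate=0.01: n_bits=14, n_buckets=16384
```

Because n_bits is rounded up, the bucket count always lands on a power of two, so nearby error rates can map to the same memory footprint.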

Usage

Use NUnique when you need to count distinct values in large data streams where storing all unique values would be impractical. Common applications include counting unique users, unique IP addresses, distinct categories, vocabulary size in text streams, and cardinality estimation in database query optimization.
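As a rough illustration of why exact counting becomes impractical, the sketch below compares the memory of a Python set holding every distinct value with a bucket list of the size the constructor would allocate for error_rate=0.01 (2**14 buckets). It uses only the standard library and does not call river; the numbers are CPython-specific, and `sys.getsizeof` is a shallow measure, so treat the result as an order-of-magnitude comparison.

```python
import sys

n = 100_000

# Exact approach: keep every distinct value in a set.
exact = {f"user{i}" for i in range(n)}
exact_bytes = sys.getsizeof(exact) + sum(sys.getsizeof(s) for s in exact)

# Sketch approach: a fixed list of 2**14 small integers, independent of n.
buckets = [0] * (1 << 14)
sketch_bytes = sys.getsizeof(buckets)

print(f"exact set:   {exact_bytes:,} bytes for {len(exact):,} values")
print(f"bucket list: {sketch_bytes:,} bytes regardless of stream size")
```

The set grows with the number of distinct values, while the bucket list stays the same size however long the stream runs.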

Code Reference

Source Location

Signature

class NUnique(stats.base.Univariate):
    P32 = 2**32

    def __init__(self, error_rate=0.01, seed: int | None = None):
        self.error_rate = error_rate
        self.seed = seed
        self.n_bits = int(math.ceil(math.log((1.04 / error_rate) ** 2, 2)))
        self.n_buckets = 1 << self.n_bits
        self.buckets = [0] * self.n_buckets
        self._salt = np.random.RandomState(seed).bytes(hashlib.blake2s.SALT_SIZE)
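The signature shows the state NUnique keeps (a bucket list and a hashing salt) but not how updates and estimates work. The following is a minimal, self-contained HyperLogLog sketch under stated assumptions, not river's implementation: the class name `TinyHLL`, the 64-bit digest, the bias constant, and the linear-counting correction are standard textbook choices picked here for illustration.

```python
import hashlib
import math


class TinyHLL:
    """Illustrative HyperLogLog counter. NOT river's NUnique implementation."""

    def __init__(self, n_bits: int = 10, salt: bytes = b""):
        self.n_bits = n_bits
        self.n_buckets = 1 << n_bits
        self.buckets = [0] * self.n_buckets
        self.salt = salt  # at most hashlib.blake2s.SALT_SIZE bytes

    def update(self, x) -> None:
        # Hash the string form of x to a 64-bit integer.
        digest = hashlib.blake2s(
            str(x).encode("utf8"), digest_size=8, salt=self.salt
        ).digest()
        h = int.from_bytes(digest, "big")
        rest_bits = 64 - self.n_bits
        idx = h >> rest_bits                      # top n_bits select the bucket
        rest = h & ((1 << rest_bits) - 1)         # remaining bits
        rank = rest_bits - rest.bit_length() + 1  # position of the leftmost 1-bit
        self.buckets[idx] = max(self.buckets[idx], rank)

    def get(self) -> int:
        m = self.n_buckets
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -b for b in self.buckets)
        zeros = self.buckets.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return round(m * math.log(m / zeros))
        return round(raw)


hll = TinyHLL(n_bits=10)
for i in range(10_000):
    hll.update(i % 1000)  # 1000 distinct values, each seen 10 times
print(hll.get())  # an estimate near 1000
```

Each bucket only remembers the largest rank it has seen, which is why repeated values never change the estimate and why memory stays fixed at n_buckets entries.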

Import

from river import stats

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| x | str (hashable) | Yes | Value to update the statistic with (converted to string) |
| error_rate | float | No (init) | Desired error rate; controls accuracy vs. memory (default: 0.01) |
| seed | int or None | No (init) | Random seed for reproducibility (default: None) |

Outputs

| Name | Type | Description |
|---|---|---|
| get() | int | Estimated count of unique values |

Usage Examples

from river import stats
import string

# Count unique letters with a loose error rate (0.2), which keeps memory small
alphabet = string.ascii_lowercase
n_unique = stats.NUnique(error_rate=0.2, seed=42)

n_unique.update('a')
print(f"After 'a': {n_unique.get()}")
# Output: 1

n_unique.update('b')
print(f"After 'b': {n_unique.get()}")
# Output: 2

# Process all letters
for letter in alphabet:
    n_unique.update(letter)

print(f"Estimated unique letters: {n_unique.get()}")
# Output: 31 (actual: 26, some error due to error_rate=0.2)

# Higher precision with lower error rate
n_unique_precise = stats.NUnique(error_rate=0.01, seed=42)
for letter in alphabet:
    n_unique_precise.update(letter)

print(f"Precise estimate: {n_unique_precise.get()}")
# Output: 26 (closer to actual count)

# Counting unique users in a stream
user_counter = stats.NUnique(error_rate=0.01)
user_stream = ['user1', 'user2', 'user1', 'user3', 'user2', 'user4'] * 100

for user in user_stream:
    user_counter.update(user)

print(f"Estimated unique users: {user_counter.get()}")
# Output: ~4 (actual: 4 unique users)

# Large-scale unique counting
large_stream = stats.NUnique(error_rate=0.05)
for i in range(10000):
    large_stream.update(str(i % 1000))  # 1000 unique values

print(f"Unique values estimate: {large_stream.get()}")
# Output: close to 1000
