Implementation:Online ml River Stats NUnique
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Statistics |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
NUnique provides an approximate count of unique values in a data stream using the HyperLogLog algorithm.
Description
This statistic estimates the number of distinct values observed in a data stream without storing the values themselves. It uses HyperLogLog, a probabilistic cardinality estimator that keeps a fixed-size array of buckets instead of a set of seen values. Accuracy is controlled by the error_rate parameter: the number of buckets is 2 raised to ceil(log2((1.04 / error_rate)^2)), so a lower error rate gives better estimates at the cost of more memory. This implementation is adapted from the hypy library.
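To make the idea concrete, here is a minimal, simplified HyperLogLog in pure Python. It is a sketch for illustration only and is not River's code: the class name TinyHyperLogLog, the 64-bit hash width, and the use of an unsalted blake2s digest are all assumptions made for this example.

```python
import hashlib
import math


class TinyHyperLogLog:
    """Simplified HyperLogLog; illustrative only, not River's implementation."""

    HASH_BITS = 64  # assumption: a 64-bit hash is used in this sketch

    def __init__(self, n_bits: int = 10):
        # n_bits should be >= 7 so the bias constant below applies (m >= 128).
        self.n_bits = n_bits
        self.n_buckets = 1 << n_bits          # m = 2 ** n_bits
        self.buckets = [0] * self.n_buckets   # largest rank seen per bucket

    def update(self, x) -> None:
        digest = hashlib.blake2s(str(x).encode("utf8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        width = self.HASH_BITS - self.n_bits
        index = h >> width                    # top n_bits select the bucket
        rest = h & ((1 << width) - 1)         # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        self.buckets[index] = max(self.buckets[index], rank)

    def get(self) -> int:
        m = self.n_buckets
        alpha = 0.7213 / (1 + 1.079 / m)      # bias correction for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.buckets)
        empty = self.buckets.count(0)
        if raw <= 2.5 * m and empty:
            # Small-range correction (linear counting on empty buckets).
            return round(m * math.log(m / empty))
        return round(raw)


# Duplicates do not change the bucket state, only new values can.
sketch = TinyHyperLogLog(n_bits=10)
for value in ["user1", "user2", "user1", "user3"] * 100:
    sketch.update(value)
print(f"Estimated distinct values: {sketch.get()}")
```

The state is a list of 2 ** n_bits small integers, so memory stays fixed no matter how long the stream is; only the estimate's precision depends on the bucket count.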
Usage
Use NUnique when you need to count distinct values in large data streams where storing all unique values would be impractical. Common applications include counting unique users, unique IP addresses, distinct categories, vocabulary size in text streams, and cardinality estimation in database query optimization.
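As a rough illustration of why exact counting becomes impractical, the following sketch compares the container size of a Python set holding every distinct value with a fixed-length bucket list. The 16384-slot list is an assumption chosen to match what the constructor's sizing formula gives for error_rate=0.01; the numbers reported by sys.getsizeof are CPython-specific and exclude the stored strings themselves.

```python
import sys

n_distinct = 100_000

# Exact approach: the set must keep one entry per distinct value.
exact = set()
for i in range(n_distinct):
    exact.add(f"user{i}")

# Sketch approach: a fixed-length list of small integers.
buckets = [0] * 16384

exact_bytes = sys.getsizeof(exact)     # hash table only; the strings cost extra
bucket_bytes = sys.getsizeof(buckets)  # does not grow with the stream

print(f"exact set container: {exact_bytes:,} bytes, grows with distinct values")
print(f"fixed bucket list:   {bucket_bytes:,} bytes, independent of stream length")
```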
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/stats/n_unique.py
Signature
class NUnique(stats.base.Univariate):
    P32 = 2**32

    def __init__(self, error_rate=0.01, seed: int | None = None):
        self.error_rate = error_rate
        self.seed = seed
        self.n_bits = int(math.ceil(math.log((1.04 / error_rate) ** 2, 2)))
        self.n_buckets = 1 << self.n_bits
        self.buckets = [0] * self.n_buckets
        self._salt = np.random.RandomState(seed).bytes(hashlib.blake2s.SALT_SIZE)
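The sizing arithmetic in the constructor can be checked in isolation. The helper below, hll_sizing, is a hypothetical name introduced for this example; it only repeats the two expressions that compute n_bits and n_buckets above.

```python
import math


def hll_sizing(error_rate: float) -> tuple[int, int]:
    """Return (n_bits, n_buckets) using the same arithmetic as the constructor."""
    n_bits = int(math.ceil(math.log((1.04 / error_rate) ** 2, 2)))
    n_buckets = 1 << n_bits
    return n_bits, n_buckets


for rate in (0.2, 0.05, 0.01):
    n_bits, n_buckets = hll_sizing(rate)
    print(f"error_rate={rate}: n_bits={n_bits}, n_buckets={n_buckets}")
# error_rate=0.2: n_bits=5, n_buckets=32
# error_rate=0.05: n_bits=9, n_buckets=512
# error_rate=0.01: n_bits=14, n_buckets=16384
```

Because the bucket count scales with the inverse square of error_rate, halving the error rate roughly quadruples the number of buckets.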
Import
from river import stats
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| x | Any (converted to str) | Yes | Value to update the statistic with; it is converted to a string before hashing |
| error_rate | float | No (init) | Desired error rate; controls the accuracy/memory trade-off (default: 0.01) |
| seed | int | No (init) | Random seed for reproducibility (default: None) |
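The seed matters because the constructor derives a blake2s salt from it. The sketch below shows the effect using only the standard library. It assumes the salt is passed to blake2s when each value is hashed, and it substitutes random.Random for the NumPy generator used in the signature, so make_salt and salted_hash are hypothetical helpers, not River functions.

```python
import hashlib
import random


def make_salt(seed: int) -> bytes:
    """Hypothetical helper: derive a salt of the size blake2s expects from a seed."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(hashlib.blake2s.SALT_SIZE))


def salted_hash(x, salt: bytes) -> int:
    """Hash str(x) with a salted blake2s and return the digest as an integer."""
    digest = hashlib.blake2s(str(x).encode("utf8"), salt=salt).hexdigest()
    return int(digest, 16)


salt_a = make_salt(42)
salt_b = make_salt(42)  # same seed, same salt
salt_c = make_salt(7)   # different seed, different salt

same = salted_hash("user1", salt_a) == salted_hash("user1", salt_b)
different = salted_hash("user1", salt_a) != salted_hash("user1", salt_c)
print(f"same seed gives same hash: {same}")
print(f"different seed gives different hash: {different}")
```

With a fixed seed, the same stream always lands in the same buckets, so the estimate is reproducible. With seed=None the salt is drawn at random, so two instances fed the same stream may return slightly different estimates.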
Outputs
| Name | Type | Description |
|---|---|---|
| get() | int | Estimated count of unique values |
Usage Examples
from river import stats
import string
# Count unique letters with moderate error rate
alphabet = string.ascii_lowercase
n_unique = stats.NUnique(error_rate=0.2, seed=42)
n_unique.update('a')
print(f"After 'a': {n_unique.get()}")
# Output: 1
n_unique.update('b')
print(f"After 'b': {n_unique.get()}")
# Output: 2
# Process all letters
for letter in alphabet:
    n_unique.update(letter)
print(f"Estimated unique letters: {n_unique.get()}")
# Output: 31 (actual: 26, some error due to error_rate=0.2)
# Higher precision with lower error rate
n_unique_precise = stats.NUnique(error_rate=0.01, seed=42)
for letter in alphabet:
    n_unique_precise.update(letter)
print(f"Precise estimate: {n_unique_precise.get()}")
# Output: 26 (closer to actual count)
# Counting unique users in a stream
user_counter = stats.NUnique(error_rate=0.01)
user_stream = ['user1', 'user2', 'user1', 'user3', 'user2', 'user4'] * 100
for user in user_stream:
    user_counter.update(user)
print(f"Estimated unique users: {user_counter.get()}")
# Output: ~4 (actual: 4 unique users)
# Large-scale unique counting
large_stream = stats.NUnique(error_rate=0.05)
for i in range(10000):
    large_stream.update(str(i % 1000))  # 1000 unique values
print(f"Unique values estimate: {large_stream.get()}")
# Output: close to 1000