Implementation:Online ml River Sketch Set
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Streaming_Algorithms, Bloom_Filters |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Bloom filter for approximate set membership testing with fixed memory and no false negatives.
Description
Implements a Bloom filter using a bit array and multiple hash functions for probabilistic set membership testing. Items are stored as binary codes by setting multiple bit positions. Membership queries never produce false negatives but may produce false positives (claiming an item exists when it doesn't). Supports set union and intersection operations between compatible filters.
Usage
Use for detecting duplicate items in streams, cache filtering, or database query optimization where false positives are acceptable but false negatives are not. Memory-efficient alternative to storing all seen items explicitly.
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/sketch/set.py
Signature
class Set(base.Base):
def __init__(self, capacity: int = 2048, fp_rate: float = 0.01, seed: int | None = None):
...
def add(self, x: typing.Hashable):
...
def update(self, values: typing.Iterable):
...
def __contains__(self, x: typing.Hashable):
...
def union(self, other: Set):
...
def intersection(self, other: Set):
...
Import
from river import sketch
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| capacity | int | Maximum distinct items to store (default: 2048) |
| fp_rate | float | False positive rate (default: 0.01) |
| seed | int or None | Random seed for hash functions |
Usage Examples
import random
from river import sketch
rng = random.Random(42)
s_set = sketch.Set(capacity=100, seed=0)
print(f"Hash functions: {s_set.n_hash}")
print(f"Bit array size: {s_set.n_bits}")
# Add items
for _ in range(1000):
s_set.add(rng.randint(0, 200))
# Check membership
print(f"\n1 in set: {1 in s_set}")
print(f"-10 in set (false positive): {-10 in s_set}")
# Batch update
s_set.update([1, 2, 3, 4, 5, 6, 7])
# Set operations
s1 = sketch.Set(seed=8)
s2 = sketch.Set(seed=8)
for _ in range(1000):
s1.add(rng.randint(0, 5000))
s2.add(rng.randint(0, 5000))
# Intersection
s_int = s1 & s2
print(f"\n43 in s1: {43 in s1}")
print(f"43 in s2: {43 in s2}")
print(f"43 in intersection: {43 in s_int}")
# Union
s_union = s1 | s2
print(f"43 in union: {43 in s_union}")