Implementation:Online ml River Stream Shuffle
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Data_Streaming, Randomization |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Shuffles a data stream using a fixed-size buffer to randomize element order while maintaining online processing.
Description
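The shuffle function randomizes stream order with bounded memory. It stores the first buffer_size elements in a buffer. Once the buffer is full, each incoming element triggers one step: a randomly chosen buffer item is yielded, and the incoming element takes its place. When the stream ends, the elements still in the buffer are shuffled and yielded. The technique resembles reservoir sampling in its use of a fixed-size buffer, but it is not a sampling scheme: every element of the stream is eventually emitted exactly once. Larger buffer sizes improve randomness but use more memory. This enables shuffling infinite or very large streams without loading all data into memory.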
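The buffer mechanism described above can be illustrated with a minimal, standard-library sketch. The function name `buffered_shuffle` and the guard on `buffer_size` are assumptions made for illustration; this is not River's source code, and its output for a given seed is not guaranteed to match `stream.shuffle`.

```python
import itertools
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    items: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Hypothetical stand-in for a buffer-based stream shuffle."""
    if buffer_size < 1:
        # Assumption of this sketch: an empty buffer is rejected outright.
        raise ValueError("buffer_size must be at least 1")

    rng = random.Random(seed)
    it = iter(items)

    # Fill the buffer with the first buffer_size elements of the stream.
    buffer = list(itertools.islice(it, buffer_size))

    # For every further element: yield a random slot, then refill that slot.
    for incoming in it:
        slot = rng.randrange(len(buffer))
        yield buffer[slot]
        buffer[slot] = incoming

    # The stream is exhausted: emit the leftovers in random order.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    print(list(buffered_shuffle(range(15), buffer_size=5, seed=42)))
```

One consequence of this design is that an element can never be emitted much earlier than it arrived: the k-th output (counting from zero) always comes from the first k + buffer_size inputs. Early elements, however, may linger in the buffer for a long time. With `buffer_size=1` the sketch returns the stream in its original order.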
Usage
Use this to randomize the order of training data in online learning when the stored order is an artifact of how the data was collected, for example a file sorted by label. Such ordering creates sequential dependencies that a model can mistake for concept drift. Shuffling reduces these artifacts and can improve convergence and generalization.
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/stream/shuffling.py
Signature
def shuffle(
    stream: typing.Iterator,
    buffer_size: int,
    seed: int | None = None
):
    ...
Import
from river import stream
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| stream | Iterator | The stream to shuffle |
| buffer_size | int | Size of buffer holding elements (larger = more random) |
| seed | int or None | Random seed for reproducibility |
Returns:
| Type | Description |
|---|---|
| Iterator | Shuffled stream |
Usage Examples
from river import stream
# Shuffle a range of numbers
print("Original order: 0-14")
print("Shuffled order:")
for i in stream.shuffle(range(15), buffer_size=5, seed=42):
    print(i, end=' ')
print()
# Shuffle a dataset
from river import datasets
dataset = datasets.Phishing()
shuffled = stream.shuffle(dataset, buffer_size=100, seed=42)
print("\nShuffled dataset (first 5):")
for i, (x, y) in enumerate(shuffled):
    if i >= 5:
        break
    print(f"Sample {i}: {list(x.keys())[:3]}... -> {y}")
# Compare buffer sizes
print("\nBuffer size impact (seed=1):")
print("Buffer=3:", list(stream.shuffle(range(10), buffer_size=3, seed=1)))
print("Buffer=7:", list(stream.shuffle(range(10), buffer_size=7, seed=1)))
print("Buffer=10:", list(stream.shuffle(range(10), buffer_size=10, seed=1)))
# For better randomization on finite datasets, consider:
# 1. Split data into chunks
# 2. Shuffle each chunk completely
# 3. Use round-robin to interleave chunks
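The three steps listed in the closing comments (split, shuffle each chunk, interleave) can be sketched as follows. This is an illustrative, standard-library example and not a River API: the name `chunked_shuffle`, the `n_chunks` parameter, and the choice of contiguous chunks are all assumptions. It also assumes the dataset is finite and fits in memory.

```python
import random
from typing import Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")


def chunked_shuffle(
    data: Sequence[T], n_chunks: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Hypothetical helper: shuffle chunks, then interleave them round-robin."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    if len(data) == 0:
        return

    rng = random.Random(seed)

    # 1. Split the data into contiguous chunks (ceiling division for the size).
    chunk_len = -(-len(data) // n_chunks)
    chunks = [list(data[i:i + chunk_len]) for i in range(0, len(data), chunk_len)]

    # 2. Shuffle each chunk completely.
    for chunk in chunks:
        rng.shuffle(chunk)

    # 3. Interleave the chunks round-robin, dropping chunks as they run out.
    sentinel = object()
    iterators = [iter(chunk) for chunk in chunks]
    while iterators:
        alive = []
        for it in iterators:
            item = next(it, sentinel)
            if item is not sentinel:
                alive.append(it)
                yield item
        iterators = alive


if __name__ == "__main__":
    print(list(chunked_shuffle(list(range(10)), n_chunks=3, seed=0)))
```

Compared with a single sliding buffer, this approach mixes elements from distant parts of the dataset from the very first output, at the cost of needing random access to the full data.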