Implementation:Online ml River Stream Shuffle

From Leeroopedia


Knowledge Sources
Domains Online_Learning, Data_Streaming, Randomization
Last Updated 2026-02-08 16:00 GMT

Overview

Shuffles a data stream using a fixed-size buffer to randomize element order while maintaining online processing.

Description

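The shuffle function randomizes stream order with bounded memory by keeping a fixed-size buffer of elements. The first buffer_size elements are held in memory. Once the buffer is full, a randomly chosen element is yielded and the next element of the stream takes its slot; this repeats until the stream is exhausted, after which the remaining buffer contents are emitted. The approach resembles reservoir sampling in its use of a fixed-size random buffer, but every element is emitted exactly once rather than a sample being kept. Larger buffer sizes improve the quality of the shuffle but use more memory. Because only the buffer is held in memory, infinite or very large streams can be shuffled without loading all data.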
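
The buffering described above can be sketched with the standard library alone. This is a minimal illustration, not River's source: the name buffer_shuffle and the buffer_size check are assumptions made for the example.

```python
import itertools
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def buffer_shuffle(
    items: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Hypothetical stand-in for a buffer-based stream shuffle (not River's code)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")

    rng = random.Random(seed)
    it = iter(items)

    # Fill the buffer with the first `buffer_size` elements.
    buffer = list(itertools.islice(it, buffer_size))

    # For each further element: emit a random buffer slot, then refill that slot.
    for element in it:
        i = rng.randrange(len(buffer))
        yield buffer[i]
        buffer[i] = element

    # The stream is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


# A buffer of one element leaves the order unchanged.
print(list(buffer_shuffle(range(5), buffer_size=1)))  # [0, 1, 2, 3, 4]

# With a larger buffer, output position k can only hold one of the first
# k + buffer_size input elements, so the shuffle is local rather than global.
out = list(buffer_shuffle(range(15), buffer_size=5, seed=42))
print(all(value <= k + 5 - 1 for k, value in enumerate(out)))  # True
```

Under these assumptions an element can move toward the front by at most buffer_size - 1 positions, which is one way to see why a small buffer leaves long-range ordering largely intact.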

Usage

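Use this when training data arrives in a non-random order (for example sorted by label or by time) and you want to reduce sequential dependencies between consecutive samples in online learning. Shuffling can reduce apparent concept drift that is caused purely by the ordering of the data, and it can help convergence and generalization. The shuffle is only as thorough as the buffer allows, so choose buffer_size relative to the length of the ordered runs in the data.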

Code Reference

Source Location

Signature

def shuffle(
    stream: typing.Iterator,
    buffer_size: int,
    seed: int | None = None
):
    ...

Import

from river import stream

I/O Contract

Parameter | Type | Description
stream | Iterator | The stream to shuffle
buffer_size | int | Size of buffer holding elements (larger = more random)
seed | int or None | Random seed for reproducibility

Returns:

Type | Description
Iterator | Shuffled stream

Usage Examples

from river import stream

# Shuffle a range of numbers
print("Original order: 0-14")
print("Shuffled order:")
for i in stream.shuffle(range(15), buffer_size=5, seed=42):
    print(i, end=' ')
print()

# Shuffle a dataset
from river import datasets

dataset = datasets.Phishing()
shuffled = stream.shuffle(dataset, buffer_size=100, seed=42)

print("\nShuffled dataset (first 5):")
for i, (x, y) in enumerate(shuffled):
    if i >= 5:
        break
    print(f"Sample {i}: {list(x.keys())[:3]}... -> {y}")

# Compare buffer sizes
print("\nBuffer size impact (seed=1):")
print("Buffer=3:", list(stream.shuffle(range(10), buffer_size=3, seed=1)))
print("Buffer=7:", list(stream.shuffle(range(10), buffer_size=7, seed=1)))
print("Buffer=10:", list(stream.shuffle(range(10), buffer_size=10, seed=1)))

# For better randomization on finite datasets, consider:
# 1. Split data into chunks
# 2. Shuffle each chunk completely
# 3. Use round-robin to interleave chunks
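
The chunk-based alternative listed in the closing comments can be sketched as follows. This is one possible reading of those three steps, using only the standard library; the helper name chunked_round_robin_shuffle and its n_chunks parameter are hypothetical and not part of River. It assumes the data is finite and fits in memory.

```python
import itertools
import random

_MISSING = object()  # sentinel for chunks that run out early


def chunked_round_robin_shuffle(items, n_chunks, seed=None):
    """Hypothetical helper: split, fully shuffle each chunk, then interleave."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")

    rng = random.Random(seed)
    data = list(items)  # finite data only: everything is held in memory
    if not data:
        return

    # 1. Split into at most n_chunks contiguous chunks of near-equal size.
    size = -(-len(data) // n_chunks)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    # 2. Shuffle each chunk completely.
    for chunk in chunks:
        rng.shuffle(chunk)

    # 3. Round-robin: take one element from each chunk in turn.
    for group in itertools.zip_longest(*chunks, fillvalue=_MISSING):
        for element in group:
            if element is not _MISSING:
                yield element


print(list(chunked_round_robin_shuffle(range(10), n_chunks=3, seed=1)))
```

Compared with a small shuffle buffer, this mixes elements from distant parts of the dataset early on, at the cost of materializing the whole dataset first.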
