Implementation:OpenRLHF OpenRLHF Blending datasets

From Leeroopedia
Revision as of 16:15, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/OpenRLHF_OpenRLHF_Blending_datasets.md)


Knowledge Sources
Domains Data_Processing, Training_Infrastructure
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool, provided by OpenRLHF, for loading and blending multiple datasets with configurable sampling.

Description

The blending_datasets function loads one or more datasets from a comma-separated string of paths. It supports several sources (HuggingFace Hub, local files, ModelScope), sub-selects up to max_count samples from each dataset, and then combines them: the datasets are interleaved according to the sampling weights when probabilities is provided, and concatenated otherwise. It also handles the @ syntax for specifying data directories and auto-detects file formats.
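As a rough illustration of the argument handling described above, the sketch below splits the comma-separated strings and separates an optional @ suffix. The helper name parse_blend_spec, the BlendSpec type, the placeholder paths, and the "path@data_dir" reading of the @ syntax are all assumptions made for illustration; this is not OpenRLHF's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BlendSpec:
    # (path, data_dir) pairs; data_dir is None when no "@" suffix is given.
    sources: List[Tuple[str, Optional[str]]]
    # Parsed sampling weights, or None when the datasets are concatenated.
    probabilities: Optional[List[float]]

    @property
    def mode(self) -> str:
        return "concatenate" if self.probabilities is None else "interleave"


def parse_blend_spec(datasets: str, probabilities: Optional[str] = None) -> BlendSpec:
    """Hypothetical helper: turn the string arguments into structured form."""
    sources = []
    for item in datasets.split(","):
        # Assumed reading of the "@" syntax: "<path>@<data_dir>".
        path, sep, data_dir = item.strip().partition("@")
        sources.append((path.strip(), data_dir.strip() if sep else None))

    weights = None
    if probabilities is not None:
        weights = [float(p) for p in probabilities.split(",")]
        if len(weights) != len(sources):
            raise ValueError("need exactly one probability per dataset")

    return BlendSpec(sources=sources, probabilities=weights)


# Placeholder paths, not real datasets.
spec = parse_blend_spec("org/dataset_a@subdir, ./local/data.jsonl", "0.7,0.3")
print(spec.mode, spec.sources)
```

Keeping the parsing separate from the loading makes the concatenate-or-interleave decision easy to check before any data is downloaded.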

Usage

Call this function after strategy initialization and before creating task-specific datasets (SFTDataset, RewardDataset, etc.). Pass the result to the dataset constructor.

Code Reference

Source Location

  • Repository: OpenRLHF
  • File: openrlhf/datasets/utils.py
  • Lines: L10-99

Signature

def blending_datasets(
    datasets,                    # str: comma-separated dataset paths
    probabilities=None,          # str or None: comma-separated sampling weights
    strategy=None,               # DeepspeedStrategy: for logging
    seed=42,                     # int: random seed
    max_count=1e8,               # int: max samples per dataset (the default 1e8 is a float literal)
    stopping_strategy="all_exhausted",  # str: interleave stopping strategy
    dataset_split="train",       # str: dataset split to use
) -> Dataset:

Import

from openrlhf.datasets.utils import blending_datasets

I/O Contract

Inputs

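Name               Type               Required  Description
datasets           str                Yes       Comma-separated dataset paths (HF Hub IDs, local paths)
probabilities      str                No        Comma-separated sampling weights (None = concatenate)
strategy           DeepspeedStrategy  Yes       Strategy object for logging (the signature default is None)
seed               int                No        Random seed (default 42)
max_count          int                No        Maximum samples per dataset (default 1e8)
stopping_strategy  str                No        Interleave stopping strategy (default "all_exhausted")
dataset_split      str                No        Dataset split to use (default "train")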
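The max_count cap can be pictured with a small sketch. The helper name cap_samples is hypothetical, and taking the leading rows is an assumption; this page does not say which rows the real function keeps.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def cap_samples(rows: Sequence[T], max_count: float = 1e8) -> List[T]:
    """Hypothetical helper: keep at most max_count leading rows."""
    # 1e8 is a float literal in Python, so cast it before using it as a bound.
    limit = min(int(max_count), len(rows))
    return list(rows[:limit])


capped = cap_samples(range(120_000), max_count=50_000)
untouched = cap_samples(range(10))
print(len(capped), len(untouched))
```

With the default of 1e8, datasets smaller than one hundred million rows pass through unchanged.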

Outputs

Name     Type              Description
dataset  datasets.Dataset  Blended HuggingFace Dataset

Usage Examples

Single Dataset

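from openrlhf.datasets.utils import blending_datasets

# `strategy` is the strategy object initialized earlier (see Usage).
# With a single path and no probabilities there is nothing to interleave.
dataset = blending_datasets(
    "Open-Orca/OpenOrca",
    strategy=strategy,
)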

Multiple Datasets with Probabilities

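# Providing probabilities interleaves the datasets instead of concatenating them.
dataset = blending_datasets(
    "Open-Orca/OpenOrca,HuggingFaceH4/ultrafeedback_binarized",
    probabilities="0.7,0.3",  # one weight per dataset, in the same order
    strategy=strategy,
    max_count=50000,  # cap applied to each dataset separately
)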
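To give a feel for what probability-weighted interleaving with a stopping strategy does, here is a simplified pure-Python model. It is not the library implementation: the function name interleave_model is hypothetical, the sources are assumed to be non-empty with strictly positive weights, and the semantics of "first_exhausted" and "all_exhausted" follow the meaning those names have in Hugging Face's interleave_datasets. Whether blending_datasets delegates to that function is not stated on this page.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def interleave_model(
    sources: Sequence[Sequence[T]],
    probabilities: Sequence[float],
    seed: int = 42,
    stopping_strategy: str = "all_exhausted",
) -> List[T]:
    """Simplified model of probability-weighted interleaving."""
    if stopping_strategy not in ("first_exhausted", "all_exhausted"):
        raise ValueError(f"unknown stopping strategy: {stopping_strategy}")
    rng = random.Random(seed)
    cursors = [0] * len(sources)
    exhausted = [False] * len(sources)
    # Stop when any source (first_exhausted) or every source (all_exhausted)
    # has been read through at least once.
    finished = any if stopping_strategy == "first_exhausted" else all
    blended: List[T] = []
    while not finished(exhausted):
        # Pick which source supplies the next row, weighted by probability.
        i = rng.choices(range(len(sources)), weights=probabilities)[0]
        blended.append(sources[i][cursors[i]])
        cursors[i] += 1
        if cursors[i] == len(sources[i]):
            # Source i is used up; wrap around so it can be oversampled
            # while the remaining sources catch up.
            exhausted[i] = True
            cursors[i] = 0
    return blended


a = [f"a{i}" for i in range(7)]
b = [f"b{i}" for i in range(3)]
mixed = interleave_model([a, b], [0.7, 0.3])
print(len(mixed), set(a + b) <= set(mixed))
```

Under "all_exhausted" every row of every source appears at least once, so smaller sources are repeated; under "first_exhausted" the blend ends as soon as one source runs out, so rows from the other sources may be dropped.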

Related Pages

Implements Principle
