Implementation:OpenRLHF OpenRLHF Blending datasets
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Training_Infrastructure |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool, provided by OpenRLHF, for loading and blending multiple datasets with configurable sampling.
Description
The blending_datasets function takes a comma-separated string of dataset paths and loads each one, accepting HuggingFace Hub IDs, local files, and ModelScope sources. It keeps at most max_count samples from each dataset, then concatenates the results when probabilities is None, or interleaves them according to the given sampling weights otherwise. A path may use the @ syntax to specify a data directory, and file formats are detected automatically.
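The string arguments described above can be illustrated with a small parsing sketch. This is not the OpenRLHF source: parse_blend_spec and SourceSpec are hypothetical names, the path@data_dir form is an assumption about how the @ syntax is laid out, and my/data@subset_a is a made-up path.

```python
from typing import List, NamedTuple, Optional, Tuple


class SourceSpec(NamedTuple):
    """One parsed dataset entry (hypothetical helper type)."""
    path: str
    data_dir: Optional[str]


def parse_blend_spec(
    datasets: str, probabilities: Optional[str] = None
) -> Tuple[List[SourceSpec], Optional[List[float]]]:
    """Split the comma-separated arguments into structured values.

    Assumption: an entry of the form "path@data_dir" names a data
    directory after the "@".
    """
    sources = []
    for item in datasets.split(","):
        path, sep, data_dir = item.strip().partition("@")
        sources.append(SourceSpec(path.strip(), data_dir.strip() if sep else None))

    weights = None
    if probabilities is not None:
        weights = [float(p) for p in probabilities.split(",")]
        if len(weights) != len(sources):
            raise ValueError("need exactly one probability per dataset")
    return sources, weights


sources, weights = parse_blend_spec(
    "Open-Orca/OpenOrca, my/data@subset_a", "0.7,0.3"
)
```

Returning None for the weights mirrors the documented rule that an absent probabilities argument selects concatenation rather than interleaving.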
Usage
Call this function after strategy initialization and before creating task-specific datasets (SFTDataset, RewardDataset, etc.). Pass the result to the dataset constructor.
Code Reference
Source Location
- Repository: OpenRLHF
- File: openrlhf/datasets/utils.py
- Lines: L10-99
Signature
def blending_datasets(
    datasets,                           # str: comma-separated dataset paths
    probabilities=None,                 # str or None: comma-separated sampling weights
    strategy=None,                      # DeepspeedStrategy: for logging
    seed=42,                            # int: random seed
    max_count=1e8,                      # int: max samples per dataset
    stopping_strategy="all_exhausted",  # str: interleave stopping strategy
    dataset_split="train",              # str: dataset split to use
) -> Dataset:
Import
from openrlhf.datasets.utils import blending_datasets
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| datasets | str | Yes | Comma-separated dataset paths (HF Hub IDs, local paths) |
| probabilities | str | No | Comma-separated sampling weights (None = concatenate) |
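| strategy | DeepspeedStrategy | Yes | Strategy object used for logging (the signature default is None, so pass it explicitly) |
| seed | int | No | Random seed (default 42) |
| max_count | int | No | Maximum samples per dataset (default 1e8) |
| stopping_strategy | str | No | Interleave stopping strategy (default "all_exhausted") |
| dataset_split | str | No | Dataset split to use (default "train") |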
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | datasets.Dataset | Blended HuggingFace Dataset |
Usage Examples
Single Dataset
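from openrlhf.datasets.utils import blending_datasets

# `strategy` is an already-initialized DeepspeedStrategy.
# With no probabilities, the single dataset is loaded as-is.
dataset = blending_datasets(
    "Open-Orca/OpenOrca",
    strategy=strategy,
)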
Multiple Datasets with Probabilities
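from openrlhf.datasets.utils import blending_datasets

# `strategy` is an already-initialized DeepspeedStrategy.
# Probabilities are given, so the two datasets are interleaved
# with sampling weights 0.7 and 0.3.
dataset = blending_datasets(
    "Open-Orca/OpenOrca,HuggingFaceH4/ultrafeedback_binarized",
    probabilities="0.7,0.3",
    strategy=strategy,
    max_count=50000,  # keep at most 50,000 samples from each dataset
)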
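To show what the probabilities argument changes, here is a toy model of the two blending modes that uses plain Python lists in place of datasets. It is a sketch under stated assumptions, not the OpenRLHF or HuggingFace implementation: toy_blend is a hypothetical name, and the weighted branch assumes that "all_exhausted" means sampling continues, wrapping around shorter sources, until every source has been read through at least once.

```python
import random
from typing import List, Optional, Sequence


def toy_blend(
    parts: Sequence[Sequence[str]],
    probabilities: Optional[Sequence[float]] = None,
    seed: int = 42,
    max_count: float = 1e8,
) -> List[str]:
    """Toy blend of list-based 'datasets' (hypothetical helper)."""
    # Keep at most max_count rows from each source before blending.
    capped = [list(p[: int(max_count)]) for p in parts]

    if probabilities is None:
        # No weights: append the sources one after another.
        return [row for part in capped for row in part]

    if len(probabilities) != len(capped) or any(w <= 0 for w in probabilities):
        raise ValueError("need one positive weight per dataset")
    if any(len(p) == 0 for p in capped):
        raise ValueError("every dataset must be non-empty")

    rng = random.Random(seed)
    cursors = [0] * len(capped)        # next row to read from each source
    seen_all = [False] * len(capped)   # has the source been fully read once?
    blended = []
    # Assumed "all_exhausted" behavior: stop only after every source has
    # been fully read at least once, oversampling the shorter ones.
    while not all(seen_all):
        i = rng.choices(range(len(capped)), weights=probabilities)[0]
        blended.append(capped[i][cursors[i]])
        cursors[i] += 1
        if cursors[i] == len(capped[i]):
            seen_all[i] = True
            cursors[i] = 0             # wrap around and keep sampling
    return blended


a = ["a0", "a1", "a2", "a3"]
b = ["b0", "b1"]
concatenated = toy_blend([a, b])
mixed = toy_blend([a, b], probabilities=[0.7, 0.3])
```

In this model the concatenated result has exactly as many rows as the capped sources combined, while the weighted result can be longer because rows from a source that finishes early are drawn again.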