Principle:OpenRLHF OpenRLHF Dataset Blending
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Training_Infrastructure |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A data preparation technique that combines multiple datasets, optionally with configurable sampling probabilities, into a single training dataset.
Description
Dataset Blending addresses the common need to train on heterogeneous data sources. Instead of requiring datasets to be merged by hand, it loads them from several kinds of source (HuggingFace Hub, local JSON/JSONL/CSV/Parquet files, datasets saved to disk), optionally sub-samples each one, and then either concatenates them directly or interleaves them according to specified sampling probabilities.
This enables curriculum-like training, where data sources contribute in different proportions, as well as simple multi-dataset training, where every source is included in full and contributes in proportion to its size.
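The load and sub-sample steps described above can be sketched with the Python standard library alone. This is a minimal illustration, not the OpenRLHF implementation: the helper names load_records and subsample are hypothetical, Hub and Parquet sources are left out, a JSON file is assumed to hold a top-level list of records, and sub-sampling is assumed to keep the first max_count records.

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Load a list of dict records from a .json, .jsonl, or .csv file.

    Hypothetical helper: dispatches on the file extension only.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".jsonl":
        # One JSON object per line; blank lines are skipped.
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    if suffix == ".json":
        # Assumption: the file holds a top-level list of records.
        with path.open(encoding="utf-8") as f:
            return list(json.load(f))
    if suffix == ".csv":
        # The header row supplies the field names; values stay strings.
        with path.open(encoding="utf-8", newline="") as f:
            return [dict(row) for row in csv.DictReader(f)]
    raise ValueError(f"unsupported file type: {suffix!r}")


def subsample(records, max_count):
    """Keep at most max_count records (assumed here: the first ones)."""
    return list(records[:max_count])
```

Each source would be passed through load_records and then subsample before blending, so that a very large source can be capped without editing the file itself.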
Usage
Use this principle whenever training data comes from multiple sources. It is used in every OpenRLHF training workflow (SFT, RM, DPO, KD) before dataset-specific processing.
Theoretical Basis
Concatenation mode (no probabilities): All datasets are simply concatenated end-to-end.
Interleaving mode (with probabilities): Each sample is drawn from one of the datasets, chosen at random according to the specified probabilities, using interleave_datasets from the HuggingFace datasets library:
# Abstract algorithm (pseudocode)
for each sample to emit:
    dataset_idx = sample_categorical(probabilities)
    sample = next(iterators[dataset_idx])
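Both modes can be made concrete with a small runnable sketch over plain Python lists. It is an illustration under stated assumptions, not OpenRLHF's code: the function name blend is hypothetical, and the stopping rule (stop the first time the chosen source has no samples left) is a simplification chosen for brevity.

```python
import random


def blend(datasets, probabilities=None, seed=42):
    """Blend lists of samples by concatenation or probabilistic interleaving.

    Hypothetical helper for illustration only.
    """
    if probabilities is None:
        # Concatenation mode: every source is appended end-to-end.
        return [sample for ds in datasets for sample in ds]

    if len(probabilities) != len(datasets):
        raise ValueError("need exactly one probability per dataset")
    if abs(sum(probabilities) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")

    rng = random.Random(seed)  # seeded so the blend is reproducible
    iterators = [iter(ds) for ds in datasets]
    indices = range(len(datasets))
    blended = []
    while True:
        # Categorical draw: pick which source supplies the next sample.
        idx = rng.choices(indices, weights=probabilities, k=1)[0]
        try:
            blended.append(next(iterators[idx]))
        except StopIteration:
            # Assumed stopping rule: the chosen source is exhausted.
            return blended


if __name__ == "__main__":
    chat = [f"chat-{i}" for i in range(1000)]
    code = [f"code-{i}" for i in range(1000)]
    mixed = blend([chat, code], probabilities=[0.8, 0.2])
    share = sum(s.startswith("chat") for s in mixed) / len(mixed)
    print(f"{len(mixed)} samples, chat share = {share:.2f}")
```

With HuggingFace Dataset objects, the same two modes correspond to datasets.concatenate_datasets and datasets.interleave_datasets, where the probabilities, seed, and stopping_strategy arguments of interleave_datasets control the draw and decide when interleaving ends.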