Principle:OpenRLHF OpenRLHF Dataset Blending

From Leeroopedia
Revision as of 18:03, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/OpenRLHF_OpenRLHF_Dataset_Blending.md)

Knowledge Sources
Domains Data_Processing, Training_Infrastructure
Last Updated 2026-02-07 00:00 GMT

Overview

A data preparation technique that combines multiple datasets into a single training dataset, either by direct concatenation or by interleaving with configurable sampling probabilities.

Description

Dataset Blending addresses the common need to train on heterogeneous data sources. Rather than requiring manual dataset merging, it loads datasets from various formats (HuggingFace Hub, local JSON/JSONL/CSV/Parquet files, saved datasets), optionally sub-samples each, and either concatenates them directly or interleaves them with specified sampling probabilities.
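
The source-resolution and sub-sampling steps described above can be sketched in plain Python. This is a minimal illustration, not OpenRLHF's actual loader: resolve_source, subsample, and FILE_BUILDERS are hypothetical names, and the dispatch from file extension to the json, csv, and parquet builders of datasets.load_dataset is an assumption about how such a loader might be organized.

```python
import os

# Hypothetical mapping from file extension to a Hugging Face builder name.
# JSONL files are read by the same "json" builder as JSON files.
FILE_BUILDERS = {".json": "json", ".jsonl": "json", ".csv": "csv", ".parquet": "parquet"}


def resolve_source(path):
    """Classify a dataset reference as a local file, a saved dataset, or a Hub id."""
    ext = os.path.splitext(path)[-1].lower()
    if ext in FILE_BUILDERS:
        # Would correspond to datasets.load_dataset(builder, data_files=path)
        return ("file", FILE_BUILDERS[ext])
    if os.path.isdir(path):
        # Would correspond to datasets.load_from_disk(path)
        return ("saved", None)
    # Anything else is treated as a Hub repository id
    return ("hub", None)


def subsample(rows, max_count):
    """Keep at most max_count leading rows of a source."""
    return rows[: min(max_count, len(rows))]
```

Returning a tag instead of loading the data keeps the sketch free of third-party dependencies; a real loader would call the library function named in each comment.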

This enables curriculum-like training, in which each data source contributes a chosen proportion of the examples, or simple multi-dataset training, in which the sources are pooled without weighting.
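
To make the proportions concrete, the expected contribution of each source under probability-weighted sampling is its probability multiplied by the number of draws. The helper below is a hypothetical illustration (expected_counts is not an OpenRLHF function), and the probabilities and draw count are made-up values.

```python
import math


def expected_counts(probabilities, total_draws):
    """Expected number of examples taken from each source over total_draws draws."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return [p * total_draws for p in probabilities]


# Three sources weighted 50% / 25% / 25% over 8000 draws
counts = expected_counts([0.5, 0.25, 0.25], 8000)
print(counts)  # [4000.0, 2000.0, 2000.0]
```

These are expectations only; the realized counts in any one blended dataset vary around them with the random seed.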

Usage

Use this principle whenever training data comes from multiple sources. It is used in every OpenRLHF training workflow (SFT, RM, DPO, KD) before dataset-specific processing.

Theoretical Basis

Concatenation mode (no probabilities): All datasets are simply concatenated end-to-end.
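
Concatenation mode can be sketched with plain Python lists standing in for datasets. This is an illustration only: concatenate is a hypothetical helper, and the rows are made-up examples. With Hugging Face datasets the corresponding library call is concatenate_datasets.

```python
from itertools import chain


def concatenate(datasets):
    """Append the sources end-to-end, preserving order within and across sources."""
    return list(chain.from_iterable(datasets))


source_a = [{"prompt": "a1"}, {"prompt": "a2"}]
source_b = [{"prompt": "b1"}]
blended = concatenate([source_a, source_b])
# blended holds a1, a2, b1 in that order; its length is len(source_a) + len(source_b)
```

Because nothing is sampled, each source's share of the result is simply its share of the total row count.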

Interleaving mode (with probabilities): Each example in the blended dataset is taken from one source dataset, chosen at random according to the specified probabilities, using Hugging Face's interleave_datasets:

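# Abstract algorithm (one categorical draw per output example, not per batch)
until the stopping strategy is satisfied:
    dataset_idx = sample_categorical(probabilities)
    example = next(iterators[dataset_idx])
    append example to the blended dataset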
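
The abstract algorithm can be made runnable with the standard library alone. The sketch below is a simplified stand-in, not the library's implementation: interleave is a hypothetical function operating on plain lists, and it will not reproduce the exact sample order of datasets.interleave_datasets. The two stopping strategies mirror the library's "first_exhausted" and "all_exhausted" options, with exhausted sources restarting from their first row in the latter case.

```python
import random


def interleave(datasets, probabilities, seed=0, stopping_strategy="first_exhausted"):
    """Blend lists by drawing one example at a time from a randomly chosen source."""
    if len(datasets) != len(probabilities):
        raise ValueError("need exactly one probability per dataset")
    if any(len(d) == 0 for d in datasets) or any(p <= 0 for p in probabilities):
        raise ValueError("datasets must be non-empty and probabilities positive")
    if stopping_strategy not in ("first_exhausted", "all_exhausted"):
        raise ValueError("unknown stopping strategy")

    # Stop when any source (first_exhausted) or every source (all_exhausted)
    # has been read through at least once.
    done = any if stopping_strategy == "first_exhausted" else all
    rng = random.Random(seed)
    positions = [0] * len(datasets)      # number of rows taken from each source
    exhausted = [False] * len(datasets)  # True once a source has been fully read
    blended = []
    while not done(exhausted):
        idx = rng.choices(range(len(datasets)), weights=probabilities)[0]
        source = datasets[idx]
        # The modulo wraps around, so an exhausted source is re-read from the start
        blended.append(source[positions[idx] % len(source)])
        positions[idx] += 1
        exhausted[idx] = exhausted[idx] or positions[idx] >= len(source)
    return blended


chat = [f"chat-{i}" for i in range(1000)]
code = [f"code-{i}" for i in range(1000)]
mixed = interleave([chat, code], [0.8, 0.2], seed=42)
share = sum(x.startswith("chat") for x in mixed) / len(mixed)
# share should land near 0.8; the exact value depends on the seed
```

Fixing the seed makes the blend reproducible across runs, which matters when several training processes must see the same data order.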

Related Pages

Implemented By