Principle:Huggingface Trl GRPO Prompt Dataset Loading

Property	Value
Principle Name	GRPO Prompt Dataset Loading
Library	Huggingface TRL
Category	Data Loading / Online RL

Overview

Description

GRPO training operates in an online generation paradigm: the model generates its own completions during training, rather than learning from pre-existing (prompt, completion) pairs. This fundamentally changes the data requirements compared to supervised fine-tuning. In GRPO, the training dataset contains only prompts (or prompts with metadata columns like solution for reward computation), and the model generates completions on-the-fly.

The GRPO Prompt Dataset Loading principle covers how datasets are loaded, mixed, and prepared for the online RL training pipeline. TRL supports two dataset loading paths: a simple single-dataset path via dataset_name, and a mixture-of-datasets path via the DatasetMixtureConfig system.

Usage

Datasets for GRPO must contain at minimum a "prompt" column. They may also contain additional columns (e.g., "solution") that are forwarded to reward functions as keyword arguments. The prompt format can be either:

Standard: Each prompt is a plain text string
Conversational: Each prompt is a list of message dictionaries with "role" and "content" keys

The dataset loading supports:

Single dataset via dataset_name and dataset_config
Multiple datasets via DatasetMixtureConfig with per-dataset column selection
Optional train/test splitting via test_split_size
Streaming mode for large datasets

Theoretical Basis

The distinction between offline and online RL training paradigms is central to understanding GRPO's data requirements:

Offline RL (e.g., DPO, KTO) trains on pre-collected preference data. The dataset contains both prompts and completions (often pairs of chosen/rejected responses). The policy never generates text during training.

Online RL (e.g., GRPO, PPO) requires the current policy to generate completions. This means:

The dataset needs only prompts, not completions
Data diversity comes from the model's sampling (controlled by temperature, top_p, etc.) rather than from pre-collected responses
Additional columns in the dataset serve as metadata for reward computation, not as training targets

The mixture-of-datasets approach via DatasetMixtureConfig enables multi-task training scenarios where prompts from different domains or difficulty levels are combined. Each dataset in the mixture can specify a subset of columns to include, allowing heterogeneous dataset schemas to be unified. The datasets are concatenated after loading and optional column selection.

When remove_unused_columns is set to False (the default in GRPO), all columns are preserved and forwarded to reward functions. This is essential because reward functions often need access to ground-truth solutions or other metadata that would otherwise be stripped by the standard Trainer column-removal logic.

Related Pages

Implementation:Huggingface_Trl_Get_Dataset_GRPO

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment