Principle:Huggingface Trl GRPO Prompt Dataset Loading
| Property | Value |
|---|---|
| Principle Name | GRPO Prompt Dataset Loading |
| Library | Huggingface TRL |
| Category | Data Loading / Online RL |
Overview
Description
GRPO training operates in an online generation paradigm: the model generates its own completions during training, rather than learning from pre-existing (prompt, completion) pairs. This fundamentally changes the data requirements compared to supervised fine-tuning. In GRPO, the training dataset contains only prompts (or prompts with metadata columns like solution for reward computation), and the model generates completions on-the-fly.
The GRPO Prompt Dataset Loading principle covers how datasets are loaded, mixed, and prepared for the online RL training pipeline. TRL supports two dataset loading paths: a simple single-dataset path via dataset_name, and a mixture-of-datasets path via the DatasetMixtureConfig system.
Usage
Datasets for GRPO must contain at minimum a "prompt" column. They may also contain additional columns (e.g., "solution") that are forwarded to reward functions as keyword arguments. The prompt format can be either:
- Standard: Each prompt is a plain text string
- Conversational: Each prompt is a list of message dictionaries with
"role"and"content"keys
The dataset loading supports:
- Single dataset via
dataset_nameanddataset_config - Multiple datasets via
DatasetMixtureConfigwith per-dataset column selection - Optional train/test splitting via
test_split_size - Streaming mode for large datasets
Theoretical Basis
The distinction between offline and online RL training paradigms is central to understanding GRPO's data requirements:
Offline RL (e.g., DPO, KTO) trains on pre-collected preference data. The dataset contains both prompts and completions (often pairs of chosen/rejected responses). The policy never generates text during training.
Online RL (e.g., GRPO, PPO) requires the current policy to generate completions. This means:
- The dataset needs only prompts, not completions
- Data diversity comes from the model's sampling (controlled by
temperature,top_p, etc.) rather than from pre-collected responses - Additional columns in the dataset serve as metadata for reward computation, not as training targets
The mixture-of-datasets approach via DatasetMixtureConfig enables multi-task training scenarios where prompts from different domains or difficulty levels are combined. Each dataset in the mixture can specify a subset of columns to include, allowing heterogeneous dataset schemas to be unified. The datasets are concatenated after loading and optional column selection.
When remove_unused_columns is set to False (the default in GRPO), all columns are preserved and forwarded to reward functions. This is essential because reward functions often need access to ground-truth solutions or other metadata that would otherwise be stripped by the standard Trainer column-removal logic.