Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Huggingface Trl GRPO Prompt Dataset Loading

From Leeroopedia


Property Value
Principle Name GRPO Prompt Dataset Loading
Library Huggingface TRL
Category Data Loading / Online RL

Overview

Description

GRPO training operates in an online generation paradigm: the model generates its own completions during training, rather than learning from pre-existing (prompt, completion) pairs. This fundamentally changes the data requirements compared to supervised fine-tuning. In GRPO, the training dataset contains only prompts (or prompts with metadata columns like solution for reward computation), and the model generates completions on-the-fly.

The GRPO Prompt Dataset Loading principle covers how datasets are loaded, mixed, and prepared for the online RL training pipeline. TRL supports two dataset loading paths: a simple single-dataset path via dataset_name, and a mixture-of-datasets path via the DatasetMixtureConfig system.

Usage

Datasets for GRPO must contain at minimum a "prompt" column. They may also contain additional columns (e.g., "solution") that are forwarded to reward functions as keyword arguments. The prompt format can be either:

  • Standard: Each prompt is a plain text string
  • Conversational: Each prompt is a list of message dictionaries with "role" and "content" keys

The dataset loading supports:

  • Single dataset via dataset_name and dataset_config
  • Multiple datasets via DatasetMixtureConfig with per-dataset column selection
  • Optional train/test splitting via test_split_size
  • Streaming mode for large datasets

Theoretical Basis

The distinction between offline and online RL training paradigms is central to understanding GRPO's data requirements:

Offline RL (e.g., DPO, KTO) trains on pre-collected preference data. The dataset contains both prompts and completions (often pairs of chosen/rejected responses). The policy never generates text during training.

Online RL (e.g., GRPO, PPO) requires the current policy to generate completions. This means:

  1. The dataset needs only prompts, not completions
  2. Data diversity comes from the model's sampling (controlled by temperature, top_p, etc.) rather than from pre-collected responses
  3. Additional columns in the dataset serve as metadata for reward computation, not as training targets

The mixture-of-datasets approach via DatasetMixtureConfig enables multi-task training scenarios where prompts from different domains or difficulty levels are combined. Each dataset in the mixture can specify a subset of columns to include, allowing heterogeneous dataset schemas to be unified. The datasets are concatenated after loading and optional column selection.

When remove_unused_columns is set to False (the default in GRPO), all columns are preserved and forwarded to reward functions. This is essential because reward functions often need access to ground-truth solutions or other metadata that would otherwise be stripped by the standard Trainer column-removal logic.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment