Implementation:Huggingface Open r1 GRPOTrainer Usage

Metadata

Field	Value
Sources	Repo: huggingface/open-r1; Doc: TRL GRPOTrainer
Domains	NLP, Reinforcement_Learning, Training
Last Updated	2026-02-08 00:00 GMT

Principle:Huggingface_Open_r1_GRPO_Training

Overview

Wrapper for HuggingFace TRL's GRPOTrainer for reinforcement learning-based training with configurable reward functions, configured with Open-R1's custom GRPOConfig and reward registry.

Description

This is a Wrapper Doc. Open-R1 uses TRL's GRPOTrainer with its extended GRPOConfig that adds reward function selection, code execution providers, benchmark callbacks, and Hub revision pushing. The trainer accepts a list of reward functions (resolved from the REWARD_FUNCS_REGISTRY), supports chat-formatted prompts via make_conversation, and integrates with Open-R1's callback system for per-checkpoint evaluation.

The training pipeline in grpo.py orchestrates the full workflow: parsing configuration, loading the model and dataset, resolving reward functions from the registry, initializing the trainer, and executing the training loop with checkpoint resumption support.

Usage

Use when running the Open-R1 GRPO training workflow for improving reasoning capabilities via RL. This is typically invoked through accelerate launch with a YAML recipe configuration file that specifies all hyperparameters, reward functions, and dataset settings.

Code Reference

Source Location

Property	Value
Repository	open-r1
File	`src/open_r1/grpo.py`
Lines	L112-137

Signature

trainer = GRPOTrainer(
    model=model,                           # AutoModelForCausalLM
    reward_funcs=reward_funcs,             # list[Callable]
    args=training_args,                     # GRPOConfig
    train_dataset=dataset[train_split],     # Dataset
    eval_dataset=dataset[test_split],       # Optional[Dataset]
    peft_config=get_peft_config(model_args),  # Optional[PeftConfig]
    callbacks=get_callbacks(training_args, model_args),  # list[TrainerCallback]
    processing_class=tokenizer,             # PreTrainedTokenizer
)
train_result = trainer.train(resume_from_checkpoint=checkpoint)

Import

from trl import GRPOTrainer, get_peft_config

External Reference

TRL GRPOTrainer Documentation

I/O Contract

Inputs

Parameter	Type	Required	Description
`model`	`AutoModelForCausalLM`	Yes	The language model to train via GRPO reinforcement learning.
`reward_funcs`	`list[Callable]`	Yes	List of reward functions resolved from `REWARD_FUNCS_REGISTRY`. Each function scores a completion and returns a numeric reward.
`args`	`GRPOConfig`	Yes	Extended training configuration including `learning_rate`, `num_generations`, `temperature`, `max_completion_length`, and `beta` (KL penalty coefficient).
`train_dataset`	`Dataset`	Yes	Training dataset with chat-formatted prompts (conversation structure with user messages).
`eval_dataset`	`Dataset`	No	Optional evaluation dataset for periodic validation during training.
`peft_config`	`PeftConfig`	No	Optional PEFT/LoRA configuration for parameter-efficient training.
`callbacks`	`list[TrainerCallback]`	No	Optional list of callbacks (e.g., benchmark evaluation, Hub revision pushing).
`processing_class`	`PreTrainedTokenizer`	Yes	Tokenizer for encoding prompts and decoding completions.

Outputs

Output	Type	Description
Return value	`TrainOutput`	Training result object containing metrics (loss, rewards, completion lengths).
Side effect	Model weights	Model parameters are updated in-place via the GRPO RL algorithm.
Side effect	Metrics	Training and evaluation metrics are logged to the configured logger (W&B, TensorBoard) and saved to disk.

Usage Example

The following shows the GRPO training setup as structured in grpo.py:

# 1. Parse configuration from YAML recipe
parser = TrlParser((GRPOScriptArguments, GRPOConfig, ModelConfig))
script_args, training_args, model_args = parser.parse_args_and_config()

# 2. Load dataset and format as conversations
dataset = get_dataset(script_args, training_args)
dataset = dataset.map(make_conversation, fn_kwargs={"script_args": script_args})

# 3. Load model and tokenizer
model = get_model(model_args, training_args)
tokenizer = get_tokenizer(model_args, training_args)

# 4. Resolve reward functions from registry
reward_funcs = get_reward_funcs(script_args, training_args)

# 5. Initialize trainer and run
trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_funcs,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset.get("test"),
    peft_config=get_peft_config(model_args),
    callbacks=get_callbacks(training_args, model_args),
    processing_class=tokenizer,
)
trainer.train(resume_from_checkpoint=last_checkpoint)
trainer.save_model(training_args.output_dir)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment