Implementation:Huggingface Open r1 GRPOTrainer Usage
Metadata
| Field | Value |
|---|---|
| Sources | Repo: huggingface/open-r1; Doc: TRL GRPOTrainer |
| Domains | NLP, Reinforcement_Learning, Training |
| Last Updated | 2026-02-08 00:00 GMT |
Principle:Huggingface_Open_r1_GRPO_Training
Overview
Wrapper for HuggingFace TRL's GRPOTrainer for reinforcement learning-based training with configurable reward functions, configured with Open-R1's custom GRPOConfig and reward registry.
Description
This is a Wrapper Doc. Open-R1 uses TRL's GRPOTrainer with its extended GRPOConfig that adds reward function selection, code execution providers, benchmark callbacks, and Hub revision pushing. The trainer accepts a list of reward functions (resolved from the REWARD_FUNCS_REGISTRY), supports chat-formatted prompts via make_conversation, and integrates with Open-R1's callback system for per-checkpoint evaluation.
The training pipeline in grpo.py orchestrates the full workflow: parsing configuration, loading the model and dataset, resolving reward functions from the registry, initializing the trainer, and executing the training loop with checkpoint resumption support.
Usage
Use when running the Open-R1 GRPO training workflow for improving reasoning capabilities via RL. This is typically invoked through accelerate launch with a YAML recipe configuration file that specifies all hyperparameters, reward functions, and dataset settings.
Code Reference
Source Location
| Property | Value |
|---|---|
| Repository | open-r1 |
| File | src/open_r1/grpo.py
|
| Lines | L112-137 |
Signature
trainer = GRPOTrainer(
model=model, # AutoModelForCausalLM
reward_funcs=reward_funcs, # list[Callable]
args=training_args, # GRPOConfig
train_dataset=dataset[train_split], # Dataset
eval_dataset=dataset[test_split], # Optional[Dataset]
peft_config=get_peft_config(model_args), # Optional[PeftConfig]
callbacks=get_callbacks(training_args, model_args), # list[TrainerCallback]
processing_class=tokenizer, # PreTrainedTokenizer
)
train_result = trainer.train(resume_from_checkpoint=checkpoint)
Import
from trl import GRPOTrainer, get_peft_config
External Reference
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
model |
AutoModelForCausalLM |
Yes | The language model to train via GRPO reinforcement learning. |
reward_funcs |
list[Callable] |
Yes | List of reward functions resolved from REWARD_FUNCS_REGISTRY. Each function scores a completion and returns a numeric reward.
|
args |
GRPOConfig |
Yes | Extended training configuration including learning_rate, num_generations, temperature, max_completion_length, and beta (KL penalty coefficient).
|
train_dataset |
Dataset |
Yes | Training dataset with chat-formatted prompts (conversation structure with user messages). |
eval_dataset |
Dataset |
No | Optional evaluation dataset for periodic validation during training. |
peft_config |
PeftConfig |
No | Optional PEFT/LoRA configuration for parameter-efficient training. |
callbacks |
list[TrainerCallback] |
No | Optional list of callbacks (e.g., benchmark evaluation, Hub revision pushing). |
processing_class |
PreTrainedTokenizer |
Yes | Tokenizer for encoding prompts and decoding completions. |
Outputs
| Output | Type | Description |
|---|---|---|
| Return value | TrainOutput |
Training result object containing metrics (loss, rewards, completion lengths). |
| Side effect | Model weights | Model parameters are updated in-place via the GRPO RL algorithm. |
| Side effect | Metrics | Training and evaluation metrics are logged to the configured logger (W&B, TensorBoard) and saved to disk. |
Usage Example
The following shows the GRPO training setup as structured in grpo.py:
# 1. Parse configuration from YAML recipe
parser = TrlParser((GRPOScriptArguments, GRPOConfig, ModelConfig))
script_args, training_args, model_args = parser.parse_args_and_config()
# 2. Load dataset and format as conversations
dataset = get_dataset(script_args, training_args)
dataset = dataset.map(make_conversation, fn_kwargs={"script_args": script_args})
# 3. Load model and tokenizer
model = get_model(model_args, training_args)
tokenizer = get_tokenizer(model_args, training_args)
# 4. Resolve reward functions from registry
reward_funcs = get_reward_funcs(script_args, training_args)
# 5. Initialize trainer and run
trainer = GRPOTrainer(
model=model,
reward_funcs=reward_funcs,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset.get("test"),
peft_config=get_peft_config(model_args),
callbacks=get_callbacks(training_args, model_args),
processing_class=tokenizer,
)
trainer.train(resume_from_checkpoint=last_checkpoint)
trainer.save_model(training_args.output_dir)