Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Huggingface Open r1 GRPOTrainer Usage

From Leeroopedia


Metadata

Field Value
Sources Repo: huggingface/open-r1; Doc: TRL GRPOTrainer
Domains NLP, Reinforcement_Learning, Training
Last Updated 2026-02-08 00:00 GMT

Principle:Huggingface_Open_r1_GRPO_Training

Overview

Wrapper for HuggingFace TRL's GRPOTrainer for reinforcement learning-based training with configurable reward functions, configured with Open-R1's custom GRPOConfig and reward registry.

Description

This is a Wrapper Doc. Open-R1 uses TRL's GRPOTrainer with its extended GRPOConfig that adds reward function selection, code execution providers, benchmark callbacks, and Hub revision pushing. The trainer accepts a list of reward functions (resolved from the REWARD_FUNCS_REGISTRY), supports chat-formatted prompts via make_conversation, and integrates with Open-R1's callback system for per-checkpoint evaluation.

The training pipeline in grpo.py orchestrates the full workflow: parsing configuration, loading the model and dataset, resolving reward functions from the registry, initializing the trainer, and executing the training loop with checkpoint resumption support.

Usage

Use when running the Open-R1 GRPO training workflow for improving reasoning capabilities via RL. This is typically invoked through accelerate launch with a YAML recipe configuration file that specifies all hyperparameters, reward functions, and dataset settings.

Code Reference

Source Location

Property Value
Repository open-r1
File src/open_r1/grpo.py
Lines L112-137

Signature

trainer = GRPOTrainer(
    model=model,                           # AutoModelForCausalLM
    reward_funcs=reward_funcs,             # list[Callable]
    args=training_args,                     # GRPOConfig
    train_dataset=dataset[train_split],     # Dataset
    eval_dataset=dataset[test_split],       # Optional[Dataset]
    peft_config=get_peft_config(model_args),  # Optional[PeftConfig]
    callbacks=get_callbacks(training_args, model_args),  # list[TrainerCallback]
    processing_class=tokenizer,             # PreTrainedTokenizer
)
train_result = trainer.train(resume_from_checkpoint=checkpoint)

Import

from trl import GRPOTrainer, get_peft_config

External Reference

TRL GRPOTrainer Documentation

I/O Contract

Inputs

Parameter Type Required Description
model AutoModelForCausalLM Yes The language model to train via GRPO reinforcement learning.
reward_funcs list[Callable] Yes List of reward functions resolved from REWARD_FUNCS_REGISTRY. Each function scores a completion and returns a numeric reward.
args GRPOConfig Yes Extended training configuration including learning_rate, num_generations, temperature, max_completion_length, and beta (KL penalty coefficient).
train_dataset Dataset Yes Training dataset with chat-formatted prompts (conversation structure with user messages).
eval_dataset Dataset No Optional evaluation dataset for periodic validation during training.
peft_config PeftConfig No Optional PEFT/LoRA configuration for parameter-efficient training.
callbacks list[TrainerCallback] No Optional list of callbacks (e.g., benchmark evaluation, Hub revision pushing).
processing_class PreTrainedTokenizer Yes Tokenizer for encoding prompts and decoding completions.

Outputs

Output Type Description
Return value TrainOutput Training result object containing metrics (loss, rewards, completion lengths).
Side effect Model weights Model parameters are updated in-place via the GRPO RL algorithm.
Side effect Metrics Training and evaluation metrics are logged to the configured logger (W&B, TensorBoard) and saved to disk.

Usage Example

The following shows the GRPO training setup as structured in grpo.py:

# 1. Parse configuration from YAML recipe
parser = TrlParser((GRPOScriptArguments, GRPOConfig, ModelConfig))
script_args, training_args, model_args = parser.parse_args_and_config()

# 2. Load dataset and format as conversations
dataset = get_dataset(script_args, training_args)
dataset = dataset.map(make_conversation, fn_kwargs={"script_args": script_args})

# 3. Load model and tokenizer
model = get_model(model_args, training_args)
tokenizer = get_tokenizer(model_args, training_args)

# 4. Resolve reward functions from registry
reward_funcs = get_reward_funcs(script_args, training_args)

# 5. Initialize trainer and run
trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_funcs,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset.get("test"),
    peft_config=get_peft_config(model_args),
    callbacks=get_callbacks(training_args, model_args),
    processing_class=tokenizer,
)
trainer.train(resume_from_checkpoint=last_checkpoint)
trainer.save_model(training_args.output_dir)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment