Implementation:Huggingface Trl GRPOTrainer Init
| Property | Value |
|---|---|
| Implementation Name | GRPOTrainer Initialization |
| Library | Huggingface TRL |
| Type | API Doc |
| Source Files | trl/trainer/grpo_trainer.py (L248-782) |
| Import | from trl import GRPOTrainer, GRPOConfig |
Overview
Description
The GRPOTrainer.__init__ method sets up the complete GRPO training pipeline. It extends BaseTrainer (which in turn extends transformers.Trainer) and adds GRPO-specific initialization for model loading, reward function setup, reference model management, generation backend configuration, and distributed training preparation.
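Because the model argument accepts either a model ID string or a pre-loaded object, the two forms are interchangeable at the call site; a minimal sketch (the tiny dataset and constant reward function below are placeholders for illustration, not part of the TRL API):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM
from trl import GRPOTrainer

# Loading the model yourself and passing the object is roughly equivalent
# to passing the model ID string; a string path would instead be resolved
# lazily via create_model_from_path.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Tiny illustrative dataset: GRPO expects a "prompt" column.
dataset = Dataset.from_dict({"prompt": ["What is 2 + 2?", "Name a prime number."]})

def constant_reward(completions, **kwargs):
    # Placeholder reward so the example is self-contained.
    return [0.0 for _ in completions]

trainer = GRPOTrainer(
    model=model,
    reward_funcs=constant_reward,
    train_dataset=dataset,
)
```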
Usage

```python
from trl import GRPOTrainer, GRPOConfig
from trl.rewards import accuracy_reward, think_format_reward

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[accuracy_reward, think_format_reward],
    args=GRPOConfig(output_dir="./output", num_generations=8),
    train_dataset=dataset,
)
```
Code Reference
Source Location
| Component | File | Lines |
|---|---|---|
| GRPOTrainer.__init__ | trl/trainer/grpo_trainer.py | L248-782 |
| Class definition and docstring | trl/trainer/grpo_trainer.py | L122-230 |
Signature
```python
class GRPOTrainer(BaseTrainer):
    def __init__(
        self,
        model: str | PreTrainedModel | PeftModel,
        reward_funcs: RewardFunc | list[RewardFunc],
        args: GRPOConfig | None = None,
        train_dataset: Dataset | IterableDataset | None = None,
        eval_dataset: Dataset | IterableDataset | dict[str, Dataset | IterableDataset] | None = None,
        processing_class: PreTrainedTokenizerBase | ProcessorMixin | None = None,
        reward_processing_classes: PreTrainedTokenizerBase | list[PreTrainedTokenizerBase] | None = None,
        callbacks: list[TrainerCallback] | None = None,
        optimizers: tuple[torch.optim.Optimizer | None, torch.optim.lr_scheduler.LambdaLR | None] = (None, None),
        peft_config: PeftConfig | None = None,
        tools: list[Callable] | None = None,
        rollout_func: RolloutFunc | None = None,
    ):
```
Import

```python
from trl import GRPOTrainer, GRPOConfig
```
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | str \| PreTrainedModel \| PeftModel | Yes | Model ID string, local path, or pre-loaded model object. String paths trigger deferred loading via create_model_from_path. |
| reward_funcs | RewardFunc \| list[RewardFunc] | Yes | One or more reward functions. Can be model ID strings (loaded as AutoModelForSequenceClassification), PreTrainedModel instances, or callables (see the sketch after this table). |
| args | GRPOConfig \| None | No | Training configuration. If None, defaults are created from the model name. |
| train_dataset | Dataset \| IterableDataset \| None | No | Dataset with a "prompt" column. IterableDataset is not yet supported. |
| eval_dataset | Dataset \| IterableDataset \| dict[str, Dataset \| IterableDataset] \| None | No | Evaluation dataset(s) following the same format as train_dataset. |
| processing_class | PreTrainedTokenizerBase \| ProcessorMixin \| None | No | Tokenizer or processor. If None, loaded from the model config. Padding side is set to "left". |
| reward_processing_classes | PreTrainedTokenizerBase \| list[PreTrainedTokenizerBase] \| None | No | Tokenizers for model-based reward functions. Auto-loaded from reward model configs if None. |
| callbacks | list[TrainerCallback] \| None | No | Custom training callbacks. |
| optimizers | tuple[torch.optim.Optimizer \| None, torch.optim.lr_scheduler.LambdaLR \| None] | No | Custom (optimizer, scheduler) pair. Defaults to (None, None) for standard AdamW. |
| peft_config | PeftConfig \| None | No | PEFT configuration for adapter wrapping. Must not be used with a pre-wrapped PeftModel. |
| tools | list[Callable] \| None | No | Tool functions for agentic training. Requires transformers >= 5.0.0 and jmespath (see the sketch in Usage Examples). |
| rollout_func | RolloutFunc \| None | No | Custom generation function (experimental). Must return a dict with "prompt_ids", "completion_ids", and "logprobs" (see the example at the end of Usage Examples). |
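Below is a minimal sketch of a custom callable reward function, assuming the standard TRL contract: the trainer calls it with the batch of completions (plus prompts and any extra dataset columns as keyword arguments) and expects one float per completion. The function name and scoring rule are purely illustrative.

```python
def brevity_reward(completions, **kwargs):
    """Toy reward: 1.0 for completions under 200 characters, else 0.0.

    For standard (non-conversational) datasets each completion is a plain
    string; conversational datasets pass lists of message dicts instead.
    """
    return [1.0 if len(completion) <= 200 else 0.0 for completion in completions]
```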
Outputs
| Output | Type | Description |
|---|---|---|
| trainer instance | GRPOTrainer | Fully initialized trainer ready for .train() invocation. |
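The returned instance then drives the standard transformers.Trainer loop, for example:

```python
trainer.train()                      # run the GRPO training loop
trainer.save_model("./grpo_output")  # output path is illustrative; saves the adapter when PEFT is used
```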
Usage Examples
Minimal initialization:

```python
from trl import GRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
```
Full initialization with multiple reward functions and vLLM (with vllm_mode="server", a vLLM server must already be running, e.g. one started with the trl vllm-serve CLI):

```python
from trl import GRPOTrainer, GRPOConfig
from trl.rewards import accuracy_reward, think_format_reward
from peft import LoraConfig

config = GRPOConfig(
    output_dir="./grpo_output",
    num_generations=16,
    max_completion_length=1024,
    beta=0.001,
    loss_type="dapo",
    use_vllm=True,
    vllm_mode="server",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[accuracy_reward, think_format_reward],
    args=config,
    train_dataset=dataset,  # as loaded in the minimal example above
    peft_config=peft_config,
)
```
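Passing tool functions for agentic training: a hedged sketch in which the tool name and body are hypothetical, assuming tools follow the transformers tool-calling convention (typed parameters plus a Google-style docstring, from which the tool schema is derived).

```python
def multiply(a: float, b: float) -> float:
    """Multiply two numbers.

    Args:
        a: The first factor.
        b: The second factor.

    Returns:
        The product of a and b.
    """
    return a * b

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,
    tools=[multiply],  # requires transformers >= 5.0.0 and jmespath
)
```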
Initialization with a pretrained reward model:

```python
# Reuses `config` and `dataset` from the previous example.
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[
        "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",  # loaded as AutoModelForSequenceClassification
        accuracy_reward,  # callable reward function
    ],
    args=config,
    train_dataset=dataset,
)
```
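Experimental custom rollout function (see rollout_func in the I/O Contract). The required return keys come from the table above, but the parameter list shown here is an assumption; check the TRL source for the exact call contract.

```python
def my_rollout_func(prompts, args, processing_class):
    """Hypothetical rollout sketch: tokenize each prompt, produce a
    placeholder completion, and return the token IDs and per-token
    logprobs the trainer expects."""
    prompt_ids, completion_ids, logprobs = [], [], []
    for prompt in prompts:
        prompt_ids.append(processing_class(prompt)["input_ids"])
        # Placeholder generation: a real implementation would call a
        # custom inference backend here.
        completion_ids.append([processing_class.eos_token_id])
        logprobs.append([0.0])
    return {
        "prompt_ids": prompt_ids,
        "completion_ids": completion_ids,
        "logprobs": logprobs,
    }

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=accuracy_reward,
    args=config,
    train_dataset=dataset,
    rollout_func=my_rollout_func,  # experimental API
)
```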
Related Pages
- Principle:Huggingface_Trl_GRPO_Trainer_Initialization
- Environment:Huggingface_Trl_Python_Core_Dependencies
- Environment:Huggingface_Trl_vLLM_Generation_Environment
- Environment:Huggingface_Trl_PEFT_LoRA_Environment
- Environment:Huggingface_Trl_DeepSpeed_Environment
- Heuristic:Huggingface_Trl_Gradient_Checkpointing_Use_Reentrant
- Heuristic:Huggingface_Trl_QLoRA_BF16_Adapter_Casting
- Heuristic:Huggingface_Trl_Disable_Dropout_For_RL_Training
- Heuristic:Huggingface_Trl_DeepSpeed_ZeRO3_Generation_Tradeoff
- Heuristic:Huggingface_Trl_Distributed_Device_Map_Override