
Implementation:Huggingface Trl GRPOTrainer Init

From Leeroopedia


Property Value
Implementation Name GRPOTrainer Initialization
Library Huggingface TRL
Type API Doc
Source Files trl/trainer/grpo_trainer.py (L248-782)
Import from trl import GRPOTrainer, GRPOConfig

Overview

Description

The GRPOTrainer.__init__ method sets up the complete GRPO training pipeline. It extends BaseTrainer (which in turn extends transformers.Trainer) and adds GRPO-specific initialization for model loading, reward function setup, reference model management, generation backend configuration, and distributed training preparation.

Usage

from trl import GRPOTrainer, GRPOConfig
from trl.rewards import accuracy_reward, think_format_reward

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[accuracy_reward, think_format_reward],
    args=GRPOConfig(output_dir="./output", num_generations=8),
    train_dataset=dataset,
)

Code Reference

Source Location

Component File Lines
GRPOTrainer.__init__ trl/trainer/grpo_trainer.py L248-782
Class definition and docstring trl/trainer/grpo_trainer.py L122-230

Signature

class GRPOTrainer(BaseTrainer):
    def __init__(
        self,
        model: str | PreTrainedModel | PeftModel,
        reward_funcs: RewardFunc | list[RewardFunc],
        args: GRPOConfig | None = None,
        train_dataset: Dataset | IterableDataset | None = None,
        eval_dataset: Dataset | IterableDataset | dict[str, Dataset | IterableDataset] | None = None,
        processing_class: PreTrainedTokenizerBase | ProcessorMixin | None = None,
        reward_processing_classes: PreTrainedTokenizerBase | list[PreTrainedTokenizerBase] | None = None,
        callbacks: list[TrainerCallback] | None = None,
        optimizers: tuple[torch.optim.Optimizer | None, torch.optim.lr_scheduler.LambdaLR | None] = (None, None),
        peft_config: PeftConfig | None = None,
        tools: list[Callable] | None = None,
        rollout_func: RolloutFunc | None = None,
    ):

Import

from trl import GRPOTrainer, GRPOConfig

I/O Contract

Inputs

Parameter Type Required Description
model str | PreTrainedModel | PeftModel Yes Model ID string, local path, or pre-loaded model object. String paths trigger deferred loading via create_model_from_path.
reward_funcs RewardFunc | list[RewardFunc] Yes One or more reward functions. Can be model ID strings (loaded as AutoModelForSequenceClassification), PreTrainedModel instances, or callables.
args GRPOConfig | None No Training configuration. If None, defaults are created from the model name.
train_dataset Dataset | IterableDataset | None No Dataset with a "prompt" column. IterableDataset is not yet supported.
eval_dataset Dataset | IterableDataset | dict | None No Evaluation dataset(s) following the same format as train_dataset.
processing_class PreTrainedTokenizerBase | ProcessorMixin | None No Tokenizer or processor. If None, loaded from the model config. Padding side is set to "left".
reward_processing_classes PreTrainedTokenizerBase | list[PreTrainedTokenizerBase] | None No Tokenizers for model-based reward functions. Auto-loaded from reward model configs if None.
callbacks list[TrainerCallback] | None No Custom training callbacks.
optimizers tuple No Custom (optimizer, scheduler) pair. Defaults to (None, None) for standard AdamW.
peft_config PeftConfig | None No PEFT configuration for adapter wrapping. Must not be used with a pre-wrapped PeftModel.
tools list[Callable] | None No Tool functions for agentic training. Requires transformers >= 5.0.0 and jmespath.
rollout_func RolloutFunc | None No Custom generation function (experimental). Must return a dict with "prompt_ids", "completion_ids", and "logprobs".
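Callable reward functions follow a simple contract: TRL invokes them with the batch of completions (and any extra dataset columns as keyword arguments) and expects one float score per completion. A minimal sketch, assuming standard-format (plain string) completions; the length_reward name and scoring rule are illustrative, not part of TRL:

```python
# Minimal callable reward function sketch (standard string completions).
# TRL passes the completions, plus extra dataset columns as keyword
# arguments, and expects a list with one float score per completion.

def length_reward(completions, **kwargs):
    """Favor shorter completions: score decays from 1.0 as length grows."""
    return [1.0 / (1.0 + len(c)) for c in completions]

scores = length_reward(completions=["ok", "a much longer completion"])
```

A function like this can be passed directly in reward_funcs, alone or alongside model-based rewards.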

Outputs

Output Type Description
trainer instance GRPOTrainer Fully initialized trainer ready for .train() invocation.

Usage Examples

Minimal initialization:

from trl import GRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
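The only required column is "prompt", which may hold plain strings (standard format) or lists of chat messages (conversational format). An illustrative sketch of both row layouts, shown here as plain dicts (in practice these would be built into a datasets.Dataset):

```python
# Standard format: the "prompt" column holds plain strings.
standard_rows = {"prompt": ["The capital of France is", "2 + 2 ="]}

# Conversational format: the "prompt" column holds lists of chat messages.
conversational_rows = {
    "prompt": [
        [{"role": "user", "content": "What color is the sky?"}],
    ]
}
```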

Full initialization with multiple reward functions and vLLM:

from trl import GRPOTrainer, GRPOConfig
from trl.rewards import accuracy_reward, think_format_reward
from peft import LoraConfig

config = GRPOConfig(
    output_dir="./grpo_output",
    num_generations=16,
    max_completion_length=1024,
    beta=0.001,
    loss_type="dapo",
    use_vllm=True,
    vllm_mode="server",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[accuracy_reward, think_format_reward],
    args=config,
    train_dataset=dataset,
    peft_config=peft_config,
)

Initialization with a pretrained reward model:

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[
        "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",  # loaded as AutoModelForSequenceClassification
        accuracy_reward,  # callable reward function
    ],
    args=config,
    train_dataset=dataset,
)
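For comparison with the mixed list above, a hedged sketch of what a callable format reward might look like (illustrative only, not TRL's actual think_format_reward implementation):

```python
import re

# Illustrative format reward (a stand-in, not TRL's think_format_reward):
# score 1.0 when a completion opens with a <think>...</think> block, else 0.0.
THINK_PATTERN = re.compile(r"^<think>.*?</think>", re.DOTALL)

def simple_think_reward(completions, **kwargs):
    """Return one binary format score per completion."""
    return [1.0 if THINK_PATTERN.match(c) else 0.0 for c in completions]
```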
