Workflow:Huggingface Trl GRPO Training

Knowledge Sources	HuggingFace TRL TRL GRPO Trainer Docs DeepSeekMath GRPO
Domains	LLMs, Reinforcement_Learning, RLHF
Last Updated	2026-02-06 16:00 GMT

Overview

End-to-end process for training language models with Group Relative Policy Optimization (GRPO), an online reinforcement learning method that generates multiple completions per prompt and uses group-level reward normalization for policy updates.

Description

This workflow implements online RL training where the model generates multiple completions for each prompt, scores them with reward functions, and updates the policy using relative advantages within each group. GRPO eliminates the need for a separate critic/value model by computing advantages as normalized reward differences within each generation group. TRL's GRPOTrainer supports multiple reward sources (callable functions, HuggingFace reward models, or custom model instances), optional vLLM integration for fast generation, and multiple loss variants including standard GRPO, DAPO, Dr. GRPO, BNPO, CISPO, and SAPO.

Usage

Execute this workflow when you want to improve a language model's reasoning or task performance using reward signals rather than static preference data. This is especially effective for math reasoning, code generation, and structured output tasks where reward functions can be defined programmatically (e.g., checking answer correctness). GRPO is the recommended online RL method in TRL for its simplicity and effectiveness.

Execution Steps

Step 1: Environment and Argument Configuration

Configure the GRPO training run including the policy model, reward functions, generation parameters, and training hyperparameters. GRPO has unique parameters controlling the generation-training balance.

Key considerations:

num_generations controls how many completions are generated per prompt (e.g., 8-16)
max_completion_length sets the maximum length of generated responses
Generation parameters: temperature, top_p, top_k control sampling diversity
use_vllm enables vLLM backend for faster generation (server or colocate mode)
steps_per_generation amortizes generation cost across multiple training steps

Step 2: Reward Function Definition

Define or load the reward functions that score model completions. Reward functions are the core signal driving GRPO training. They can be Python callables, HuggingFace reward model identifiers, or custom model instances.

Key considerations:

Callable rewards receive completions (list of strings) and optional kwargs with prompt metadata
Multiple reward functions can be combined; their scores are summed
Built-in rewards include accuracy_reward (math answer checking), think_format_reward (structured output), and get_soft_overlong_punishment (length penalty)
Reward models (string identifiers) use AutoModelForSequenceClassification internally
Custom reward functions can access any column from the dataset via kwargs

Step 3: Model Loading

Load the policy model, typically starting from an SFT-trained checkpoint. GRPO can load models lazily (from path string) or from pre-instantiated model objects. The model generates completions and is updated via policy gradient.

Key considerations:

Pass the model as a string path for automatic loading with model_init_kwargs
Use bfloat16 dtype for training efficiency
Optional quantization via BitsAndBytesConfig for memory-constrained setups
With vLLM colocate mode, the model shares GPU memory between training and generation

Step 4: Prompt Dataset Loading

Load the prompt-only dataset. Unlike SFT or DPO, GRPO only requires prompts; the model generates completions during training. Additional dataset columns are forwarded to reward functions as keyword arguments.

Key considerations:

Dataset must have a prompt field (plain text or conversational format)
Extra columns (e.g., ground truth answers) are passed to reward functions automatically
For math tasks, include the expected answer for accuracy reward computation
Dataset is batched with num_generations copies per prompt

Step 5: Trainer Initialization

Create the GRPOTrainer with the policy model, reward functions, prompt dataset, and optional PEFT configuration. The trainer sets up the generation pipeline, reward computation, and policy optimization loop.

Key considerations:

If using vLLM, the trainer starts the vLLM server/colocate process automatically
A reference model is created internally for KL penalty computation
peft_config wraps the model with LoRA adapters if provided
The trainer handles multi-GPU generation and reward computation

Step 6: Generation, Scoring, and Training Loop

The training loop alternates between three phases: (1) generate completions from the current policy, (2) score completions with reward functions, and (3) update the policy using group-relative advantages.

What happens per training iteration:

Generation phase: For each prompt, generate num_generations completions using the current policy (or vLLM)
Reward phase: Compute reward scores for each completion using all reward functions
Advantage computation: Normalize rewards within each group (mean-center, std-normalize)
Policy update: Compute clipped policy gradient loss weighted by advantages
KL penalty: Optionally penalize deviation from the reference model

Key considerations:

Group normalization means only relative performance within a batch matters
The clip_range (epsilon) prevents overly large policy updates
num_iterations allows multiple gradient steps per generation batch
Monitor reward/mean, kl, and loss/policy_gradient for training health

Step 7: Model Saving and Distribution

Save the trained policy model (or LoRA adapters) after training completes. The model is ready for inference or further training.

Key considerations:

With PEFT, only adapter weights are saved
Push to HuggingFace Hub for sharing and deployment
The trained model shows improved performance on the reward-targeted task

Execution Diagram

GitHub URL

Workflow Repository