Workflow:Unslothai Unsloth GRPO Reinforcement Learning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, GRPO |
| Last Updated | 2026-02-07 09:00 GMT |
Overview
End-to-end process for training reasoning capabilities in language models using Group Relative Policy Optimization (GRPO) with custom reward functions and Unsloth's memory-efficient RL pipeline.
Description
This workflow implements reinforcement learning for LLMs using GRPO, a policy optimization algorithm that trains models to produce structured reasoning traces and correct answers. It optionally begins with a supervised fine-tuning (SFT) warmup stage to establish baseline instruction-following, then transitions to GRPO training where the model generates multiple candidate responses per prompt and learns from reward signals. Unsloth patches TRL's GRPOTrainer to use optimized batching algorithms that enable 7x longer context RL with 50-80% less VRAM. The workflow supports multiple reward functions including format compliance, answer correctness, and numerical accuracy checking.
Key capabilities:
- 80% less VRAM compared to standard GRPO implementations
- 7x longer context windows for RL training through optimized batching
- FP8 RL training support on consumer GPUs
- Compatible with GRPO, GSPO, DrGRPO, DAPO, PPO, DPO, KTO, and other TRL trainers
- vLLM integration for fast generation during RL rollouts
- Custom reward function interface for domain-specific training signals
Usage
Execute this workflow when you want to train a model to produce structured reasoning (e.g., chain-of-thought with explicit answer extraction) and have a dataset with verifiable ground-truth answers (such as math problems, code, or factual QA). This is particularly useful for transforming general-purpose models into reasoning-capable models, similar to DeepSeek-R1 style training.
Execution Steps
Step 1: Base Model Loading
Load the base or instruction-tuned model with Unsloth's unified loader, enabling vLLM fast inference for efficient generation during RL rollouts. The loader configures the model for both training and inference modes, allocating GPU memory between the training process and the vLLM inference engine.
Key considerations:
- Enable fast_inference=True to use vLLM for generation during GRPO rollouts
- Set gpu_memory_utilization to balance between training and inference memory
- Set max_lora_rank to accommodate the LoRA rank you plan to use
- Choose an instruction-tuned base model for better starting performance
Step 2: Dataset Preparation
Prepare the training dataset with prompt-answer pairs in the format expected by GRPO. Each example needs a prompt field (list of message dicts) and an answer field (ground-truth for reward computation). Apply the appropriate chat template and system prompt that instructs the model to produce structured output with reasoning tags.
Key considerations:
- Format prompts as lists of message dicts with system, user, and optionally assistant roles
- Include a system prompt that defines the expected output structure (e.g., reasoning/answer XML tags)
- Ensure ground-truth answers are extractable and comparable for reward computation
- Consider an optional SFT warmup stage on high-quality reasoning examples before GRPO
Step 3: Optional SFT Warmup
Optionally perform a supervised fine-tuning warmup stage to establish baseline instruction-following and format compliance before RL training. This is done by training on high-quality reasoning examples (such as LIMO or similar datasets) using train_on_responses_only to focus learning on the assistant's reasoning trace.
Key considerations:
- Use train_on_responses_only to mask loss on system and user tokens
- Train for 1 epoch on curated reasoning data
- Save a checkpoint after SFT warmup before transitioning to GRPO
- This step significantly improves GRPO convergence and format compliance
Step 4: Reward Function Definition
Define reward functions that evaluate generated responses on multiple criteria. GRPO uses these functions to compute per-sample rewards that guide policy optimization. Common reward functions check format compliance (presence of required XML tags), answer correctness (numerical comparison with ground truth), and response quality.
Reward function types:
- Format compliance: checks for required structural tags (e.g., reasoning/answer blocks)
- Approximate format: partial credit for close-to-correct formatting
- Answer correctness: exact and approximate numerical matching against ground truth
- Custom domain rewards: any callable that takes (prompts, completions, answer) and returns scores
Step 5: GRPO Training
Execute the GRPO training loop using TRL's GRPOTrainer with Unsloth's optimizations. For each batch, the model generates multiple candidate completions per prompt using vLLM, scores them with the reward functions, and updates the policy to favor higher-reward completions. Unsloth's patched trainer uses memory-efficient batching and optimized gradient computation.
Key considerations:
- Set num_generations (e.g., 8) for the number of rollouts per prompt
- Configure max_completion_length based on expected reasoning trace length
- Use cosine learning rate schedule with warmup
- Monitor reward signals and format compliance during training
- Save checkpoints periodically for recovery and evaluation
Step 6: Evaluation
Evaluate the trained model on a held-out benchmark (such as AIME math problems) to measure reasoning quality. Compare base model, SFT-warmup, and final GRPO model performance across metrics like exact match accuracy, format compliance, and answer plausibility.
Key considerations:
- Use consistent evaluation parameters (temperature, top_p, seed) across comparisons
- Sample multiple responses per problem (Pass@K evaluation)
- Compare format compliance rates between training stages
- Evaluate on problems not seen during training
Step 7: Model Saving
Save the final GRPO-trained model by merging the LoRA adapter weights into the base model. The merged model can then be exported for deployment or further training.
Save options:
- Merged 16-bit for maximum quality preservation
- LoRA checkpoint only for lightweight storage
- Push to HuggingFace Hub for sharing