Workflow:Unslothai Unsloth GRPO Reinforcement Learning

Knowledge Sources	Unsloth Unsloth Docs RL Guide Memory-Efficient RL
Domains	LLMs, Reinforcement_Learning, GRPO
Last Updated	2026-02-07 09:00 GMT

Overview

End-to-end process for training reasoning capabilities in language models using Group Relative Policy Optimization (GRPO) with custom reward functions and Unsloth's memory-efficient RL pipeline.

Description

This workflow implements reinforcement learning for LLMs using GRPO, a policy optimization algorithm that trains models to produce structured reasoning traces and correct answers. It optionally begins with a supervised fine-tuning (SFT) warmup stage to establish baseline instruction-following, then transitions to GRPO training where the model generates multiple candidate responses per prompt and learns from reward signals. Unsloth patches TRL's GRPOTrainer to use optimized batching algorithms that enable 7x longer context RL with 50-80% less VRAM. The workflow supports multiple reward functions including format compliance, answer correctness, and numerical accuracy checking.

Key capabilities:

80% less VRAM compared to standard GRPO implementations
7x longer context windows for RL training through optimized batching
FP8 RL training support on consumer GPUs
Compatible with GRPO, GSPO, DrGRPO, DAPO, PPO, DPO, KTO, and other TRL trainers
vLLM integration for fast generation during RL rollouts
Custom reward function interface for domain-specific training signals

Usage

Execute this workflow when you want to train a model to produce structured reasoning (e.g., chain-of-thought with explicit answer extraction) and have a dataset with verifiable ground-truth answers (such as math problems, code, or factual QA). This is particularly useful for transforming general-purpose models into reasoning-capable models, similar to DeepSeek-R1 style training.

Execution Steps

Step 1: Base Model Loading

Load the base or instruction-tuned model with Unsloth's unified loader, enabling vLLM fast inference for efficient generation during RL rollouts. The loader configures the model for both training and inference modes, allocating GPU memory between the training process and the vLLM inference engine.

Key considerations:

Enable fast_inference=True to use vLLM for generation during GRPO rollouts
Set gpu_memory_utilization to balance between training and inference memory
Set max_lora_rank to accommodate the LoRA rank you plan to use
Choose an instruction-tuned base model for better starting performance

Step 2: Dataset Preparation

Prepare the training dataset with prompt-answer pairs in the format expected by GRPO. Each example needs a prompt field (list of message dicts) and an answer field (ground-truth for reward computation). Apply the appropriate chat template and system prompt that instructs the model to produce structured output with reasoning tags.

Key considerations:

Format prompts as lists of message dicts with system, user, and optionally assistant roles
Include a system prompt that defines the expected output structure (e.g., reasoning/answer XML tags)
Ensure ground-truth answers are extractable and comparable for reward computation
Consider an optional SFT warmup stage on high-quality reasoning examples before GRPO

Step 3: Optional SFT Warmup

Optionally perform a supervised fine-tuning warmup stage to establish baseline instruction-following and format compliance before RL training. This is done by training on high-quality reasoning examples (such as LIMO or similar datasets) using train_on_responses_only to focus learning on the assistant's reasoning trace.

Key considerations:

Use train_on_responses_only to mask loss on system and user tokens
Train for 1 epoch on curated reasoning data
Save a checkpoint after SFT warmup before transitioning to GRPO
This step significantly improves GRPO convergence and format compliance

Step 4: Reward Function Definition

Define reward functions that evaluate generated responses on multiple criteria. GRPO uses these functions to compute per-sample rewards that guide policy optimization. Common reward functions check format compliance (presence of required XML tags), answer correctness (numerical comparison with ground truth), and response quality.

Reward function types:

Format compliance: checks for required structural tags (e.g., reasoning/answer blocks)
Approximate format: partial credit for close-to-correct formatting
Answer correctness: exact and approximate numerical matching against ground truth
Custom domain rewards: any callable that takes (prompts, completions, answer) and returns scores

Step 5: GRPO Training

Execute the GRPO training loop using TRL's GRPOTrainer with Unsloth's optimizations. For each batch, the model generates multiple candidate completions per prompt using vLLM, scores them with the reward functions, and updates the policy to favor higher-reward completions. Unsloth's patched trainer uses memory-efficient batching and optimized gradient computation.

Key considerations:

Set num_generations (e.g., 8) for the number of rollouts per prompt
Configure max_completion_length based on expected reasoning trace length
Use cosine learning rate schedule with warmup
Monitor reward signals and format compliance during training
Save checkpoints periodically for recovery and evaluation

Step 6: Evaluation

Evaluate the trained model on a held-out benchmark (such as AIME math problems) to measure reasoning quality. Compare base model, SFT-warmup, and final GRPO model performance across metrics like exact match accuracy, format compliance, and answer plausibility.

Key considerations:

Use consistent evaluation parameters (temperature, top_p, seed) across comparisons
Sample multiple responses per problem (Pass@K evaluation)
Compare format compliance rates between training stages
Evaluate on problems not seen during training

Step 7: Model Saving

Save the final GRPO-trained model by merging the LoRA adapter weights into the base model. The merged model can then be exported for deployment or further training.

Save options:

Merged 16-bit for maximum quality preservation
LoRA checkpoint only for lightweight storage
Push to HuggingFace Hub for sharing

Execution Diagram

GitHub URL

Workflow Repository