Workflow: Hugging Face Open R1 GRPO Reasoning Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, Reasoning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
End-to-end process for training reasoning models using Group Relative Policy Optimization (GRPO) with configurable reward functions including accuracy, format, code execution, and repetition penalties.
Description
This workflow implements the pure reinforcement learning pipeline for improving reasoning capabilities in language models. It uses GRPO (Group Relative Policy Optimization), where the model generates multiple candidate responses per prompt, scores them using a configurable set of reward functions, and updates its policy to favor higher-reward responses. The reward system is modular with 14 registered functions spanning mathematical accuracy verification, output format compliance, code execution evaluation (via E2B/Morph/Piston sandboxes), length-based penalties, and repetition control. Training supports both single-node (colocated vLLM) and multi-node (separate vLLM server) configurations.
Goal: A model with improved reasoning capabilities, trained via RL to produce accurate, well-formatted, and efficient responses.
Scope: From a base or distilled model and a verifiable problem dataset to an RL-trained model with enhanced reasoning.
Strategy: Uses TRL's GRPOTrainer with vLLM backend for fast generation, multiple reward signals for multi-objective optimization, and DeepSpeed for distributed training.
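The multi-reward scoring described above can be sketched in a few lines: each reward function returns a score per candidate response, and a weighted sum produces the scalar that GRPO optimizes. The function name and weights below are illustrative, not open-r1's actual implementation.

```python
# Toy illustration of multi-objective reward aggregation: several reward
# signals score every candidate, and per-function weights combine them
# into one scalar per candidate. Weights here are arbitrary examples.
def combine_rewards(per_func_scores: list[list[float]], weights: list[float]) -> list[float]:
    num_candidates = len(per_func_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, per_func_scores))
        for i in range(num_candidates)
    ]

# Two reward signals (say accuracy and format) over three candidates:
combined = combine_rewards([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]], weights=[1.0, 0.2])
# combined ≈ [1.2, 0.2, 1.0]: accuracy dominates, format breaks ties
```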
Usage
Execute this workflow when you want to improve a model's reasoning through reinforcement learning rather than supervised fine-tuning. This is appropriate when you have datasets with verifiable answers (math problems with solutions, coding problems with test cases) and want the model to learn to reason through RL exploration. This workflow supports training with code execution rewards, making it suitable for competitive programming model training (IOI, Codeforces).
Execution Steps
Step 1: Environment_Setup
Prepare the environment with core dependencies plus any code execution provider libraries. For standard math GRPO, the base installation suffices. For code reward training, install the relevant sandbox SDK and set the provider credentials (an E2B or Morph API key, or access to Piston workers). Optionally launch router services for sandbox providers to manage rate limits during high-throughput training.
Key considerations:
- Code execution requires provider-specific setup (E2B API key, Morph API key, or Piston workers)
- Router services prevent rate limiting when many training processes execute code simultaneously
- For IOI/Codeforces training, Piston workers must be deployed on separate compute nodes
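A minimal pre-flight check for the provider setup above might look like this. The environment variable names are assumptions based on common provider conventions (verify against each provider's documentation), and `PISTON_ENDPOINTS` is a hypothetical placeholder for a worker-endpoint list.

```python
import os

# Pre-flight check before launching GRPO training with code rewards.
# Env var names are assumptions (E2B_API_KEY, MORPH_API_KEY follow the
# providers' usual conventions; PISTON_ENDPOINTS is hypothetical).
def check_code_reward_env(provider: str) -> list[str]:
    """Return the settings still missing for the chosen sandbox provider."""
    required = {
        "e2b": ["E2B_API_KEY"],
        "morph": ["MORPH_API_KEY"],
        "piston": ["PISTON_ENDPOINTS"],
    }
    return [var for var in required.get(provider, []) if not os.environ.get(var)]
```

Running this once at startup fails fast, instead of discovering a missing key only when the first code reward is computed mid-training.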
Step 2: Configuration_Preparation
Create a YAML configuration specifying the model, dataset, reward functions, and GRPO-specific hyperparameters. The configuration selects which reward functions to use (from the registry of 14 options) with associated weights, sets the number of generations per prompt, and configures vLLM settings. Critically, the chat template must be carefully set for distilled DeepSeek models to avoid interfering with format rewards.
Key considerations:
- Reward functions are selected by name from the registry (accuracy, format, tag_count, code, ioi_code, cf_code, etc.)
- Reward weights control the relative importance of each signal
- The chat template for DeepSeek models must be overridden to include reasoning block content
- System prompt guides the model to use think/answer format
- vLLM can run colocated (single node) or as a separate server (multi-node)
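A sketch of the kind of recipe such a configuration might contain, written as a Python dict for illustration (the real recipes are YAML files). Field names are modeled on TRL's GRPOConfig and the reward registry described above; the model and dataset names are examples, not prescriptions.

```python
# Illustrative GRPO recipe as a Python dict (actual recipes are YAML).
# Field names follow TRL's GRPOConfig conventions; values are examples.
grpo_recipe = {
    "model_name_or_path": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "dataset_name": "open-r1/OpenR1-Math-220k",
    "dataset_prompt_column": "problem",
    "system_prompt": "Reason inside <think>...</think>, then answer.",
    # Reward functions selected by registry name, with matching weights.
    "reward_funcs": ["accuracy", "format", "tag_count"],
    "reward_weights": [1.0, 0.2, 0.2],
    # GRPO-specific generation settings.
    "num_generations": 16,        # group size for relative advantages
    "temperature": 0.7,
    "use_vllm": True,             # colocated vLLM on a single node
    "learning_rate": 1.0e-6,
    "gradient_checkpointing": True,
}
```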
Step 3: Dataset_Loading_and_Formatting
Load the training dataset from the HuggingFace Hub and format each example into a conversation structure. The prompt column is mapped to a user message, with an optional system prompt prepended. For code training, the dataset must include a verification_info column with test cases. Dataset mixtures are supported for blending multiple problem sources.
Key considerations:
- The dataset_prompt_column config specifies which column contains the problem text
- Code datasets need a verification_info column with test cases and language specification
- For IOI/Codeforces datasets, additional metadata columns (subtask info, test case paths) are required
- The messages column is removed after formatting to avoid conflicts
Step 4: Model_Loading_and_Reward_Setup
Load the base model and tokenizer, then resolve reward functions from the registry. Each reward function string name is mapped to its callable implementation, with parameterized rewards (cosine, repetition penalty, soft overlong punishment) receiving their configuration values. Code execution providers are initialized based on the selected provider type and router URLs.
Key considerations:
- Reward functions are resolved dynamically from REWARD_FUNCS_REGISTRY
- Parameterized rewards use partial application with config-driven parameters
- Code execution providers support E2B, Morph, and Piston backends
- PEFT (LoRA) configuration can be applied for parameter-efficient training
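A toy version of the registry resolution with partial application: string names map to callables, and parameterized rewards are bound to their config values up front. The two reward functions here are simplified stand-ins for the 14 real entries, and the penalty logic is deliberately minimal.

```python
from functools import partial

# Stand-in reward functions: each takes a batch of completion strings
# and returns one score per completion.
def format_reward(completions, **kwargs):
    # 1.0 if the completion contains both think and answer blocks.
    return [1.0 if "<think>" in c and "<answer>" in c else 0.0 for c in completions]

def repetition_penalty_reward(completions, ngram_size, max_penalty, **kwargs):
    # Penalize duplicated n-grams (heavily simplified for illustration).
    rewards = []
    for c in completions:
        words = c.split()
        total = max(len(words) - ngram_size + 1, 1)
        unique = {tuple(words[i:i + ngram_size]) for i in range(total)}
        rewards.append(max_penalty * (1 - len(unique) / total))
    return rewards

# Parameterized rewards are bound via partial application, mirroring the
# config-driven setup described above (registry contents are stand-ins).
REWARD_FUNCS_REGISTRY = {
    "format": format_reward,
    "repetition_penalty": partial(repetition_penalty_reward,
                                  ngram_size=3, max_penalty=-1.0),
}

def get_reward_funcs(names):
    return [REWARD_FUNCS_REGISTRY[name] for name in names]
```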
Step 5: GRPO_Training_Loop
Launch the GRPOTrainer which orchestrates the training loop. For each batch: the policy generates num_generations responses per prompt using vLLM, reward functions score each response, and the GRPO algorithm computes advantages relative to the group mean and updates the policy. Training supports checkpoint resumption, gradient checkpointing, and periodic evaluation.
Key considerations:
- num_generations (typically 14-16) controls the group size for relative advantage computation
- vLLM handles fast parallel generation with configurable temperature
- Training can span multiple nodes with separate vLLM server node
- Gradient accumulation and checkpointing manage memory on large models
- W&B logging tracks rewards, completion lengths, and training metrics per step
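The group-relative advantage at the heart of GRPO can be shown numerically: each response's total reward is baselined against the mean over its num_generations siblings and normalized by the group's standard deviation. This follows the published GRPO formulation; TRL's internal implementation may differ in detail (e.g., scaling and clipping options).

```python
# Compute GRPO-style advantages for one prompt's group of sampled
# completions: subtract the group mean, divide by the group std.
def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, four sampled completions scored by the reward functions:
group_rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(group_rewards)
# Correct completions get positive advantage, incorrect ones negative,
# so the policy update needs no separate learned value baseline.
```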
Step 6: Model_Saving_and_Publishing
Save the trained model with aligned generation config. The EOS token is synchronized to prevent unbounded generation. A model card is created with training metadata and pushed to the HuggingFace Hub. Per-checkpoint Hub revisions can be enabled via the PushToHubRevisionCallback for fine-grained model selection.
Key considerations:
- Per-checkpoint publishing enables evaluating intermediate training states
- The KV cache is re-enabled for inference after training
- Hub revisions include step numbers for tracking training progress
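The save-time housekeeping above can be sketched on plain dicts standing in for the transformers config objects; the revision-naming format is an assumption for illustration, not the exact PushToHubRevisionCallback behavior.

```python
# Align configs before saving: sync the EOS token so inference stops,
# re-enable the KV cache, and derive a per-checkpoint Hub revision name.
# Dicts stand in for the real transformers config classes; the revision
# format ("<base>-step-<N>") is an illustrative assumption.
def prepare_for_save(model_cfg: dict, gen_cfg: dict, eos_token_id: int,
                     global_step: int, base_revision: str = "main"):
    gen_cfg["eos_token_id"] = eos_token_id   # prevent unbounded generation
    model_cfg["use_cache"] = True            # re-enable KV cache for inference
    revision = f"{base_revision}-step-{global_step}"
    return model_cfg, gen_cfg, revision
```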