Workflow: Hugging Face Open R1 GRPO Reasoning Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, Reasoning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
End-to-end process for training reasoning models using Group Relative Policy Optimization (GRPO) with configurable reward functions including accuracy, format, code execution, and repetition penalties.
Description
This workflow implements the pure reinforcement learning pipeline for improving reasoning capabilities in language models. It uses GRPO (Group Relative Policy Optimization), where the model generates multiple candidate responses per prompt, scores them using a configurable set of reward functions, and updates its policy to favor higher-reward responses. The reward system is modular with 14 registered functions spanning mathematical accuracy verification, output format compliance, code execution evaluation (via E2B/Morph/Piston sandboxes), length-based penalties, and repetition control. Training supports both single-node (colocated vLLM) and multi-node (separate vLLM server) configurations.
Goal: A model with improved reasoning capabilities, trained via RL to produce accurate, well-formatted, and efficient responses.
Scope: From a base or distilled model and a verifiable problem dataset to an RL-trained model with enhanced reasoning.
Strategy: Uses TRL's GRPOTrainer with vLLM backend for fast generation, multiple reward signals for multi-objective optimization, and DeepSpeed for distributed training.
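The multi-reward scoring described above can be sketched in a few lines: each reward function returns a score per candidate response, and a weighted sum produces the scalar that GRPO optimizes. The function name and weights below are illustrative, not open-r1's actual implementation.

```python
# Toy illustration of multi-objective reward aggregation: several reward
# signals score every candidate, and per-function weights combine them
# into one scalar per candidate. Weights here are arbitrary examples.
def combine_rewards(per_func_scores: list[list[float]], weights: list[float]) -> list[float]:
    num_candidates = len(per_func_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, per_func_scores))
        for i in range(num_candidates)
    ]

# Two reward signals (say accuracy and format) over three candidates:
combined = combine_rewards([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]], weights=[1.0, 0.2])
# combined ≈ [1.2, 0.2, 1.0]: accuracy dominates, format breaks ties
```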
Usage
Execute this workflow when you want to improve a model's reasoning through reinforcement learning rather than supervised fine-tuning. This is appropriate when you have datasets with verifiable answers (math problems with solutions, coding problems with test cases) and want the model to learn to reason through RL exploration. This workflow supports training with code execution rewards, making it suitable for competitive programming model training (IOI, Codeforces).
Execution Steps
Step 1: Environment_Setup
Prepare the environment with core dependencies plus any code execution provider libraries. For standard math GRPO, the base installation suffices. For code reward training, install the relevant sandbox SDK and set the provider credentials (an E2B or Morph API key, or access to Piston workers). Optionally launch router services for sandbox providers to manage rate limits during high-throughput training.
Key considerations:
- Code execution requires provider-specific setup (E2B API key, Morph API key, or Piston workers)
- Router services prevent rate limiting when many training processes execute code simultaneously
- For IOI/Codeforces training, Piston workers must be deployed on separate compute nodes
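A minimal pre-flight check for the provider setup above might look like this. The environment variable names are assumptions based on common provider conventions (verify against each provider's documentation), and `PISTON_ENDPOINTS` is a hypothetical placeholder for a worker-endpoint list.

```python
import os

# Pre-flight check before launching GRPO training with code rewards.
# Env var names are assumptions (E2B_API_KEY, MORPH_API_KEY follow the
# providers' usual conventions; PISTON_ENDPOINTS is hypothetical).
def check_code_reward_env(provider: str) -> list[str]:
    """Return the settings still missing for the chosen sandbox provider."""
    required = {
        "e2b": ["E2B_API_KEY"],
        "morph": ["MORPH_API_KEY"],
        "piston": ["PISTON_ENDPOINTS"],
    }
    return [var for var in required.get(provider, []) if not os.environ.get(var)]
```

Running this once at startup fails fast, instead of discovering a missing key only when the first code reward is computed mid-training.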
Step 2: Configuration_Preparation
Create a YAML configuration specifying the model, dataset, reward functions, and GRPO-specific hyperparameters. The configuration selects which reward functions to use (from the registry of 14 options) with associated weights, sets the number of generations per prompt, and configures vLLM settings. Critically, the chat template must be carefully set for distilled DeepSeek models to avoid interfering with format rewards.
Key considerations:
- Reward functions are selected by name from the registry (accuracy, format, tag_count, code, ioi_code, cf_code, etc.)
- Reward weights control the relative importance of each signal
- The chat template for DeepSeek models must be overridden to include reasoning block content
- System prompt guides the model to use think/answer format
- vLLM can run colocated (single node) or as a separate server (multi-node)
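A sketch of the kind of recipe such a configuration might contain, written as a Python dict for illustration (the real recipes are YAML files). Field names are modeled on TRL's GRPOConfig and the reward registry described above; the model and dataset names are examples, not prescriptions.

```python
# Illustrative GRPO recipe as a Python dict (actual recipes are YAML).
# Field names follow TRL's GRPOConfig conventions; values are examples.
grpo_recipe = {
    "model_name_or_path": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "dataset_name": "open-r1/OpenR1-Math-220k",
    "dataset_prompt_column": "problem",
    "system_prompt": "Reason inside <think>...</think>, then answer.",
    # Reward functions selected by registry name, with matching weights.
    "reward_funcs": ["accuracy", "format", "tag_count"],
    "reward_weights": [1.0, 0.2, 0.2],
    # GRPO-specific generation settings.
    "num_generations": 16,        # group size for relative advantages
    "temperature": 0.7,
    "use_vllm": True,             # colocated vLLM on a single node
    "learning_rate": 1.0e-6,
    "gradient_checkpointing": True,
}
```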
Step 3: Dataset_Loading_and_Formatting
Load the training dataset from the HuggingFace Hub and format each example into a conversation structure. The prompt column is mapped to a user message, with an optional system prompt prepended. For code training, the dataset must include a verification_info column with test cases. Dataset mixtures are supported for blending multiple problem sources.
Key considerations:
- The dataset_prompt_column config specifies which column contains the problem text
- Code datasets need a verification_info column with test cases and language specification
- For IOI/Codeforces datasets, additional metadata columns (subtask info, test case paths) are required
- The messages column is removed after formatting to avoid conflicts
Step 4: Model_Loading_and_Reward_Setup
Load the base model and tokenizer, then resolve reward functions from the registry. Each reward function string name is mapped to its callable implementation, with parameterized rewards (cosine, repetition penalty, soft overlong punishment) receiving their configuration values. Code execution providers are initialized based on the selected provider type and router URLs.
Key considerations:
- Reward functions are resolved dynamically from REWARD_FUNCS_REGISTRY
- Parameterized rewards use partial application with config-driven parameters
- Code execution providers support E2B, Morph, and Piston backends
- PEFT (LoRA) configuration can be applied for parameter-efficient training
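A toy version of the registry resolution with partial application: string names map to callables, and parameterized rewards are bound to their config values up front. The two reward functions here are simplified stand-ins for the 14 real entries, and the penalty logic is deliberately minimal.

```python
from functools import partial

# Stand-in reward functions: each takes a batch of completion strings
# and returns one score per completion.
def format_reward(completions, **kwargs):
    # 1.0 if the completion contains both think and answer blocks.
    return [1.0 if "<think>" in c and "<answer>" in c else 0.0 for c in completions]

def repetition_penalty_reward(completions, ngram_size, max_penalty, **kwargs):
    # Penalize duplicated n-grams (heavily simplified for illustration).
    rewards = []
    for c in completions:
        words = c.split()
        total = max(len(words) - ngram_size + 1, 1)
        unique = {tuple(words[i:i + ngram_size]) for i in range(total)}
        rewards.append(max_penalty * (1 - len(unique) / total))
    return rewards

# Parameterized rewards are bound via partial application, mirroring the
# config-driven setup described above (registry contents are stand-ins).
REWARD_FUNCS_REGISTRY = {
    "format": format_reward,
    "repetition_penalty": partial(repetition_penalty_reward,
                                  ngram_size=3, max_penalty=-1.0),
}

def get_reward_funcs(names):
    return [REWARD_FUNCS_REGISTRY[name] for name in names]
```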
Step 5: GRPO_Training_Loop
Launch the GRPOTrainer which orchestrates the training loop. For each batch: the policy generates num_generations responses per prompt using vLLM, reward functions score each response, and the GRPO algorithm computes advantages relative to the group mean and updates the policy. Training supports checkpoint resumption, gradient checkpointing, and periodic evaluation.
Key considerations:
- num_generations (typically 14-16) controls the group size for relative advantage computation
- vLLM handles fast parallel generation with configurable temperature
- Training can span multiple nodes with separate vLLM server node
- Gradient accumulation and checkpointing manage memory on large models
- W&B logging tracks rewards, completion lengths, and training metrics per step
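The group-relative advantage at the heart of GRPO can be shown numerically: each response's total reward is baselined against the mean over its num_generations siblings and normalized by the group's standard deviation. This follows the published GRPO formulation; TRL's internal implementation may differ in detail (e.g., scaling and clipping options).

```python
# Compute GRPO-style advantages for one prompt's group of sampled
# completions: subtract the group mean, divide by the group std.
def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, four sampled completions scored by the reward functions:
group_rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(group_rewards)
# Correct completions get positive advantage, incorrect ones negative,
# so the policy update needs no separate learned value baseline.
```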
Step 6: Model_Saving_and_Publishing
Save the trained model with aligned generation config. The EOS token is synchronized to prevent unbounded generation. A model card is created with training metadata and pushed to the HuggingFace Hub. Per-checkpoint Hub revisions can be enabled via the PushToHubRevisionCallback for fine-grained model selection.
Key considerations:
- Per-checkpoint publishing enables evaluating intermediate training states
- The KV cache is re-enabled for inference after training
- Hub revisions include step numbers for tracking training progress
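The save-time housekeeping above can be sketched on plain dicts standing in for the transformers config objects; the revision-naming format is an assumption for illustration, not the exact PushToHubRevisionCallback behavior.

```python
# Align configs before saving: sync the EOS token so inference stops,
# re-enable the KV cache, and derive a per-checkpoint Hub revision name.
# Dicts stand in for the real transformers config classes; the revision
# format ("<base>-step-<N>") is an illustrative assumption.
def prepare_for_save(model_cfg: dict, gen_cfg: dict, eos_token_id: int,
                     global_step: int, base_revision: str = "main"):
    gen_cfg["eos_token_id"] = eos_token_id   # prevent unbounded generation
    model_cfg["use_cache"] = True            # re-enable KV cache for inference
    revision = f"{base_revision}-step-{global_step}"
    return model_cfg, gen_cfg, revision
```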