Workflow: AllenAI open-instruct GRPO Reinforcement Learning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, Post_Training, RLVR |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
End-to-end process for reinforcement learning with verifiable rewards (RLVR) using Group Relative Policy Optimization (GRPO) with async vLLM generation and packed training.
Description
This workflow implements GRPO, the RL method introduced in DeepSeekMath and used to train DeepSeek R1, adapted for instruction-following and mathematical reasoning tasks. It uses a distributed architecture with separate inference workers (vLLM) and training workers (DeepSpeed) coordinated via Ray. For each prompt, the model generates multiple candidate responses, which are scored by verifiable reward functions (math correctness, instruction-following constraints). The advantage of each response is computed relative to its group's mean score, and the policy is updated using a clipped objective.
The primary implementation is grpo_fast.py, which uses sequence packing, asynchronous generation, and zero-gradient batch skipping for significant speedups. An older grpo_vllm_thread_ray_gtrl.py provides a more vanilla implementation.
Usage
Execute this workflow when you have a DPO-aligned model (or SFT model) and want to further improve it on tasks with verifiable rewards, such as mathematical problem solving, code generation, or instruction following with measurable constraints. This is typically the third and final stage of the Tulu post-training pipeline. It can also be used in a "Zero-style" setting starting from a base model with no SFT/DPO.
Execution Steps
Step 1: Environment_Setup
Prepare the distributed training environment with Ray, vLLM, and DeepSpeed. For multi-node setups, Ray must be initialized to connect all nodes into a cluster. The ray_node_setup.sh script handles leader/worker discovery and Ray cluster formation.
Key considerations:
- Ray is required for coordinating vLLM inference workers and training workers
- vLLM tensor parallelism size determines how inference GPUs are partitioned
- The split between inference GPUs and training GPUs is configured via num_learners_per_node and vllm_num_engines
- For a single 8-GPU node, a typical split is 6 training GPUs and 2 inference GPUs
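The GPU split described above can be sketched as a small bookkeeping function. The flag names mirror those mentioned in the text (num_learners_per_node, vllm_num_engines, and a tensor-parallel size); the validation logic itself is illustrative, not the exact open-instruct implementation.

```python
# Sketch of partitioning a node's GPUs between DeepSpeed learners and vLLM
# inference engines. Assumes one engine occupies tensor_parallel_size GPUs.

def partition_gpus(total_gpus: int, num_learners_per_node: int,
                   vllm_num_engines: int,
                   vllm_tensor_parallel_size: int = 1) -> tuple[int, int]:
    """Return (training_gpus, inference_gpus); raise if oversubscribed."""
    inference_gpus = vllm_num_engines * vllm_tensor_parallel_size
    training_gpus = num_learners_per_node
    if training_gpus + inference_gpus > total_gpus:
        raise ValueError("GPU split exceeds available devices")
    return training_gpus, inference_gpus

# Typical single-node split from the text: 6 training + 2 inference GPUs.
print(partition_gpus(8, 6, 2))  # -> (6, 2)
```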
Step 2: RLVR_Data_Loading
Load the RLVR dataset containing prompts with verifiable ground truth answers. The dataset mixer supports combining multiple RLVR sources (math, code, instruction following). Each prompt includes metadata needed by the reward verifier (ground truth answer, constraint type, etc.).
Key considerations:
- RLVR datasets contain prompts with verifiable ground truth (not preference pairs)
- Common datasets include RLVR-GSM-MATH-IF-Mixed-Constraints for math and IF tasks
- The dataset can be iterated for multiple epochs (num_epochs parameter)
- An evaluation split is also loaded for periodic in-training evaluation
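A minimal sketch of what the dataset mixer does conceptually: combine multiple RLVR sources into one shuffled prompt stream while keeping provenance for the verifier. The field names ("prompt", "ground_truth", "dataset") are assumptions for illustration, not the exact open-instruct schema.

```python
import random

# Illustrative RLVR dataset mixer: concatenate sources, tag each example with
# its origin (so the right verifier can be dispatched later), then shuffle.

def mix_datasets(sources: dict[str, list[dict]], seed: int = 0) -> list[dict]:
    mixed = []
    for name, examples in sources.items():
        for ex in examples:
            mixed.append({**ex, "dataset": name})
    random.Random(seed).shuffle(mixed)  # deterministic shuffle for resumption
    return mixed

math = [{"prompt": "2+2=?", "ground_truth": "4"}]
ifeval = [{"prompt": "Reply in all caps.", "ground_truth": "ALL_CAPS"}]
mixed = mix_datasets({"math": math, "if": ifeval})
print(len(mixed))  # -> 2
```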
Step 3: vLLM_Generation
For each batch of prompts, generate multiple candidate responses using vLLM inference engines running on dedicated GPUs. The generation runs asynchronously in a separate thread, overlapping with training computation on the training GPUs. This pipelining is a key performance optimization.
Key considerations:
- Each prompt generates num_samples_per_prompt_rollout responses (e.g., 16)
- vLLM provides high-throughput generation with continuous batching
- Temperature and stop token settings control generation behavior
- A non-stop penalty discourages responses that fail to emit a stop token (e.g., generations truncated at the length limit)
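The fan-out from prompts to rollouts can be sketched without vLLM itself: each prompt is simply duplicated num_samples_per_prompt_rollout times before being handed to the inference engines, which then sample one candidate per copy. This helper is illustrative, standing in for the request-duplication step, not the vLLM API.

```python
# Each prompt is repeated so the engine samples multiple candidate responses,
# forming the "group" that GRPO later normalizes over.

def expand_prompts(prompts: list[str],
                   num_samples_per_prompt_rollout: int) -> list[str]:
    return [p for p in prompts for _ in range(num_samples_per_prompt_rollout)]

rollout_prompts = expand_prompts(["What is 7*6?"], 16)
print(len(rollout_prompts))  # -> 16
```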
Step 4: Reward_Computation
Score each generated response using verifiable reward functions. For math tasks, the ground truth verifier extracts the final answer and checks correctness. For instruction following, constraint-specific verifiers check whether format requirements are met. Optionally, a format reward encourages proper response structure.
Key considerations:
- Verifiable rewards produce binary (0/1) scores for correctness
- Format rewards are only applied when the task reward is non-zero (by default)
- If all responses in a group score identically, the group has zero advantage and is skipped
- The real_batch_size_ratio metric tracks what fraction of groups have non-zero gradients
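A minimal sketch of verifiable reward scoring as described above: a binary math verifier plus a format bonus that is only granted when the task reward is non-zero. The answer-extraction regex, the "The answer is" marker, and the bonus value are illustrative assumptions, not the repo's actual verifiers.

```python
import re

def math_reward(response: str, ground_truth: str) -> float:
    """Binary correctness: extract the last number and compare it."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if nums and nums[-1] == ground_truth else 0.0

def total_reward(response: str, ground_truth: str,
                 format_bonus: float = 0.1) -> float:
    task = math_reward(response, ground_truth)
    # Format bonus only when the task reward is non-zero (the default gating).
    fmt = format_bonus if task > 0 and "The answer is" in response else 0.0
    return task + fmt

print(total_reward("The answer is 42", "42"))  # -> 1.1
print(total_reward("I think 41", "42"))        # -> 0.0
```

Gating the format bonus on task success prevents the policy from collecting free reward for well-formatted but wrong answers.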
Step 5: Advantage_Computation_and_Packing
Compute the group-relative advantage for each response by normalizing scores within each prompt group. Responses with zero advantage (from groups with uniform scores) are skipped. The remaining sequences are packed into efficient batches using sequence packing to minimize padding waste.
Key considerations:
- Advantage is computed as (score - group_mean) / (group_std + epsilon)
- Groups where all responses have the same score produce zero gradients and are skipped
- Sequence packing concatenates multiple shorter sequences into a single long sequence
- The packed_ratio metric shows the packing efficiency
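The advantage formula and the zero-gradient skip can be sketched directly from the bullets above; only the epsilon value is an assumption here.

```python
import statistics

# Group-relative advantage: (score - group_mean) / (group_std + epsilon).
def group_advantages(scores: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]

# A group contributes gradient only if its scores differ; uniform groups
# (all-correct or all-wrong) are skipped entirely.
def has_gradient(scores: list[float]) -> bool:
    return len(set(scores)) > 1

print(has_gradient([1.0, 1.0, 1.0]))  # -> False (group skipped)
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
print(round(sum(advs), 6))  # -> 0.0 (advantages are centered)
```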
Step 6: Policy_Update
Update the model policy using the GRPO objective with clipped importance sampling. The training uses DeepSpeed for memory-efficient distributed training. KL divergence from the reference policy is used as a regularizer. Multiple mini-batch updates can be performed per rollout batch.
Key considerations:
- The beta parameter controls the KL penalty strength
- Multiple KL estimators are available (kl, kl2, kl3)
- Gradient checkpointing reduces memory usage during backpropagation
- Training metrics include policy loss, KL divergence, entropy, and clipping fraction
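The clipped objective with a KL penalty can be written out for a single token. This is a pedagogical scalar sketch, not the packed tensor implementation: the clip range, beta, and the naive log-probability-difference KL estimator (the "kl" variant; kl2/kl3 differ) are stated assumptions.

```python
import math

def grpo_token_loss(logp_new: float, logp_old: float, logp_ref: float,
                    advantage: float, clip_eps: float = 0.2,
                    beta: float = 0.05) -> float:
    """One-token clipped surrogate plus beta-weighted KL regularizer."""
    ratio = math.exp(logp_new - logp_old)          # importance weight
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    policy_loss = -min(unclipped, clipped)         # PPO-style pessimism
    kl = logp_new - logp_ref                       # naive "kl" estimator
    return policy_loss + beta * kl

# With no policy movement the ratio is 1 and the loss is just -advantage.
print(grpo_token_loss(-1.0, -1.0, -1.0, advantage=0.5, beta=0.0))  # -> -0.5
```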
Step 7: Evaluation_and_Checkpointing
Periodically evaluate the model on held-out prompts and save checkpoints. Evaluations compute verifiable correct rate and other metrics on the evaluation split. Checkpoints are saved at configurable intervals and can trigger downstream evaluation jobs on Beaker.
Key considerations:
- local_eval_every controls how often in-training evaluation runs
- save_freq controls how often checkpoints are saved
- Beaker evaluation jobs can be auto-launched after checkpointing
- Training continues until total_episodes is reached
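The cadence logic of the training loop can be sketched as follows; all numbers in the example are made up, and the real loop interleaves this with async generation and Beaker job launches.

```python
# Evaluate every local_eval_every steps, checkpoint every save_freq steps,
# and stop once total_episodes prompts have been consumed.

def training_events(total_episodes: int, batch_size: int,
                    local_eval_every: int, save_freq: int):
    events, step, episodes = [], 0, 0
    while episodes < total_episodes:
        step += 1
        episodes += batch_size
        if step % local_eval_every == 0:
            events.append(("eval", step))
        if step % save_freq == 0:
            events.append(("save", step))
    return events

print(training_events(total_episodes=64, batch_size=8,
                      local_eval_every=4, save_freq=8))
# -> [('eval', 4), ('eval', 8), ('save', 8)]
```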