Workflow: AllenAI open-instruct GRPO Reinforcement Learning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, Post_Training, RLVR |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
End-to-end process for reinforcement learning with verifiable rewards (RLVR) using Group Relative Policy Optimization (GRPO) with async vLLM generation and packed training.
Description
This workflow implements GRPO, the RL method introduced in DeepSeekMath and used to train DeepSeek R1, adapted for instruction-following and mathematical reasoning tasks. It uses a distributed architecture with separate inference workers (vLLM) and training workers (DeepSpeed) coordinated via Ray. For each prompt, the model generates multiple candidate responses, which are scored by verifiable reward functions (math correctness, instruction-following constraints). The advantage of each response is computed relative to its group's mean score, and the policy is updated using a clipped objective.
The primary implementation is grpo_fast.py, which uses sequence packing, asynchronous generation, and zero-gradient batch skipping for significant speedups. An older grpo_vllm_thread_ray_gtrl.py provides a more vanilla implementation.
Usage
Execute this workflow when you have a DPO-aligned model (or SFT model) and want to further improve it on tasks with verifiable rewards, such as mathematical problem solving, code generation, or instruction following with measurable constraints. This is typically the third and final stage of the Tulu post-training pipeline. It can also be used in a "Zero-style" setting starting from a base model with no SFT/DPO.
Execution Steps
Step 1: Environment_Setup
Prepare the distributed training environment with Ray, vLLM, and DeepSpeed. For multi-node setups, Ray must be initialized to connect all nodes into a cluster. The ray_node_setup.sh script handles leader/worker discovery and Ray cluster formation.
Key considerations:
- Ray is required for coordinating vLLM inference workers and training workers
- vLLM tensor parallelism size determines how inference GPUs are partitioned
- The split between inference GPUs and training GPUs is configured via num_learners_per_node and vllm_num_engines
- For a single 8-GPU node, a typical split is 6 training GPUs and 2 inference GPUs
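The GPU split described above can be sketched as a small bookkeeping function. The flag names mirror those mentioned in the text (num_learners_per_node, vllm_num_engines, and a tensor-parallel size); the validation logic itself is illustrative, not the exact open-instruct implementation.

```python
# Sketch of partitioning a node's GPUs between DeepSpeed learners and vLLM
# inference engines. Assumes one engine occupies tensor_parallel_size GPUs.

def partition_gpus(total_gpus: int, num_learners_per_node: int,
                   vllm_num_engines: int,
                   vllm_tensor_parallel_size: int = 1) -> tuple[int, int]:
    """Return (training_gpus, inference_gpus); raise if oversubscribed."""
    inference_gpus = vllm_num_engines * vllm_tensor_parallel_size
    training_gpus = num_learners_per_node
    if training_gpus + inference_gpus > total_gpus:
        raise ValueError("GPU split exceeds available devices")
    return training_gpus, inference_gpus

# Typical single-node split from the text: 6 training + 2 inference GPUs.
print(partition_gpus(8, 6, 2))  # -> (6, 2)
```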
Step 2: RLVR_Data_Loading
Load the RLVR dataset containing prompts with verifiable ground truth answers. The dataset mixer supports combining multiple RLVR sources (math, code, instruction following). Each prompt includes metadata needed by the reward verifier (ground truth answer, constraint type, etc.).
Key considerations:
- RLVR datasets contain prompts with verifiable ground truth (not preference pairs)
- Common datasets include RLVR-GSM-MATH-IF-Mixed-Constraints for math and IF tasks
- The dataset can be iterated for multiple epochs (num_epochs parameter)
- An evaluation split is also loaded for periodic in-training evaluation
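A minimal sketch of what the dataset mixer does conceptually: combine multiple RLVR sources into one shuffled prompt stream while keeping provenance for the verifier. The field names ("prompt", "ground_truth", "dataset") are assumptions for illustration, not the exact open-instruct schema.

```python
import random

# Illustrative RLVR dataset mixer: concatenate sources, tag each example with
# its origin (so the right verifier can be dispatched later), then shuffle.

def mix_datasets(sources: dict[str, list[dict]], seed: int = 0) -> list[dict]:
    mixed = []
    for name, examples in sources.items():
        for ex in examples:
            mixed.append({**ex, "dataset": name})
    random.Random(seed).shuffle(mixed)  # deterministic shuffle for resumption
    return mixed

math = [{"prompt": "2+2=?", "ground_truth": "4"}]
ifeval = [{"prompt": "Reply in all caps.", "ground_truth": "ALL_CAPS"}]
mixed = mix_datasets({"math": math, "if": ifeval})
print(len(mixed))  # -> 2
```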
Step 3: vLLM_Generation
For each batch of prompts, generate multiple candidate responses using vLLM inference engines running on dedicated GPUs. The generation runs asynchronously in a separate thread, overlapping with training computation on the training GPUs. This pipelining is a key performance optimization.
Key considerations:
- Each prompt generates num_samples_per_prompt_rollout responses (e.g., 16)
- vLLM provides high-throughput generation with continuous batching
- Temperature and stop token settings control generation behavior
- A non-stop penalty discourages responses that fail to emit a stop token (e.g., generations truncated at the length limit)
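The fan-out from prompts to rollouts can be sketched without vLLM itself: each prompt is simply duplicated num_samples_per_prompt_rollout times before being handed to the inference engines, which then sample one candidate per copy. This helper is illustrative, standing in for the request-duplication step, not the vLLM API.

```python
# Each prompt is repeated so the engine samples multiple candidate responses,
# forming the "group" that GRPO later normalizes over.

def expand_prompts(prompts: list[str],
                   num_samples_per_prompt_rollout: int) -> list[str]:
    return [p for p in prompts for _ in range(num_samples_per_prompt_rollout)]

rollout_prompts = expand_prompts(["What is 7*6?"], 16)
print(len(rollout_prompts))  # -> 16
```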
Step 4: Reward_Computation
Score each generated response using verifiable reward functions. For math tasks, the ground truth verifier extracts the final answer and checks correctness. For instruction following, constraint-specific verifiers check whether format requirements are met. Optionally, a format reward encourages proper response structure.
Key considerations:
- Verifiable rewards produce binary (0/1) scores for correctness
- Format rewards are only applied when the task reward is non-zero (by default)
- If all responses in a group score identically, the group has zero advantage and is skipped
- The real_batch_size_ratio metric tracks what fraction of groups have non-zero gradients
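A minimal sketch of verifiable reward scoring as described above: a binary math verifier plus a format bonus that is only granted when the task reward is non-zero. The answer-extraction regex, the "The answer is" marker, and the bonus value are illustrative assumptions, not the repo's actual verifiers.

```python
import re

def math_reward(response: str, ground_truth: str) -> float:
    """Binary correctness: extract the last number and compare it."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if nums and nums[-1] == ground_truth else 0.0

def total_reward(response: str, ground_truth: str,
                 format_bonus: float = 0.1) -> float:
    task = math_reward(response, ground_truth)
    # Format bonus only when the task reward is non-zero (the default gating).
    fmt = format_bonus if task > 0 and "The answer is" in response else 0.0
    return task + fmt

print(total_reward("The answer is 42", "42"))  # -> 1.1
print(total_reward("I think 41", "42"))        # -> 0.0
```

Gating the format bonus on task success prevents the policy from collecting free reward for well-formatted but wrong answers.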
Step 5: Advantage_Computation_and_Packing
Compute the group-relative advantage for each response by normalizing scores within each prompt group. Responses with zero advantage (from groups with uniform scores) are skipped. The remaining sequences are packed into efficient batches using sequence packing to minimize padding waste.
Key considerations:
- Advantage is computed as (score - group_mean) / (group_std + epsilon)
- Groups where all responses have the same score produce zero gradients and are skipped
- Sequence packing concatenates multiple shorter sequences into a single long sequence
- The packed_ratio metric shows the packing efficiency
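The advantage formula and the zero-gradient skip can be sketched directly from the bullets above; only the epsilon value is an assumption here.

```python
import statistics

# Group-relative advantage: (score - group_mean) / (group_std + epsilon).
def group_advantages(scores: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]

# A group contributes gradient only if its scores differ; uniform groups
# (all-correct or all-wrong) are skipped entirely.
def has_gradient(scores: list[float]) -> bool:
    return len(set(scores)) > 1

print(has_gradient([1.0, 1.0, 1.0]))  # -> False (group skipped)
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
print(round(sum(advs), 6))  # -> 0.0 (advantages are centered)
```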
Step 6: Policy_Update
Update the model policy using the GRPO objective with clipped importance sampling. The training uses DeepSpeed for memory-efficient distributed training. KL divergence from the reference policy is used as a regularizer. Multiple mini-batch updates can be performed per rollout batch.
Key considerations:
- The beta parameter controls the KL penalty strength
- Multiple KL estimators are available (kl, kl2, kl3)
- Gradient checkpointing reduces memory usage during backpropagation
- Training metrics include policy loss, KL divergence, entropy, and clipping fraction
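The clipped objective with a KL penalty can be written out for a single token. This is a pedagogical scalar sketch, not the packed tensor implementation: the clip range, beta, and the naive log-probability-difference KL estimator (the "kl" variant; kl2/kl3 differ) are stated assumptions.

```python
import math

def grpo_token_loss(logp_new: float, logp_old: float, logp_ref: float,
                    advantage: float, clip_eps: float = 0.2,
                    beta: float = 0.05) -> float:
    """One-token clipped surrogate plus beta-weighted KL regularizer."""
    ratio = math.exp(logp_new - logp_old)          # importance weight
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    policy_loss = -min(unclipped, clipped)         # PPO-style pessimism
    kl = logp_new - logp_ref                       # naive "kl" estimator
    return policy_loss + beta * kl

# With no policy movement the ratio is 1 and the loss is just -advantage.
print(grpo_token_loss(-1.0, -1.0, -1.0, advantage=0.5, beta=0.0))  # -> -0.5
```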
Step 7: Evaluation_and_Checkpointing
Periodically evaluate the model on held-out prompts and save checkpoints. Evaluations compute verifiable correct rate and other metrics on the evaluation split. Checkpoints are saved at configurable intervals and can trigger downstream evaluation jobs on Beaker.
Key considerations:
- local_eval_every controls how often in-training evaluation runs
- save_freq controls how often checkpoints are saved
- Beaker evaluation jobs can be auto-launched after checkpointing
- Training continues until total_episodes is reached
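The cadence logic of the training loop can be sketched as follows; all numbers in the example are made up, and the real loop interleaves this with async generation and Beaker job launches.

```python
# Evaluate every local_eval_every steps, checkpoint every save_freq steps,
# and stop once total_episodes prompts have been consumed.

def training_events(total_episodes: int, batch_size: int,
                    local_eval_every: int, save_freq: int):
    events, step, episodes = [], 0, 0
    while episodes < total_episodes:
        step += 1
        episodes += batch_size
        if step % local_eval_every == 0:
            events.append(("eval", step))
        if step % save_freq == 0:
            events.append(("save", step))
    return events

print(training_events(total_episodes=64, batch_size=8,
                      local_eval_every=4, save_freq=8))
# -> [('eval', 4), ('eval', 8), ('save', 8)]
```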