Workflow:OpenRLHF OpenRLHF Math Reasoning Training

Knowledge Sources	OpenRLHF Ray vLLM DeepSpeed
Domains	LLMs, RLHF, Math_Reasoning, REINFORCE
Last Updated	2026-02-07 10:00 GMT

Overview

End-to-end process for training mathematical reasoning capabilities in language models using REINFORCE++ with custom math reward verification and extended generation.

Description

This workflow trains language models to solve mathematical problems through prolonged RL training with verified reward signals. It uses REINFORCE++-baseline (or GRPO) as the advantage estimator, which removes the need for a critic network. A custom math reward function verifies answers by parsing boxed LaTeX expressions and comparing them against ground truth labels, producing binary (correct/incorrect) rewards. Training supports extended generation (8192 tokens or more) for chain-of-thought reasoning, dynamic filtering to skip uninformative samples, and importance sampling correction (ICEPOP) for robustness. This combination is representative of current practice for training reasoning models.

Usage

Execute this workflow when you want to train a language model to solve math problems or develop extended reasoning capabilities. This is the workflow used for producing reasoning specialists similar to DeepSeek-R1. It requires a dataset of math problems with ground truth answers and a custom reward function that can verify correctness. The approach works best with models that already have basic instruction-following ability.

Execution Steps

Step 1: Initialize Ray cluster with hybrid engine

Start the Ray runtime and configure GPU placement for the hybrid engine mode. In this workflow, all models (Actor, Reference) share the same GPU set, cycling between training and inference phases. No separate critic or reward model nodes are needed since REINFORCE++ eliminates the critic and rewards are computed via a custom function.

Key considerations:

Hybrid engine colocation reduces total GPU requirements
Sleep/wake memory management enables GPU sharing between training and inference
REINFORCE++-baseline requires no critic network, simplifying resource allocation

Step 2: Create vLLM engines with extended generation

Initialize vLLM engines configured for extended generation lengths (8192+ tokens). Enable features like ring attention for long-context support and dynamic batching with token-level limits.

Key considerations:

Extended generation length (8192+) is critical for chain-of-thought reasoning
Ring attention enables efficient processing of very long sequences
Max token limits per batch prevent OOM from variable-length generations
Temperature and top-p control the diversity of reasoning chains
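
The per-batch token limit mentioned above can be illustrated with a small sketch. This is not the OpenRLHF or vLLM implementation; the function name `batch_by_token_budget` and its parameters are hypothetical, and the sketch assumes a simple greedy packing that counts raw sequence lengths without padding.

```python
from typing import List


def batch_by_token_budget(lengths: List[int], max_tokens_per_batch: int) -> List[List[int]]:
    """Greedily group sequence indices so each batch stays within a token budget.

    Hypothetical helper for illustration only. `lengths[i]` is the token count
    of sequence i; the result is a list of batches, each a list of indices.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx, n_tokens in enumerate(lengths):
        if n_tokens > max_tokens_per_batch:
            # A single sequence longer than the budget can never fit.
            raise ValueError(f"sequence {idx} has {n_tokens} tokens, above the budget")
        # Close the current batch if adding this sequence would exceed the budget.
        if current and current_tokens + n_tokens > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches
```

With variable-length reasoning traces, a fixed number of sequences per batch can blow past memory when several long traces land together; bounding the token count instead keeps the worst case predictable.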

Step 3: Configure custom math reward function

Point the remote reward model URL at the custom math reward function. The reward function receives the generated responses, extracts the final answer from the boxed LaTeX expression, and compares it against the ground truth label. It returns a binary reward (1.0 for correct, 0.0 for incorrect).

Key considerations:

The reward function must parse the model's output format (e.g., \boxed{answer})
Binary rewards provide a clean signal but are sparse
Multiple samples per prompt (e.g., 16) help overcome reward sparsity
Dynamic filtering removes prompts where all or no samples are correct (uninformative)
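
A minimal sketch of the answer check described in this step, assuming the final answer is the last `\boxed{...}` in the response and that correctness is an exact string match after light normalization. The function names (`extract_last_boxed`, `normalize_answer`, `math_reward`) are hypothetical and do not reflect the signature OpenRLHF expects from a reward function; a production verifier would typically also handle mathematically equivalent forms.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in text, or None if missing or unbalanced."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # braces never closed


def normalize_answer(answer: str) -> str:
    """Strip whitespace and a trailing period (an illustrative, deliberately simple rule)."""
    return answer.replace(" ", "").strip().rstrip(".")


def math_reward(response: str, label: str) -> float:
    """Binary reward: 1.0 if the boxed answer matches the label, else 0.0."""
    predicted = extract_last_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize_answer(predicted) == normalize_answer(label) else 0.0
```

Brace counting is used instead of a regular expression so that nested expressions such as `\boxed{\frac{1}{2}}` are extracted whole.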

Step 4: Generate reasoning traces

Generate multiple candidate solutions per math problem (typically 8-16 per prompt). The model produces extended chain-of-thought reasoning traces ending with a boxed answer. vLLM handles efficient batched generation across the prompt dataset.

Key considerations:

Higher samples per prompt (16) improve advantage estimation quality
Temperature settings affect reasoning diversity
Long generation lengths require careful memory management

Step 5: Score and compute advantages

Apply the custom math reward function to all generated responses. Compute REINFORCE++-baseline advantages using group-level reward statistics (mean and variance across samples for each prompt). Apply dynamic filtering to remove prompts where all responses received the same reward (no learning signal).

Key considerations:

REINFORCE++-baseline uses the mean reward across samples as the baseline
Group normalization (GRPO variant) normalizes advantages within each prompt group
Dynamic filtering with reward range [0, 1] removes trivial prompt groups
KL divergence loss provides additional regularization
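
The group-level computations in this step can be sketched as follows. This is an illustrative sketch rather than the OpenRLHF code: the function names are hypothetical, the standard deviation is the population form (implementations differ on this), any further batch-level normalization is omitted, and the filter assumes a group is kept only when its mean reward lies strictly inside the configured range.

```python
from statistics import mean, pstdev
from typing import List


def keep_group(rewards: List[float], low: float = 0.0, high: float = 1.0) -> bool:
    """Dynamic filtering: keep a prompt group only if its mean reward is strictly inside (low, high).

    With binary rewards and range [0, 1], this drops groups that are all wrong or all correct.
    """
    return low < mean(rewards) < high


def baseline_advantages(rewards: List[float]) -> List[float]:
    """Subtract the group mean reward (the baseline) from each sample's reward."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def group_normalized_advantages(rewards: List[float], eps: float = 1e-9) -> List[float]:
    """GRPO-style variant: subtract the group mean and divide by the group standard deviation."""
    baseline = mean(rewards)
    std = pstdev(rewards)
    return [(r - baseline) / (std + eps) for r in rewards]
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, the baseline is 0.5, so correct samples get a positive advantage and incorrect ones a negative advantage of equal size. A group with identical rewards would yield all-zero advantages, which is why filtering it out loses nothing.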

Step 6: Train policy with importance sampling

Update the Actor policy using the computed advantages. Apply ICEPOP (importance sampling correction) to handle staleness between generation and training. Use a KL loss with the k2 estimator for stable divergence control. Train for multiple epochs on each rollout batch.

Key considerations:

ICEPOP corrects for the distribution shift between generation and training
KL loss coefficient controls the balance between reward optimization and reference adherence
Clip range can be asymmetric (Clip-Higher) to encourage exploration
Monitor math accuracy on evaluation sets (e.g., AIME-2024) during training
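
The per-token pieces of this update can be sketched as below. Several parts are assumptions made for illustration: the function names are hypothetical, the clip values (0.2 and 0.28) and the ratio band (0.5 to 2.0) are example numbers, and the importance sampling correction is modeled as a mask that drops tokens whose probability ratio between the training engine and the rollout engine falls outside the band. The k2 estimator is the standard half squared log-ratio.

```python
import math


def clipped_surrogate(log_prob_new: float, log_prob_old: float, advantage: float,
                      clip_low: float = 0.2, clip_high: float = 0.28) -> float:
    """PPO-style clipped loss for one token, with separate lower and upper clip ranges."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    # Negative sign: minimizing the loss maximizes the clipped objective.
    return -min(ratio * advantage, clipped * advantage)


def k2_kl(log_prob_policy: float, log_prob_ref: float) -> float:
    """k2 estimator of KL divergence for one token: half the squared log-ratio."""
    log_ratio = log_prob_policy - log_prob_ref
    return 0.5 * log_ratio ** 2


def importance_mask(log_prob_train: float, log_prob_rollout: float,
                    low: float = 0.5, high: float = 2.0) -> float:
    """Assumed correction: 1.0 if the train/rollout probability ratio is within [low, high], else 0.0."""
    ratio = math.exp(log_prob_train - log_prob_rollout)
    return 1.0 if low <= ratio <= high else 0.0


def token_loss(log_prob_new: float, log_prob_old: float, log_prob_rollout: float,
               log_prob_ref: float, advantage: float, kl_coef: float = 0.01) -> float:
    """Combine the masked policy term with the KL penalty for one token."""
    mask = importance_mask(log_prob_old, log_prob_rollout)
    policy_term = clipped_surrogate(log_prob_new, log_prob_old, advantage)
    kl_term = kl_coef * k2_kl(log_prob_new, log_prob_ref)
    return mask * (policy_term + kl_term)
```

A larger upper clip range lets the probability of positively rewarded tokens grow further before clipping takes effect, which is the exploration effect attributed to Clip-Higher above.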

Step 7: Evaluate and checkpoint

Periodically evaluate the model on held-out math benchmarks. Save checkpoints at regular intervals for tracking training progress and enabling recovery.

Key considerations:

Evaluation on competition math (AIME, MATH) tracks reasoning improvement
Checkpointing every few steps enables recovery from training instabilities
Prolonged training (100+ episodes) is typical for reasoning tasks
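
A small sketch of the bookkeeping for this step, assuming accuracy is averaged over several sampled answers per benchmark problem and that checkpoints are saved at a fixed step interval. The helper names `mean_accuracy` and `should_checkpoint` are hypothetical.

```python
from typing import List


def mean_accuracy(correct_per_problem: List[List[bool]]) -> float:
    """Average per-problem accuracy, where each inner list holds one flag per sampled answer."""
    per_problem = [sum(flags) / len(flags) for flags in correct_per_problem]
    return sum(per_problem) / len(per_problem)


def should_checkpoint(step: int, save_every: int) -> bool:
    """Save at every `save_every`-th training step, skipping step 0."""
    return step > 0 and step % save_every == 0
```

Averaging over several samples per problem reduces the noise of small benchmarks, where a single problem can shift the score noticeably.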
