Workflow: Alibaba ROLL RLVR Training Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
End-to-end process for training Large Language Models using Reinforcement Learning with Verifiable Rewards (RLVR) across multiple domains including math, code, and general reasoning.
Description
This workflow implements the core RLVR training pipeline in the ROLL framework. It trains an LLM policy using reinforcement learning where rewards come from verifiable, rule-based evaluation functions rather than learned reward models. The pipeline orchestrates multiple distributed worker roles (actor, critic, reference, reward) across a Ray cluster, generating responses via high-throughput inference engines (vLLM or SGLang), computing domain-specific rewards, and updating the policy using algorithms such as GRPO, Reinforce++, PPO, or TOPR. Multi-domain training is supported through dynamic domain interleaving with configurable sampling probabilities.
Usage
Execute this workflow when you have a base or instruction-tuned LLM (e.g., Qwen2.5-7B) and domain-specific prompt datasets with verifiable reward functions (math correctness checks, code sandbox execution, rule-based evaluation). Use it to improve the model's reasoning capabilities through online reinforcement learning on a GPU cluster (8 or more GPUs recommended).
Execution Steps
Step 1: Environment Setup and Configuration
Prepare the compute environment by installing ROLL in a Docker container with the appropriate GPU drivers and inference engine (vLLM or SGLang). Define the training configuration in a Hydra YAML file specifying the model, dataset paths, worker device mappings, distributed strategy backends, RL algorithm parameters, and reward worker configurations.
Key considerations:
- Select the appropriate distributed training backend (Megatron-Core, DeepSpeed ZeRO, or FSDP2) based on model size and GPU count
- Configure device mappings to allocate GPUs between training, inference, reference, and reward workers
- Set rollout_batch_size, num_return_sequences_in_group, and response_length based on available GPU memory
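The considerations above can be sketched as a Hydra-style YAML fragment. The key names follow the parameters named in this step (rollout_batch_size, num_return_sequences_in_group, response_length, device mappings, strategy backends); the authoritative schema is the ROLL example configs, which may differ by version.

```yaml
# Illustrative RLVR config fragment -- key names mirror this workflow's
# terminology; consult the ROLL example configs for the exact schema.
rollout_batch_size: 64
num_return_sequences_in_group: 8
prompt_length: 2048
response_length: 8192

actor_train:
  strategy_args:
    strategy_name: megatron_train     # or a DeepSpeed / FSDP2 backend
  device_mapping: list(range(0, 8))
actor_infer:
  strategy_args:
    strategy_name: vllm               # or sglang
  device_mapping: list(range(0, 8))   # colocated with training workers
reference:
  strategy_args:
    strategy_name: megatron_infer
  device_mapping: list(range(0, 8))
```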
Step 2: Dataset Preparation
Prepare multi-domain prompt datasets in JSONL format with domain tags. Each domain requires a corresponding reward function (math rule, code sandbox, LLM judge, IFEval, etc.). Configure domain interleave probabilities to control the sampling ratio across domains during training.
Key considerations:
- Each prompt must include a domain tag that routes it to the correct reward worker
- Domain interleave probabilities should sum to 1.0 and reflect training priorities
- Prompts are tokenized using the model's chat template with configurable prompt_length limits
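A minimal sketch of the dataset shape described above: JSONL records carrying a domain tag that routes each prompt to its reward worker, plus interleave probabilities that must sum to 1.0. The field names (`prompt`, `ground_truth`, `tag`) are illustrative, not a guaranteed ROLL schema.

```python
import json

# Hypothetical JSONL prompt records; the "tag" field routes each prompt
# to the matching domain reward worker.
records = [
    {"prompt": "Solve: 3x + 5 = 20. Give x.", "ground_truth": "5", "tag": "math"},
    {"prompt": "Write a function that reverses a string.", "tag": "code"},
    {"prompt": "Summarize the causes of inflation.", "tag": "general"},
]
jsonl = "\n".join(json.dumps(r) for r in records)

# Domain interleave probabilities control the sampling ratio across
# domains and should sum to 1.0.
interleave_probs = {"math": 0.5, "code": 0.3, "general": 0.2}
assert abs(sum(interleave_probs.values()) - 1.0) < 1e-9
```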
Step 3: Distributed Worker Initialization
Launch the Ray cluster and initialize distributed worker groups: actor training cluster (policy optimization), actor inference cluster (response generation), reference model cluster (KL divergence computation), and one or more reward worker clusters (domain-specific reward evaluation). Each cluster loads the model with its designated strategy backend.
What happens:
- Actor training workers load the model with the training strategy (Megatron, DeepSpeed, or FSDP2)
- Actor inference workers load the model with a high-throughput inference engine (vLLM or SGLang)
- Reference workers load a frozen copy of the initial policy for KL penalty computation
- Reward workers initialize domain-specific evaluators (math parsers, code sandboxes, LLM judges)
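The four worker roles can be summarized in a small sketch. `WorkerGroup` and the role/backend strings below are illustrative stand-ins for ROLL's Ray-based cluster abstractions, not its actual API; the point is that only the actor training cluster holds trainable weights.

```python
from dataclasses import dataclass

@dataclass
class WorkerGroup:
    role: str        # "actor_train", "actor_infer", "reference", "reward/*"
    backend: str     # strategy backend or evaluator type (illustrative)
    gpus: list       # GPU indices from the config's device mapping
    trainable: bool  # only the actor training cluster updates weights

clusters = [
    WorkerGroup("actor_train", "megatron", list(range(8)), True),
    WorkerGroup("actor_infer", "vllm", list(range(8)), False),       # colocated
    WorkerGroup("reference", "megatron_infer", list(range(8)), False),
    WorkerGroup("reward/math", "rule_reward", [], False),            # CPU-only
]
trainable = [c.role for c in clusters if c.trainable]
```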
Step 4: Response Generation (Rollout)
Sample a batch of prompts from the multi-domain dataset and generate multiple response sequences per prompt using the actor inference engine. The generation uses configurable sampling parameters (temperature, top-p) and produces num_return_sequences_in_group responses per prompt for variance reduction.
Key considerations:
- Inference workers use offload/reload cycles to share GPUs with training workers in colocated mode
- Dynamic sampling schedules generation across multiple inference workers with load balancing
- Difficulty masking can filter out prompts that are too easy or too hard based on historical reward statistics
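The difficulty-masking idea above can be sketched as a filter over historical per-prompt reward statistics. The thresholds here are illustrative, not ROLL defaults: prompts the policy always fails or always solves contribute little gradient signal under group-relative baselines.

```python
# Sketch of difficulty masking over historical mean rewards per prompt.
def difficulty_mask(prompt_stats, low=0.05, high=0.95):
    """Keep prompts whose historical mean reward is away from 0 and 1;
    saturated prompts yield near-zero group-relative advantages."""
    return {p: s for p, s in prompt_stats.items() if low <= s <= high}

# Hypothetical historical pass rates keyed by prompt id.
stats = {"p1": 0.0, "p2": 0.5, "p3": 1.0, "p4": 0.3}
kept = difficulty_mask(stats)   # p1 (too hard) and p3 (too easy) dropped
```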
Step 5: Reward Computation
Route generated responses to the appropriate domain-specific reward workers based on their domain tags. Each reward worker evaluates the response using its verification method: math rule checking extracts and validates answers, code sandbox executes generated code against test cases, LLM judge scores open-ended responses, and IFEval checks instruction-following compliance.
What happens:
- Responses are dispatched to reward workers based on domain tags
- Each reward worker computes a scalar reward per response
- Rewards are collected, clipped (reward_clip), and optionally normalized per domain
- Response-level rewards are aggregated back into the training batch
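Reward routing can be sketched as a domain-to-function registry. The verifiers below are toy stand-ins for ROLL's math rule / sandbox / judge workers, and the clipping bound is illustrative of the reward_clip step.

```python
import re

# Toy verifiers standing in for real reward workers (assumptions, not
# ROLL's implementations).
def math_reward(response, ground_truth):
    """Extract a trailing number and compare against the reference answer."""
    m = re.search(r"(-?\d+(?:\.\d+)?)\s*$", response.strip())
    return 1.0 if m and m.group(1) == ground_truth else 0.0

def length_reward(response, _gt=None):
    """Trivial rule-based check used here as a placeholder evaluator."""
    return 1.0 if len(response) < 200 else 0.0

REWARD_WORKERS = {"math": math_reward, "general": length_reward}

def score(sample, reward_clip=10.0):
    fn = REWARD_WORKERS[sample["tag"]]          # dispatch by domain tag
    raw = fn(sample["response"], sample.get("ground_truth"))
    return max(-reward_clip, min(reward_clip, raw))  # clip the scalar reward

r = score({"tag": "math", "response": "The answer is 5", "ground_truth": "5"})
```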
Step 6: Advantage Estimation and KL Penalty
Compute reference model log probabilities for KL divergence penalty. Calculate token-level rewards by combining response rewards with per-token KL penalties (init_kl_coef controls the penalty strength). Estimate advantages using the configured algorithm (GRPO group-relative normalization, GAE for PPO, or batch-level normalization for Reinforce++).
Key considerations:
- KL coefficient can be adaptively adjusted based on a target KL divergence
- Advantages are optionally whitened (zero mean, unit variance) for training stability
- Advantage clipping bounds extreme advantage values to prevent destabilizing updates
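The GRPO path described above can be sketched numerically: group-relative normalization of response rewards, and a token-level reward stream that subtracts a KL penalty at every token and adds the scalar reward on the final token. This is the common GRPO / PPO-with-KL formulation, not necessarily ROLL's exact code; `init_kl_coef` follows the name used in this step.

```python
import math

def grpo_advantages(group_rewards, eps=1e-8):
    """Normalize rewards within a group of responses to the same prompt."""
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in group_rewards]

def token_rewards(final_reward, logp_policy, logp_ref, init_kl_coef=0.1):
    """Per-token KL penalty; the scalar reward lands on the last token."""
    kl = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    rewards = [-init_kl_coef * k for k in kl]
    rewards[-1] += final_reward
    return rewards

adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])   # two passes, two fails
tr = token_rewards(1.0, [-1.0, -2.0], [-1.1, -2.2])
```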
Step 7: Policy Optimization
Update the actor model parameters using the computed advantages and the selected RL algorithm's loss function. The training step applies gradient accumulation across micro-batches, clips the policy ratio (PPO-style), and optionally updates the critic model (for PPO with GAE). Gradient norms are tracked and clipped for stability.
What happens:
- Forward pass computes current log probabilities under the updated policy
- Loss is computed using the algorithm-specific objective (clipped surrogate, GRPO, etc.)
- Gradients are accumulated across micro-batches and DP ranks
- Optimizer step updates model weights with learning rate scheduling
- Updated weights are synchronized to inference workers for the next rollout
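The PPO-style ratio clipping mentioned above can be sketched as the standard clipped-surrogate objective over per-token log probabilities. The clip range of 0.2 is a common default, not a ROLL-specific value, and the micro-batch machinery is omitted.

```python
import math

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_range=0.2):
    """Mean clipped-surrogate loss; minimizing it maximizes the surrogate."""
    losses = []
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                 # pi_new / pi_old per token
        unclipped = ratio * a
        clipped = max(min(ratio, 1 + clip_range), 1 - clip_range) * a
        losses.append(-min(unclipped, clipped))   # pessimistic bound
    return sum(losses) / len(losses)

loss = clipped_surrogate_loss([-1.0, -2.0], [-1.2, -1.8], [1.0, -0.5])
```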
Step 8: Evaluation and Checkpointing
Periodically evaluate the policy on a held-out validation dataset by generating responses and computing reward metrics. Save model checkpoints at configured intervals for recovery and deployment. Log training metrics (rewards, KL divergence, loss, advantages) to the configured tracker (TensorBoard, Weights and Biases, or SwanLab).
Key considerations:
- Validation uses greedy decoding (or low-temperature sampling) so that evaluation results are reproducible across runs
- Checkpoints include both the model weights and optimizer state for training resumption
- Megatron checkpoints can be converted to HuggingFace format for deployment
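Per-domain validation aggregation can be sketched as grouping rewards by domain tag and logging one mean metric per domain. The metric naming scheme (`val/<domain>/mean_reward`) is illustrative, not a fixed ROLL convention.

```python
from collections import defaultdict

def eval_metrics(samples):
    """samples: list of {"tag": domain, "reward": float} from validation decoding."""
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[s["tag"]].append(s["reward"])
    # One mean-reward metric per domain, ready for the configured tracker.
    return {f"val/{d}/mean_reward": sum(v) / len(v) for d, v in by_domain.items()}

m = eval_metrics([
    {"tag": "math", "reward": 1.0},
    {"tag": "math", "reward": 0.0},
    {"tag": "code", "reward": 1.0},
])
```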