Workflow: Alibaba ROLL RLVR Training Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
End-to-end process for training Large Language Models using Reinforcement Learning with Verifiable Rewards (RLVR) across multiple domains including math, code, and general reasoning.
Description
This workflow implements the core RLVR training pipeline in the ROLL framework. It trains an LLM policy using reinforcement learning where rewards come from verifiable, rule-based evaluation functions rather than learned reward models. The pipeline orchestrates multiple distributed worker roles (actor, critic, reference, reward) across a Ray cluster, generating responses via high-throughput inference engines (vLLM or SGLang), computing domain-specific rewards, and updating the policy using algorithms such as GRPO, Reinforce++, PPO, or TOPR. Multi-domain training is supported through dynamic domain interleaving with configurable sampling probabilities.
Usage
Execute this workflow when you have a base or instruction-tuned LLM (e.g., Qwen2.5-7B) and domain-specific prompt datasets with verifiable reward functions (math correctness checks, code sandbox execution, rule-based evaluation). Use it to improve the model's reasoning capabilities through online reinforcement learning on a GPU cluster (8 or more GPUs recommended).
Execution Steps
Step 1: Environment Setup and Configuration
Prepare the compute environment by installing ROLL in a Docker container with the appropriate GPU drivers and inference engine (vLLM or SGLang). Define the training configuration in a Hydra YAML file specifying the model, dataset paths, worker device mappings, distributed strategy backends, RL algorithm parameters, and reward worker configurations.
Key considerations:
- Select the appropriate distributed training backend (Megatron-Core, DeepSpeed ZeRO, or FSDP2) based on model size and GPU count
- Configure device mappings to allocate GPUs between training, inference, reference, and reward workers
- Set rollout_batch_size, num_return_sequences_in_group, and response_length based on available GPU memory
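The considerations above can be sketched as a Hydra-style YAML fragment. The key names follow the parameters named in this step (rollout_batch_size, num_return_sequences_in_group, response_length, device mappings, strategy backends); the authoritative schema is the ROLL example configs, which may differ by version.

```yaml
# Illustrative RLVR config fragment -- key names mirror this workflow's
# terminology; consult the ROLL example configs for the exact schema.
rollout_batch_size: 64
num_return_sequences_in_group: 8
prompt_length: 2048
response_length: 8192

actor_train:
  strategy_args:
    strategy_name: megatron_train     # or a DeepSpeed / FSDP2 backend
  device_mapping: list(range(0, 8))
actor_infer:
  strategy_args:
    strategy_name: vllm               # or sglang
  device_mapping: list(range(0, 8))   # colocated with training workers
reference:
  strategy_args:
    strategy_name: megatron_infer
  device_mapping: list(range(0, 8))
```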
Step 2: Dataset Preparation
Prepare multi-domain prompt datasets in JSONL format with domain tags. Each domain requires a corresponding reward function (math rule, code sandbox, LLM judge, IFEval, etc.). Configure domain interleave probabilities to control the sampling ratio across domains during training.
Key considerations:
- Each prompt must include a domain tag that routes it to the correct reward worker
- Domain interleave probabilities should sum to 1.0 and reflect training priorities
- Prompts are tokenized using the model's chat template with configurable prompt_length limits
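A minimal sketch of the dataset shape described above: JSONL records carrying a domain tag that routes each prompt to its reward worker, plus interleave probabilities that must sum to 1.0. The field names (`prompt`, `ground_truth`, `tag`) are illustrative, not a guaranteed ROLL schema.

```python
import json

# Hypothetical JSONL prompt records; the "tag" field routes each prompt
# to the matching domain reward worker.
records = [
    {"prompt": "Solve: 3x + 5 = 20. Give x.", "ground_truth": "5", "tag": "math"},
    {"prompt": "Write a function that reverses a string.", "tag": "code"},
    {"prompt": "Summarize the causes of inflation.", "tag": "general"},
]
jsonl = "\n".join(json.dumps(r) for r in records)

# Domain interleave probabilities control the sampling ratio across
# domains and should sum to 1.0.
interleave_probs = {"math": 0.5, "code": 0.3, "general": 0.2}
assert abs(sum(interleave_probs.values()) - 1.0) < 1e-9
```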
Step 3: Distributed Worker Initialization
Launch the Ray cluster and initialize distributed worker groups: actor training cluster (policy optimization), actor inference cluster (response generation), reference model cluster (KL divergence computation), and one or more reward worker clusters (domain-specific reward evaluation). Each cluster loads the model with its designated strategy backend.
What happens:
- Actor training workers load the model with the training strategy (Megatron, DeepSpeed, or FSDP2)
- Actor inference workers load the model with a high-throughput inference engine (vLLM or SGLang)
- Reference workers load a frozen copy of the initial policy for KL penalty computation
- Reward workers initialize domain-specific evaluators (math parsers, code sandboxes, LLM judges)
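The four worker roles can be summarized in a small sketch. `WorkerGroup` and the role/backend strings below are illustrative stand-ins for ROLL's Ray-based cluster abstractions, not its actual API; the point is that only the actor training cluster holds trainable weights.

```python
from dataclasses import dataclass

@dataclass
class WorkerGroup:
    role: str        # "actor_train", "actor_infer", "reference", "reward/*"
    backend: str     # strategy backend or evaluator type (illustrative)
    gpus: list       # GPU indices from the config's device mapping
    trainable: bool  # only the actor training cluster updates weights

clusters = [
    WorkerGroup("actor_train", "megatron", list(range(8)), True),
    WorkerGroup("actor_infer", "vllm", list(range(8)), False),       # colocated
    WorkerGroup("reference", "megatron_infer", list(range(8)), False),
    WorkerGroup("reward/math", "rule_reward", [], False),            # CPU-only
]
trainable = [c.role for c in clusters if c.trainable]
```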
Step 4: Response Generation (Rollout)
Sample a batch of prompts from the multi-domain dataset and generate multiple response sequences per prompt using the actor inference engine. The generation uses configurable sampling parameters (temperature, top-p) and produces num_return_sequences_in_group responses per prompt for variance reduction.
Key considerations:
- Inference workers use offload/reload cycles to share GPUs with training workers in colocated mode
- Dynamic sampling schedules generation across multiple inference workers with load balancing
- Difficulty masking can filter out prompts that are too easy or too hard based on historical reward statistics
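The difficulty-masking idea above can be sketched as a filter over historical per-prompt reward statistics. The thresholds here are illustrative, not ROLL defaults: prompts the policy always fails or always solves contribute little gradient signal under group-relative baselines.

```python
# Sketch of difficulty masking over historical mean rewards per prompt.
def difficulty_mask(prompt_stats, low=0.05, high=0.95):
    """Keep prompts whose historical mean reward is away from 0 and 1;
    saturated prompts yield near-zero group-relative advantages."""
    return {p: s for p, s in prompt_stats.items() if low <= s <= high}

# Hypothetical historical pass rates keyed by prompt id.
stats = {"p1": 0.0, "p2": 0.5, "p3": 1.0, "p4": 0.3}
kept = difficulty_mask(stats)   # p1 (too hard) and p3 (too easy) dropped
```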
Step 5: Reward Computation
Route generated responses to the appropriate domain-specific reward workers based on their domain tags. Each reward worker evaluates the response using its verification method: math rule checking extracts and validates answers, code sandbox executes generated code against test cases, LLM judge scores open-ended responses, and IFEval checks instruction-following compliance.
What happens:
- Responses are dispatched to reward workers based on domain tags
- Each reward worker computes a scalar reward per response
- Rewards are collected, clipped (reward_clip), and optionally normalized per domain
- Response-level rewards are aggregated back into the training batch
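Reward routing can be sketched as a domain-to-function registry. The verifiers below are toy stand-ins for ROLL's math rule / sandbox / judge workers, and the clipping bound is illustrative of the reward_clip step.

```python
import re

# Toy verifiers standing in for real reward workers (assumptions, not
# ROLL's implementations).
def math_reward(response, ground_truth):
    """Extract a trailing number and compare against the reference answer."""
    m = re.search(r"(-?\d+(?:\.\d+)?)\s*$", response.strip())
    return 1.0 if m and m.group(1) == ground_truth else 0.0

def length_reward(response, _gt=None):
    """Trivial rule-based check used here as a placeholder evaluator."""
    return 1.0 if len(response) < 200 else 0.0

REWARD_WORKERS = {"math": math_reward, "general": length_reward}

def score(sample, reward_clip=10.0):
    fn = REWARD_WORKERS[sample["tag"]]          # dispatch by domain tag
    raw = fn(sample["response"], sample.get("ground_truth"))
    return max(-reward_clip, min(reward_clip, raw))  # clip the scalar reward

r = score({"tag": "math", "response": "The answer is 5", "ground_truth": "5"})
```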
Step 6: Advantage Estimation and KL Penalty
Compute reference model log probabilities for KL divergence penalty. Calculate token-level rewards by combining response rewards with per-token KL penalties (init_kl_coef controls the penalty strength). Estimate advantages using the configured algorithm (GRPO group-relative normalization, GAE for PPO, or batch-level normalization for Reinforce++).
Key considerations:
- KL coefficient can be adaptively adjusted based on a target KL divergence
- Advantages are optionally whitened (zero mean, unit variance) for training stability
- Advantage clipping bounds extreme advantage values to prevent destabilizing updates
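The GRPO path described above can be sketched numerically: group-relative normalization of response rewards, and a token-level reward stream that subtracts a KL penalty at every token and adds the scalar reward on the final token. This is the common GRPO / PPO-with-KL formulation, not necessarily ROLL's exact code; `init_kl_coef` follows the name used in this step.

```python
import math

def grpo_advantages(group_rewards, eps=1e-8):
    """Normalize rewards within a group of responses to the same prompt."""
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in group_rewards]

def token_rewards(final_reward, logp_policy, logp_ref, init_kl_coef=0.1):
    """Per-token KL penalty; the scalar reward lands on the last token."""
    kl = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    rewards = [-init_kl_coef * k for k in kl]
    rewards[-1] += final_reward
    return rewards

adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])   # two passes, two fails
tr = token_rewards(1.0, [-1.0, -2.0], [-1.1, -2.2])
```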
Step 7: Policy Optimization
Update the actor model parameters using the computed advantages and the selected RL algorithm's loss function. The training step applies gradient accumulation across micro-batches, clips the policy ratio (PPO-style), and optionally updates the critic model (for PPO with GAE). Gradient norms are tracked and clipped for stability.
What happens:
- Forward pass computes current log probabilities under the updated policy
- Loss is computed using the algorithm-specific objective (clipped surrogate, GRPO, etc.)
- Gradients are accumulated across micro-batches and DP ranks
- Optimizer step updates model weights with learning rate scheduling
- Updated weights are synchronized to inference workers for the next rollout
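The PPO-style ratio clipping mentioned above can be sketched as the standard clipped-surrogate objective over per-token log probabilities. The clip range of 0.2 is a common default, not a ROLL-specific value, and the micro-batch machinery is omitted.

```python
import math

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_range=0.2):
    """Mean clipped-surrogate loss; minimizing it maximizes the surrogate."""
    losses = []
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                 # pi_new / pi_old per token
        unclipped = ratio * a
        clipped = max(min(ratio, 1 + clip_range), 1 - clip_range) * a
        losses.append(-min(unclipped, clipped))   # pessimistic bound
    return sum(losses) / len(losses)

loss = clipped_surrogate_loss([-1.0, -2.0], [-1.2, -1.8], [1.0, -0.5])
```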
Step 8: Evaluation and Checkpointing
Periodically evaluate the policy on a held-out validation dataset by generating responses and computing reward metrics. Save model checkpoints at configured intervals for recovery and deployment. Log training metrics (rewards, KL divergence, loss, advantages) to the configured tracker (TensorBoard, Weights and Biases, or SwanLab).
Key considerations:
- Validation uses greedy decoding (or low-temperature sampling) so that evaluation results are reproducible across runs
- Checkpoints include both the model weights and optimizer state for training resumption
- Megatron checkpoints can be converted to HuggingFace format for deployment
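Per-domain validation aggregation can be sketched as grouping rewards by domain tag and logging one mean metric per domain. The metric naming scheme (`val/<domain>/mean_reward`) is illustrative, not a fixed ROLL convention.

```python
from collections import defaultdict

def eval_metrics(samples):
    """samples: list of {"tag": domain, "reward": float} from validation decoding."""
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[s["tag"]].append(s["reward"])
    # One mean-reward metric per domain, ready for the configured tracker.
    return {f"val/{d}/mean_reward": sum(v) / len(v) for d, v in by_domain.items()}

m = eval_metrics([
    {"tag": "math", "reward": 1.0},
    {"tag": "math", "reward": 0.0},
    {"tag": "code", "reward": 1.0},
])
```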