Workflow: Alibaba ROLL Agentic RL Training Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, Agentic_AI, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
End-to-end process for training LLM agents through multi-turn environment interaction using reinforcement learning with trajectory-based or step-based reward optimization.
Description
This workflow implements the Agentic RL training pipeline in the ROLL framework. Unlike the RLVR pipeline which processes single-turn prompt-response pairs, the Agentic pipeline trains LLM agents that interact with environments over multiple turns, collecting trajectories of observations, actions, and rewards. It supports diverse environments (Sokoban, FrozenLake, WebShop, GEM tasks) and multiple training paradigms including trajectory-wise optimization (StarPO) and step-wise optimization (GiGPO). The pipeline uses asynchronous parallel rollout at environment granularity for efficient trajectory collection.
Usage
Execute this workflow when you have an instruction-tuned LLM and want to train it as an agent that interacts with structured environments (games, web navigation, tool use, math reasoning with tools) through multi-turn dialogue, using RL to optimize the agent's sequential decision-making capabilities.
Execution Steps
Step 1: Environment Setup and Configuration
Prepare the compute environment and define the Hydra YAML configuration specifying the model, environment definitions, worker device mappings, and RL algorithm parameters. Configure the environment manager type (TrajEnvManager for trajectory-based or StepEnvManager for step-based) and define training and validation environment specifications.
Key considerations:
- Choose between trajectory-wise (StarPO) and step-wise (GiGPO) training based on the task structure
- Configure num_env_groups and group_size to control the number of parallel environment instances and responses per environment
- Define separate training and validation environment configurations with appropriate generation parameters
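A minimal Hydra-style YAML sketch of such a configuration is shown below. Key names here are illustrative assumptions, not the exact ROLL schema; consult the framework's example configs for the real field names.

```yaml
# Illustrative sketch only -- actual ROLL config keys may differ.
train_env_manager:
  env_manager_cls: TrajEnvManager   # or StepEnvManager for step-wise training
  num_env_groups: 8                 # parallel environment groups
  group_size: 4                     # rollouts per environment group
  env_configs:
    sokoban:
      env_type: sokoban
      max_steps: 30
val_env_manager:
  num_env_groups: 16
  group_size: 1                     # single response per env for evaluation
  env_configs:
    sokoban:
      env_type: sokoban
adv_estimator: grpo                 # or gigpo for step-wise advantages
```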
Step 2: Environment and Dataset Preparation
Prepare the environment implementations and prompt datasets. Environments must implement the ROLL environment interface providing reset, step, and reward methods. For GEM-based environments, configure the environment YAML with task definitions, tool access, and reward functions. Prepare prompt datasets that seed the environment interactions.
What happens:
- Environment classes are registered with the environment manager
- Initial prompts or environment states are loaded from dataset files
- Validation environments are configured with low-temperature sampling for near-deterministic, consistent evaluation
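The environment interface described above can be sketched as follows. This is a toy grid environment with hypothetical method signatures, not the exact ROLL interface: reset() yields an initial text observation, and step() consumes the agent's text action and returns the usual (observation, reward, done, info) tuple.

```python
class GridEnv:
    """Toy 1-D grid environment illustrating the reset/step/reward contract.
    Names and signatures are assumptions, not the ROLL API."""

    def __init__(self, target: int = 3, max_steps: int = 5):
        self.target = target
        self.max_steps = max_steps

    def reset(self) -> str:
        # Return the initial observation that seeds the first LLM turn.
        self.pos = 0
        self.steps = 0
        return f"You are at {self.pos}. Target is {self.target}. Move 'left' or 'right'."

    def step(self, action: str):
        # Parse the agent's text action, advance the state, and emit
        # (observation, reward, done, info) as in a Gym-style interface.
        self.steps += 1
        self.pos += 1 if "right" in action.lower() else -1
        done = self.pos == self.target or self.steps >= self.max_steps
        reward = 1.0 if self.pos == self.target else 0.0
        return f"You are at {self.pos}.", reward, done, {}
```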
Step 3: Distributed Worker Initialization
Launch the Ray cluster and initialize worker groups: actor training cluster, actor inference cluster, optional critic cluster (for PPO), and reference model cluster. Initialize rollout schedulers for both training and validation that manage the asynchronous environment interaction loop.
Key considerations:
- Rollout schedulers manage environment-level asynchronous parallel execution
- Partial GPU mode can share GPUs between training and inference by shrinking/expanding the inference sampler
- The rollout scheduler uses a GroupQueueManager to collect complete trajectory groups before dispatching to training
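The group-collection behavior of the GroupQueueManager can be sketched like this. This is an illustrative stand-in, not the ROLL class itself: finished trajectories are buffered per group, and a group is released to training only once all of its members have arrived.

```python
from collections import defaultdict


class GroupQueueManager:
    """Sketch of group-wise trajectory collection (illustrative, not ROLL's
    implementation): buffer per-group results and release complete groups."""

    def __init__(self, group_size: int):
        self.group_size = group_size
        self.buffers = defaultdict(list)

    def put(self, group_id: str, trajectory):
        # Buffer one finished trajectory; when the group is complete,
        # pop and return it so the scheduler can dispatch it to training.
        self.buffers[group_id].append(trajectory)
        if len(self.buffers[group_id]) == self.group_size:
            return self.buffers.pop(group_id)
        return None  # group still incomplete
```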
Step 4: Trajectory Collection (Rollout)
Execute multi-turn environment interactions by alternating between LLM response generation and environment step execution. The rollout scheduler manages parallel environments, collecting complete trajectories when episodes terminate. Each trajectory contains the full sequence of observations, actions, and environment rewards.
What happens:
- Environments are reset with initial observations
- The LLM generates actions based on current observations (with chat-template formatting)
- Environments process actions and return new observations, rewards, and done flags
- Asynchronous parallel rollout processes environments independently without synchronization barriers
- Complete trajectories are grouped and batched for training
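The generate/step alternation above can be sketched as a single rollout loop. The environment and policy here are toy stand-ins (in the real pipeline the policy is LLM generation over chat-formatted observations, and many such loops run asynchronously in parallel):

```python
class CountdownEnv:
    """Toy stand-in environment: the episode ends after 3 steps with reward 1."""

    def reset(self):
        self.t = 0
        return "turn 0"

    def step(self, action):
        self.t += 1
        done = self.t >= 3
        return f"turn {self.t}", float(done), done, {}


def rollout(env, policy, max_turns: int = 8):
    """Collect one trajectory by alternating policy generation and env steps."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_turns):
        action = policy(obs)                      # LLM generation in the real pipeline
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward))  # record (observation, action, reward)
        obs = next_obs
        if done:
            break
    return trajectory
```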
Step 5: Reward Computation and Return Estimation
Compute trajectory-level or step-level rewards from environment signals. For trajectory-wise training, the final environment reward is propagated to all tokens in the trajectory. For step-wise training (GiGPO), rewards are assigned at each action step with discounted returns computed across the trajectory.
Key considerations:
- Reward normalization uses group-based mean-std normalization across trajectory groups
- KL penalty is computed per-token using reference model log probabilities
- Discounted returns can be computed with configurable gamma for multi-step credit assignment
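The two core computations in this step, discounted returns and group-based mean-std normalization, can be sketched as plain functions (a minimal illustration of the math, not ROLL's tensorized implementation):

```python
def discounted_returns(rewards, gamma: float = 0.99):
    """Step-wise returns G_t = r_t + gamma * G_{t+1}, computed backward."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def group_normalize(scores, eps: float = 1e-6):
    """Mean-std normalize trajectory scores within one trajectory group."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return [(s - mean) / ((var + eps) ** 0.5) for s in scores]
```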
Step 6: Advantage Estimation
Compute advantages using the configured estimator. GRPO computes group-relative advantages by normalizing rewards within each trajectory group. GiGPO computes step-level advantages for fine-grained credit assignment within trajectories. PPO uses GAE (Generalized Advantage Estimation) with a learned value function.
Key considerations:
- Advantage whitening stabilizes training across batches
- Advantage clipping prevents extreme updates from outlier trajectories
- Segment-masked mean computation handles variable-length trajectories within packed batches
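The masked-mean and whitening operations mentioned above can be sketched as follows. Masks mark valid token positions, so statistics ignore padding and segment boundaries in packed batches (a scalar illustration of the idea, not ROLL's tensor code):

```python
def masked_mean(values, mask):
    """Mean over positions where mask == 1; padding contributes nothing."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / max(count, 1)


def masked_whiten(values, mask, eps: float = 1e-6):
    """Whiten (zero-mean, unit-std) advantages over masked positions only."""
    mean = masked_mean(values, mask)
    var = masked_mean([(v - mean) ** 2 for v in values], mask)
    return [((v - mean) / (var + eps) ** 0.5) * m for v, m in zip(values, mask)]
```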
Step 7: Policy Optimization
Update the actor model using the computed advantages. The training step processes packed trajectories with attention masks that respect turn boundaries. For step-wise training, loss is computed per action step. Updated weights are synchronized to inference workers for the next rollout cycle.
What happens:
- Forward pass computes log probabilities for all action tokens in the trajectory
- Algorithm-specific loss (GRPO, PPO clip, Reinforce++) is computed with advantages
- Gradients are accumulated and applied with learning rate scheduling
- Model weights are broadcast to inference workers
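As one concrete instance of the algorithm-specific losses named above, the per-token PPO clipped objective can be sketched like this (scalar illustration of the standard formula, not ROLL's batched implementation):

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO clipped surrogate: ratio = exp(logp_new - logp_old),
    loss = -min(ratio * A, clip(ratio, 1-eps, 1+eps) * A), averaged over tokens."""
    losses = []
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        losses.append(-min(ratio * a, clipped * a))
    return sum(losses) / len(losses)
```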
Step 8: Validation and Checkpointing
Periodically evaluate the agent on validation environments using low-temperature sampling. Track success rates, average rewards, and trajectory lengths across different environment types. Save model checkpoints and log metrics to the configured tracking backend.
Key considerations:
- Validation environments may include different difficulty levels or environment variants
- Async validation runs concurrently with training to minimize idle time
- Per-environment-type metrics provide granular performance tracking
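The per-environment-type metric tracking above amounts to grouping validation records by environment type before averaging; a minimal sketch (field names like "env_type" and "success" are assumptions):

```python
from collections import defaultdict


def per_env_metrics(records):
    """Group validation records by environment type and report the mean
    success rate per type (illustrative aggregation, not ROLL's logger)."""
    groups = defaultdict(list)
    for r in records:
        groups[r["env_type"]].append(r["success"])
    return {env: sum(v) / len(v) for env, v in groups.items()}
```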