Workflow: Alibaba ROLL Agentic RL Training Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, Agentic_AI, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
End-to-end process for training LLM agents through multi-turn environment interaction using reinforcement learning with trajectory-based or step-based reward optimization.
Description
This workflow implements the Agentic RL training pipeline in the ROLL framework. Unlike the RLVR pipeline which processes single-turn prompt-response pairs, the Agentic pipeline trains LLM agents that interact with environments over multiple turns, collecting trajectories of observations, actions, and rewards. It supports diverse environments (Sokoban, FrozenLake, WebShop, GEM tasks) and multiple training paradigms including trajectory-wise optimization (StarPO) and step-wise optimization (GiGPO). The pipeline uses asynchronous parallel rollout at environment granularity for efficient trajectory collection.
Usage
Execute this workflow when you have an instruction-tuned LLM and want to train it as an agent that interacts with structured environments (games, web navigation, tool use, math reasoning with tools) through multi-turn dialogue, using RL to optimize the agent's sequential decision-making capabilities.
Execution Steps
Step 1: Environment Setup and Configuration
Prepare the compute environment and define the Hydra YAML configuration specifying the model, environment definitions, worker device mappings, and RL algorithm parameters. Configure the environment manager type (TrajEnvManager for trajectory-based or StepEnvManager for step-based) and define training and validation environment specifications.
Key considerations:
- Choose between trajectory-wise (StarPO) and step-wise (GiGPO) training based on the task structure
- Configure num_env_groups and group_size to control the number of parallel environment instances and responses per environment
- Define separate training and validation environment configurations with appropriate generation parameters
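A minimal Hydra-style YAML sketch of such a configuration is shown below. Key names here are illustrative assumptions, not the exact ROLL schema; consult the framework's example configs for the real field names.

```yaml
# Illustrative sketch only -- actual ROLL config keys may differ.
train_env_manager:
  env_manager_cls: TrajEnvManager   # or StepEnvManager for step-wise training
  num_env_groups: 8                 # parallel environment groups
  group_size: 4                     # rollouts per environment group
  env_configs:
    sokoban:
      env_type: sokoban
      max_steps: 30
val_env_manager:
  num_env_groups: 16
  group_size: 1                     # single response per env for evaluation
  env_configs:
    sokoban:
      env_type: sokoban
adv_estimator: grpo                 # or gigpo for step-wise advantages
```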
Step 2: Environment and Dataset Preparation
Prepare the environment implementations and prompt datasets. Environments must implement the ROLL environment interface providing reset, step, and reward methods. For GEM-based environments, configure the environment YAML with task definitions, tool access, and reward functions. Prepare prompt datasets that seed the environment interactions.
What happens:
- Environment classes are registered with the environment manager
- Initial prompts or environment states are loaded from dataset files
- Validation environments are configured with low-temperature sampling for near-deterministic, consistent evaluation
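The environment interface described above can be sketched as follows. This is a toy grid environment with hypothetical method signatures, not the exact ROLL interface: reset() yields an initial text observation, and step() consumes the agent's text action and returns the usual (observation, reward, done, info) tuple.

```python
class GridEnv:
    """Toy 1-D grid environment illustrating the reset/step/reward contract.
    Names and signatures are assumptions, not the ROLL API."""

    def __init__(self, target: int = 3, max_steps: int = 5):
        self.target = target
        self.max_steps = max_steps

    def reset(self) -> str:
        # Return the initial observation that seeds the first LLM turn.
        self.pos = 0
        self.steps = 0
        return f"You are at {self.pos}. Target is {self.target}. Move 'left' or 'right'."

    def step(self, action: str):
        # Parse the agent's text action, advance the state, and emit
        # (observation, reward, done, info) as in a Gym-style interface.
        self.steps += 1
        self.pos += 1 if "right" in action.lower() else -1
        done = self.pos == self.target or self.steps >= self.max_steps
        reward = 1.0 if self.pos == self.target else 0.0
        return f"You are at {self.pos}.", reward, done, {}
```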
Step 3: Distributed Worker Initialization
Launch the Ray cluster and initialize worker groups: actor training cluster, actor inference cluster, optional critic cluster (for PPO), and reference model cluster. Initialize rollout schedulers for both training and validation that manage the asynchronous environment interaction loop.
Key considerations:
- Rollout schedulers manage environment-level asynchronous parallel execution
- Partial GPU mode can share GPUs between training and inference by shrinking/expanding the inference sampler
- The rollout scheduler uses a GroupQueueManager to collect complete trajectory groups before dispatching to training
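The group-collection behavior of the GroupQueueManager can be sketched like this. This is an illustrative stand-in, not the ROLL class itself: finished trajectories are buffered per group, and a group is released to training only once all of its members have arrived.

```python
from collections import defaultdict


class GroupQueueManager:
    """Sketch of group-wise trajectory collection (illustrative, not ROLL's
    implementation): buffer per-group results and release complete groups."""

    def __init__(self, group_size: int):
        self.group_size = group_size
        self.buffers = defaultdict(list)

    def put(self, group_id: str, trajectory):
        # Buffer one finished trajectory; when the group is complete,
        # pop and return it so the scheduler can dispatch it to training.
        self.buffers[group_id].append(trajectory)
        if len(self.buffers[group_id]) == self.group_size:
            return self.buffers.pop(group_id)
        return None  # group still incomplete
```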
Step 4: Trajectory Collection (Rollout)
Execute multi-turn environment interactions by alternating between LLM response generation and environment step execution. The rollout scheduler manages parallel environments, collecting complete trajectories when episodes terminate. Each trajectory contains the full sequence of observations, actions, and environment rewards.
What happens:
- Environments are reset with initial observations
- The LLM generates actions based on current observations (with chat-template formatting)
- Environments process actions and return new observations, rewards, and done flags
- Asynchronous parallel rollout processes environments independently without synchronization barriers
- Complete trajectories are grouped and batched for training
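The generate/step alternation above can be sketched as a single rollout loop. The environment and policy here are toy stand-ins (in the real pipeline the policy is LLM generation over chat-formatted observations, and many such loops run asynchronously in parallel):

```python
class CountdownEnv:
    """Toy stand-in environment: the episode ends after 3 steps with reward 1."""

    def reset(self):
        self.t = 0
        return "turn 0"

    def step(self, action):
        self.t += 1
        done = self.t >= 3
        return f"turn {self.t}", float(done), done, {}


def rollout(env, policy, max_turns: int = 8):
    """Collect one trajectory by alternating policy generation and env steps."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_turns):
        action = policy(obs)                      # LLM generation in the real pipeline
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward))  # record (observation, action, reward)
        obs = next_obs
        if done:
            break
    return trajectory
```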
Step 5: Reward Computation and Return Estimation
Compute trajectory-level or step-level rewards from environment signals. For trajectory-wise training, the final environment reward is propagated to all tokens in the trajectory. For step-wise training (GiGPO), rewards are assigned at each action step with discounted returns computed across the trajectory.
Key considerations:
- Reward normalization uses group-based mean-std normalization across trajectory groups
- KL penalty is computed per-token using reference model log probabilities
- Discounted returns can be computed with configurable gamma for multi-step credit assignment
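The two core computations in this step, discounted returns and group-based mean-std normalization, can be sketched as plain functions (a minimal illustration of the math, not ROLL's tensorized implementation):

```python
def discounted_returns(rewards, gamma: float = 0.99):
    """Step-wise returns G_t = r_t + gamma * G_{t+1}, computed backward."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def group_normalize(scores, eps: float = 1e-6):
    """Mean-std normalize trajectory scores within one trajectory group."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return [(s - mean) / ((var + eps) ** 0.5) for s in scores]
```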
Step 6: Advantage Estimation
Compute advantages using the configured estimator. GRPO computes group-relative advantages by normalizing rewards within each trajectory group. GiGPO computes step-level advantages for fine-grained credit assignment within trajectories. PPO uses GAE (Generalized Advantage Estimation) with a learned value function.
Key considerations:
- Advantage whitening stabilizes training across batches
- Advantage clipping prevents extreme updates from outlier trajectories
- Segment-masked mean computation handles variable-length trajectories within packed batches
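The masked-mean and whitening operations mentioned above can be sketched as follows. Masks mark valid token positions, so statistics ignore padding and segment boundaries in packed batches (a scalar illustration of the idea, not ROLL's tensor code):

```python
def masked_mean(values, mask):
    """Mean over positions where mask == 1; padding contributes nothing."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / max(count, 1)


def masked_whiten(values, mask, eps: float = 1e-6):
    """Whiten (zero-mean, unit-std) advantages over masked positions only."""
    mean = masked_mean(values, mask)
    var = masked_mean([(v - mean) ** 2 for v in values], mask)
    return [((v - mean) / (var + eps) ** 0.5) * m for v, m in zip(values, mask)]
```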
Step 7: Policy Optimization
Update the actor model using the computed advantages. The training step processes packed trajectories with attention masks that respect turn boundaries. For step-wise training, loss is computed per action step. Updated weights are synchronized to inference workers for the next rollout cycle.
What happens:
- Forward pass computes log probabilities for all action tokens in the trajectory
- Algorithm-specific loss (GRPO, PPO clip, Reinforce++) is computed with advantages
- Gradients are accumulated and applied with learning rate scheduling
- Model weights are broadcast to inference workers
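As one concrete instance of the algorithm-specific losses named above, the per-token PPO clipped objective can be sketched like this (scalar illustration of the standard formula, not ROLL's batched implementation):

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO clipped surrogate: ratio = exp(logp_new - logp_old),
    loss = -min(ratio * A, clip(ratio, 1-eps, 1+eps) * A), averaged over tokens."""
    losses = []
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        losses.append(-min(ratio * a, clipped * a))
    return sum(losses) / len(losses)
```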
Step 8: Validation and Checkpointing
Periodically evaluate the agent on validation environments using low-temperature sampling. Track success rates, average rewards, and trajectory lengths across different environment types. Save model checkpoints and log metrics to the configured tracking backend.
Key considerations:
- Validation environments may include different difficulty levels or environment variants
- Async validation runs concurrently with training to minimize idle time
- Per-environment-type metrics provide granular performance tracking
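The per-environment-type metric tracking above amounts to grouping validation records by environment type before averaging; a minimal sketch (field names like "env_type" and "success" are assumptions):

```python
from collections import defaultdict


def per_env_metrics(records):
    """Group validation records by environment type and report the mean
    success rate per type (illustrative aggregation, not ROLL's logger)."""
    groups = defaultdict(list)
    for r in records:
        groups[r["env_type"]].append(r["success"])
    return {env: sum(v) / len(v) for env, v in groups.items()}
```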