Principle:Alibaba ROLL Trajectory Collection

Knowledge Sources	PPO, GiGPO, Alibaba ROLL
Domains	Reinforcement_Learning, Agentic_AI
Last Updated	2026-02-07 20:00 GMT

Overview

A data collection principle for assembling batches of completed multi-turn trajectories from asynchronous environment workers into training-ready tensors.

Description

Trajectory Collection is the process of gathering completed episodes from environment managers and assembling them into batches suitable for policy gradient training. The key challenge is that trajectories are produced asynchronously with varying lengths, and must be grouped, padded, and annotated with metadata (trajectory IDs, group IDs, environment tags, step scores, episode scores) before training.

The collection process ensures:

Group completeness: all episodes in a group finish before the group is yielded for training.
Tensor formatting: variable-length trajectories are padded to a common length, and masks are created to mark the padded positions.
Metadata preservation: environment-specific scores and IDs are kept alongside the tensors for advantage computation.
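
The three guarantees above can be illustrated with a minimal sketch. It is not ROLL's implementation: the function names (collect_groups, pad_batch), the dictionary keys, the right-padding choice, and the pad value 0 are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
from collections import defaultdict


def collect_groups(completed, group_size):
    """Yield (group_id, episodes) once a group has `group_size` finished episodes.

    `completed` is any iterable of trajectory dicts arriving in arbitrary order
    (hypothetical stand-in for the output of asynchronous environment workers).
    """
    pending = defaultdict(list)
    for traj in completed:
        gid = traj["group_id"]
        pending[gid].append(traj)
        if len(pending[gid]) == group_size:
            # Group is complete: hand it over and stop buffering it.
            yield gid, pending.pop(gid)


def pad_batch(trajs, pad_id=0):
    """Right-pad token lists to the longest length and build a 0/1 mask."""
    max_len = max(len(t["tokens"]) for t in trajs)
    tokens, mask = [], []
    for t in trajs:
        n = len(t["tokens"])
        tokens.append(t["tokens"] + [pad_id] * (max_len - n))
        mask.append([1] * n + [0] * (max_len - n))
    return {
        "tokens": tokens,
        "mask": mask,
        # Metadata travels with the padded data for later advantage computation.
        "traj_id": [t["traj_id"] for t in trajs],
        "group_id": [t["group_id"] for t in trajs],
        "episode_score": [t["episode_score"] for t in trajs],
    }


# Episodes from two groups arrive interleaved; group 1 never completes.
stream = [
    {"traj_id": "a", "group_id": 0, "tokens": [5, 6, 7], "episode_score": 1.0},
    {"traj_id": "b", "group_id": 1, "tokens": [8], "episode_score": 0.0},
    {"traj_id": "c", "group_id": 0, "tokens": [9, 4], "episode_score": 0.5},
]
batches = [(gid, pad_batch(eps)) for gid, eps in collect_groups(stream, group_size=2)]
```

In this sketch only group 0 is yielded, because group 1 has one of its two episodes outstanding. A real collector would also need a policy for groups that never complete, which the source does not describe.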

Usage

Use this principle in the data collection phase of agentic RL training, after environment managers produce trajectories and before advantage estimation.

Theoretical Basis

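Group-relative advantage estimation (GRPO, GiGPO) needs several trajectories per initial state, so that each trajectory's reward can be normalized against the other rewards in its group:

$\hat{A}_{i} = \frac{r_{i} - \mu_{\text{group}}}{\sigma_{\text{group}}}$

where $r_{i}$ is the reward of trajectory $i$, and $\mu_{\text{group}}$ and $\sigma_{\text{group}}$ are the mean and standard deviation of the rewards in its group.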
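
The group normalization can be sketched in a few lines of Python. Two details here are assumptions rather than anything stated in the source: the population standard deviation is used, and a small eps is added to the denominator so that a group with identical rewards does not divide by zero. The function name group_advantages is hypothetical.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize each reward by its group's mean and standard deviation."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four episode scores collected for the same group.
advantages = group_advantages([1.0, 0.0, 0.5, 0.5])
```

Trajectories that score above the group mean receive a positive advantage, those below receive a negative one, and the advantages within a group sum to roughly zero.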

The group structure is determined by the seed assignment:

group_seed = base_seed + group_id
episode_seed = group_seed + episode_id
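
As a worked example, the two formulas can be applied literally to a small configuration. The helper name episode_seeds and the concrete values (base_seed=100, two groups of two episodes) are illustrative assumptions.

```python
def episode_seeds(base_seed, num_groups, group_size):
    """Map (group_id, episode_id) to the seed given by the two formulas above."""
    seeds = {}
    for group_id in range(num_groups):
        group_seed = base_seed + group_id
        for episode_id in range(group_size):
            seeds[(group_id, episode_id)] = group_seed + episode_id
    return seeds


seeds = episode_seeds(base_seed=100, num_groups=2, group_size=2)
# seeds[(0, 0)] is 100, seeds[(0, 1)] is 101,
# seeds[(1, 0)] is 101, seeds[(1, 1)] is 102
```

Read literally, the formulas give overlapping values across neighbouring groups: group 0, episode 1 and group 1, episode 0 both receive base_seed + 1. Whether ROLL applies the formulas exactly this way, or spaces group seeds further apart, is not stated in the source.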

Related Pages

Implemented By

Implementation:Alibaba_ROLL_RolloutScheduler_Get_Batch

