Principle:Alibaba ROLL Trajectory Collection
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Agentic_AI |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A data collection principle for assembling batches of completed multi-turn trajectories from asynchronous environment workers into training-ready tensors.
Description
Trajectory Collection is the process of gathering completed episodes from environment managers and assembling them into batches suitable for policy gradient training. The key challenge is that trajectories are produced asynchronously with varying lengths, and must be grouped, padded, and annotated with metadata (trajectory IDs, group IDs, environment tags, step scores, episode scores) before training.
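The padding and annotation step described above can be sketched as follows. This is a minimal illustration, not the framework's implementation: the `Trajectory` dataclass, its field names, and the `collate` helper are all hypothetical, and plain Python lists stand in for the tensors a real pipeline would build.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    # Hypothetical record; fields mirror the metadata listed in the text.
    traj_id: str
    group_id: int
    env_tag: str
    token_ids: List[int]
    step_scores: List[float]
    episode_score: float


def collate(trajs: List[Trajectory], pad_id: int = 0) -> Dict[str, list]:
    """Right-pad token sequences to a common length and build a 0/1 mask."""
    max_len = max(len(t.token_ids) for t in trajs)
    input_ids, attention_mask = [], []
    for t in trajs:
        n_pad = max_len - len(t.token_ids)
        input_ids.append(t.token_ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(t.token_ids) + [0] * n_pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        # Metadata is carried alongside the padded sequences, one entry per row.
        "traj_id": [t.traj_id for t in trajs],
        "group_id": [t.group_id for t in trajs],
        "env_tag": [t.env_tag for t in trajs],
        "step_scores": [t.step_scores for t in trajs],
        "episode_score": [t.episode_score for t in trajs],
    }


batch = collate([
    Trajectory("a", 0, "env", [5, 6, 7], [0.0, 1.0], 1.0),
    Trajectory("b", 0, "env", [8], [0.5], 0.5),
])
```

Keeping the metadata in the same row order as the padded sequences is what lets a later stage look up the group and score for any row of the batch.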
The collection process ensures:
- Group completeness: All episodes within a group complete before the group is yielded for training
- Tensor formatting: Variable-length trajectories are padded and masks are created
- Metadata preservation: Environment-specific scores and IDs are preserved for advantage computation
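The group-completeness guarantee can be illustrated with a small buffering generator. This is a sketch under stated assumptions: episodes are represented as hypothetical `(group_id, episode_id, episode_score)` tuples, the group size is fixed and known in advance, and `yield_complete_groups` is an illustrative name rather than a framework function.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical episode layout: (group_id, episode_id, episode_score).
Episode = Tuple[int, int, float]


def yield_complete_groups(
    episodes: Iterable[Episode], group_size: int
) -> Iterator[List[Episode]]:
    """Buffer episodes as they arrive; emit a group only once it is full."""
    pending: Dict[int, List[Episode]] = defaultdict(list)
    for ep in episodes:
        group_id = ep[0]
        pending[group_id].append(ep)
        if len(pending[group_id]) == group_size:
            # Remove the finished group from the buffer and emit it
            # ordered by episode_id.
            yield sorted(pending.pop(group_id), key=lambda e: e[1])


# Episodes arrive interleaved across groups; group 2 never completes.
arrivals = [(0, 0, 1.0), (1, 0, 0.0), (0, 1, 0.5), (2, 0, 0.3), (1, 1, 1.0)]
groups = list(yield_complete_groups(arrivals, group_size=2))
```

In this example the lone episode of group 2 stays in the buffer and is never yielded, which is the behavior the completeness guarantee describes.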
Usage
Use this principle in the data collection phase of agentic RL training, after environment managers produce trajectories and before advantage estimation.
Theoretical Basis
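For group-relative advantage estimation (GRPO, GiGPO), collecting multiple trajectories per initial state enables each trajectory's score to be compared against the scores of the other trajectories in its group, so the group itself serves as the baseline.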
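To illustrate why a complete group is needed, here is a minimal sketch of one common form of group-relative normalization: each episode score is centered on the group mean and scaled by the group standard deviation. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for the example; the exact estimator used by a given algorithm may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(scores: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize episode scores within one group (illustrative form only)."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population std; an assumption of this sketch
    # eps avoids division by zero when every score in the group is equal.
    return [(s - mu) / (sigma + eps) for s in scores]


adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the mean and standard deviation are taken over the whole group, the result is only meaningful once every episode in the group has finished.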
The group structure is determined by the seed assignment:
- group_seed = base_seed + group_id
- episode_seed = group_seed + episode_id
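The seed arithmetic above can be written out directly. The function names are hypothetical and the sketch reproduces only the two formulas; how the environment consumes each seed is not covered here.

```python
def group_seed(base_seed: int, group_id: int) -> int:
    """group_seed = base_seed + group_id"""
    return base_seed + group_id


def episode_seed(base_seed: int, group_id: int, episode_id: int) -> int:
    """episode_seed = group_seed + episode_id"""
    return group_seed(base_seed, group_id) + episode_id


# Example with base_seed=100, group_id=2, episode_id=3.
g = group_seed(100, 2)
e = episode_seed(100, 2, 3)
```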