Principle:Alibaba ROLL Trajectory Collection

From Leeroopedia


Knowledge Sources
Domains: Reinforcement_Learning, Agentic_AI
Last Updated: 2026-02-07 20:00 GMT

Overview

A data collection principle for assembling batches of completed multi-turn trajectories from asynchronous environment workers into training-ready tensors.

Description

Trajectory Collection is the process of gathering completed episodes from environment managers and assembling them into batches suitable for policy gradient training. The key challenge is that trajectories are produced asynchronously with varying lengths, and must be grouped, padded, and annotated with metadata (trajectory IDs, group IDs, environment tags, step scores, episode scores) before training.
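The padding and annotation step described above can be sketched in plain Python. This is a minimal illustration, not ROLL's implementation: the `Trajectory` record, its field names, the `collate` function, the `env_a` tag, and the use of nested lists instead of framework tensors are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    """Hypothetical record for one completed episode (field names are illustrative)."""
    traj_id: str
    group_id: int
    env_tag: str
    token_ids: List[int]
    step_scores: List[float]
    episode_score: float


def collate(trajs: List[Trajectory], pad_id: int = 0) -> Dict[str, list]:
    """Right-pad token sequences to a common length and keep metadata aligned by row."""
    if not trajs:
        raise ValueError("cannot collate an empty list of trajectories")
    max_len = max(len(t.token_ids) for t in trajs)
    input_ids, attention_mask = [], []
    for t in trajs:
        n_pad = max_len - len(t.token_ids)
        input_ids.append(t.token_ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the loss should ignore.
        attention_mask.append([1] * len(t.token_ids) + [0] * n_pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "traj_id": [t.traj_id for t in trajs],
        "group_id": [t.group_id for t in trajs],
        "env_tag": [t.env_tag for t in trajs],
        "step_scores": [t.step_scores for t in trajs],
        "episode_score": [t.episode_score for t in trajs],
    }


batch = collate([
    Trajectory("t0", 0, "env_a", [5, 6, 7], [0.0, 1.0], 1.0),
    Trajectory("t1", 0, "env_a", [8], [0.0], 0.0),
])
```

Row `i` of every field refers to the same trajectory, so the scores and IDs stay attached to the padded sequence they describe. In a real pipeline the nested lists would typically be converted to framework tensors after this step.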

The collection process ensures:

  • Group completeness: All episodes within a group complete before the group is yielded for training
  • Tensor formatting: Variable-length trajectories are padded and masks are created
  • Metadata preservation: Environment-specific scores and IDs are preserved for advantage computation
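The group-completeness guarantee can be illustrated with a small buffer that holds episodes until their group is full. This is a sketch under stated assumptions: `GroupBuffer`, its `add` method, and the dictionary keys are hypothetical names, and a fixed, known group size is assumed.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class GroupBuffer:
    """Hypothetical buffer that releases a group only once all its episodes arrived."""

    def __init__(self, group_size: int) -> None:
        if group_size < 1:
            raise ValueError("group_size must be at least 1")
        self.group_size = group_size
        self._pending: Dict[int, List[dict]] = defaultdict(list)

    def add(self, episode: dict) -> Optional[List[dict]]:
        """Store one finished episode; return the whole group if it is now complete."""
        gid = episode["group_id"]
        self._pending[gid].append(episode)
        if len(self._pending[gid]) == self.group_size:
            # Remove the group from the buffer so it is yielded exactly once.
            return self._pending.pop(gid)
        return None


buf = GroupBuffer(group_size=2)
first = buf.add({"group_id": 0, "traj_id": "a", "episode_score": 1.0})
other = buf.add({"group_id": 1, "traj_id": "c", "episode_score": 0.5})
done = buf.add({"group_id": 0, "traj_id": "b", "episode_score": 0.0})
```

Episodes may arrive in any order from asynchronous workers; here group 0 is released on the third call while group 1 stays buffered until its second episode finishes.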

Usage

Use this principle in the data collection phase of agentic RL training, after environment managers produce trajectories and before advantage estimation.

Theoretical Basis

For group-relative advantage estimation (GRPO, GiGPO), collecting multiple trajectories per initial state lets each trajectory's reward be normalized against the statistics of its own group:

$$\hat{A}_i = \frac{r_i - \mu_{\text{group}}}{\sigma_{\text{group}}}$$
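As a worked illustration of the normalization above, the sketch below computes advantages for one group of rewards. It assumes that the reward of trajectory i is a scalar per-trajectory score, that the standard deviation is the population form, and that a small `eps` is added to the denominator to avoid division by zero when all rewards in a group are equal. None of these choices is confirmed by this page, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import fmean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    if not rewards:
        raise ValueError("a group must contain at least one reward")
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    # eps keeps the division finite when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

With rewards `[1.0, 0.0, 0.0, 1.0]` the group mean and standard deviation are both 0.5, so the advantages are approximately +1 and -1. A group with identical rewards yields all-zero advantages and therefore contributes no learning signal.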

The group structure is determined by the seed assignment:

  • group_seed = base_seed + group_id
  • episode_seed = group_seed + episode_id
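The seed arithmetic above can be written out directly. The function names are illustrative, and how the resulting seeds are consumed (for example, which one is passed to the environment reset) is not specified on this page.

```python
def group_seed(base_seed: int, group_id: int) -> int:
    """Seed shared by every episode in a group."""
    return base_seed + group_id


def episode_seed(base_seed: int, group_id: int, episode_id: int) -> int:
    """Seed for one episode, offset from its group's seed."""
    return group_seed(base_seed, group_id) + episode_id


g = group_seed(42, 3)
e = episode_seed(42, 3, 2)
```

With `base_seed = 42`, group 3 gets seed 45 and its episode 2 gets seed 47. Because the scheme is purely additive, different (group, episode) pairs can share an episode seed: `episode_seed(42, 0, 1)` and `episode_seed(42, 1, 0)` both evaluate to 43.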

