Principle:Alibaba ROLL Trajectory Collection

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Agentic_AI
Last Updated 2026-02-07 20:00 GMT

Overview

A data collection principle for assembling batches of completed multi-turn trajectories from asynchronous environment workers into training-ready tensors.

Description

Trajectory Collection is the process of gathering completed episodes from environment managers and assembling them into batches suitable for policy gradient training. The key challenge is that trajectories are produced asynchronously with varying lengths, and must be grouped, padded, and annotated with metadata (trajectory IDs, group IDs, environment tags, step scores, episode scores) before training.
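The padding and annotation step described above can be sketched in plain Python. This is a minimal illustration, not ROLL's actual code: the `Trajectory` container, the `collate` function, the field names, and the `pad_id` default are all hypothetical, and nested lists stand in for the tensors a real pipeline would build.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    """Hypothetical container for one completed episode (not ROLL's own type)."""
    traj_id: str
    group_id: int
    env_tag: str
    token_ids: List[int]
    step_scores: List[float]
    episode_score: float


def collate(trajectories: List[Trajectory], pad_id: int = 0) -> Dict[str, list]:
    """Right-pad token sequences to a common length and keep metadata per row."""
    max_len = max(len(t.token_ids) for t in trajectories)
    input_ids, attention_mask = [], []
    for t in trajectories:
        n_pad = max_len - len(t.token_ids)
        input_ids.append(t.token_ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(t.token_ids) + [0] * n_pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "traj_id": [t.traj_id for t in trajectories],
        "group_id": [t.group_id for t in trajectories],
        "env_tag": [t.env_tag for t in trajectories],
        "step_scores": [t.step_scores for t in trajectories],
        "episode_score": [t.episode_score for t in trajectories],
    }


# Two episodes of different lengths from the same (made-up) group.
batch = collate([
    Trajectory("t0", 0, "env_a", [5, 6, 7], [0.0, 1.0], 1.0),
    Trajectory("t1", 0, "env_a", [8], [0.0], 0.0),
])
```

Keeping the metadata as parallel per-row lists means each row of `input_ids` can still be traced back to its trajectory, group, and scores after padding.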

The collection process ensures:

  • Group completeness: All episodes within a group complete before the group is yielded for training
  • Tensor formatting: Variable-length trajectories are padded to a common length, and masks are created to distinguish real positions from padding
  • Metadata preservation: Environment-specific scores and IDs are preserved for advantage computation
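The group completeness guarantee can be illustrated with a small buffer that holds finished episodes until their whole group has arrived. This is a sketch under assumptions: `GroupBuffer`, its `add` method, and a fixed `group_size` are hypothetical names chosen for illustration, not ROLL's API.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Episode = Tuple[int, float]  # (episode_id, episode_score)


class GroupBuffer:
    """Hold finished episodes until every member of their group has arrived."""

    def __init__(self, group_size: int) -> None:
        self.group_size = group_size
        self._pending: Dict[int, List[Episode]] = defaultdict(list)

    def add(self, group_id: int, episode_id: int, score: float) -> Optional[List[Episode]]:
        """Record one finished episode; return the group once complete, else None."""
        self._pending[group_id].append((episode_id, score))
        if len(self._pending[group_id]) == self.group_size:
            # Release the group, ordered by episode_id, and forget it.
            return sorted(self._pending.pop(group_id))
        return None


# Episodes arrive out of order, as they would from asynchronous workers.
buffer = GroupBuffer(group_size=2)
arrivals = [(0, 1, 0.5), (1, 0, 1.0), (0, 0, 0.0), (1, 1, 0.0)]
ready = []
for gid, eid, score in arrivals:
    group = buffer.add(gid, eid, score)
    if group is not None:
        ready.append((gid, group))
```

In this run nothing is released after the first two arrivals, because each group still has one episode outstanding.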

Usage

Use this principle in the data collection phase of agentic RL training, after environment managers produce trajectories and before advantage estimation.

Theoretical Basis

For group-relative advantage estimation (GRPO, GiGPO), collecting multiple trajectories per initial state lets each trajectory's reward be normalized against the statistics of its group:

$$\hat{A}_i = \frac{r_i - \mu_{\text{group}}}{\sigma_{\text{group}}}$$
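As a worked example of the normalization, the sketch below computes group-relative advantages for one group of episode scores. It assumes that r_i is the episode score of trajectory i and that the mean and standard deviation are taken over the scores in its group. The function name, the use of the population standard deviation, and the small `eps` guard against a zero denominator are assumptions made for illustration, not details confirmed by this page.

```python
from statistics import fmean, pstdev
from typing import List


def group_relative_advantages(scores: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each score by the mean and std of its own group."""
    mu = fmean(scores)
    sigma = pstdev(scores)  # population std; an assumption of this sketch
    # eps avoids division by zero when every score in the group is equal.
    return [(r - mu) / (sigma + eps) for r in scores]


# One group of four episodes that started from the same initial state.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

With two successes and two failures, the successful episodes get an advantage close to +1 and the failed ones close to -1. A group whose scores are all equal yields zero advantage for every member.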

The group structure is determined by the seed assignment:

  • group_seed = base_seed + group_id
  • episode_seed = group_seed + episode_id
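The two seed formulas above translate directly into code. The function names and the example values are hypothetical, and how the implementation consumes each seed is not specified on this page.

```python
def group_seed(base_seed: int, group_id: int) -> int:
    """Seed shared by every episode in a group."""
    return base_seed + group_id


def episode_seed(base_seed: int, group_id: int, episode_id: int) -> int:
    """Seed for one episode, offset from its group's seed."""
    return group_seed(base_seed, group_id) + episode_id


# Example values chosen for illustration only.
base = 42
seeds = [episode_seed(base, group_id=3, episode_id=e) for e in range(4)]
```

Read literally, the arithmetic lets episodes in adjacent groups share a seed value: group 0, episode 1 and group 1, episode 0 both come out as `base_seed + 1`.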

Related Pages

Implemented By

Related Heuristics

The following heuristics inform this principle:
