Principle: Alibaba ROLL Agentic Reward Computation
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Agentic_AI |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A multi-level reward computation principle that combines episode-level outcome rewards with step-level progress rewards for agentic RL training.
Description
Agentic Reward Computation addresses a core challenge of multi-turn RL training: reward signals arrive at multiple levels:
- Episode-level rewards: Binary success/failure or score for the entire trajectory (e.g., solved the puzzle)
- Step-level rewards: Intermediate progress signals at each interaction step (e.g., moved closer to goal)
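A minimal sketch of the two reward levels and how they can be blended into one per-step signal. The variable names and blend weights are illustrative assumptions, not ROLL's actual API:

```python
# Illustrative only: one episode-level outcome plus one step-level
# progress reward per interaction turn.
episode_reward = 1.0                  # e.g., puzzle solved
step_rewards = [0.0, 0.2, 0.1, 0.5]   # per-step progress signals

# A simple weighted per-step total (weights w_ep, w_step are assumptions):
w_ep, w_step = 0.7, 0.3
per_step = [w_ep * episode_reward + w_step * r for r in step_rewards]
```

In practice the two levels are normalized before blending (see below); this sketch only shows that every step carries both signals.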
The principle supports three reward computation modes:
- GiGPO mode: Combines episode and step rewards with configurable weights, normalizes step rewards within state groups (same initial state across episodes in a group)
- Step-Reinforce mode: Uses only discounted step-level rewards
- Standard mode: Uses cumulative episode scores with group normalization
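The three modes can be sketched as a single dispatch that produces one raw training signal per step for a single episode. This is a sketch under assumptions (function names, weight parameters, and the deferral of group normalization are mine, not ROLL's code):

```python
import numpy as np

def discounted_returns(step_rewards, gamma):
    """Backward accumulation: R_t = r_t + gamma * R_{t+1}."""
    out, acc = np.zeros(len(step_rewards)), 0.0
    for t in reversed(range(len(step_rewards))):
        acc = step_rewards[t] + gamma * acc
        out[t] = acc
    return out

def raw_step_signal(mode, step_rewards, episode_score,
                    w_ep=1.0, w_step=1.0, gamma=0.95):
    """One raw signal per step; group normalization (across the
    rollout/state group) is applied afterward and omitted here."""
    if mode == "standard":
        # cumulative episode score, broadcast to every step
        return np.full(len(step_rewards), float(episode_score))
    if mode == "step_reinforce":
        # discounted step-level rewards only
        return discounted_returns(step_rewards, gamma)
    if mode == "gigpo":
        # weighted blend of episode outcome and step progress
        return w_ep * episode_score + w_step * discounted_returns(step_rewards, gamma)
    raise ValueError(f"unknown mode: {mode}")
```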
Multi-level reward normalization prevents reward scale mismatch between episode and step signals.
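The scale mismatch arises because episode scores (often 0/1) and accumulated step rewards can differ by orders of magnitude; normalizing each level within its group before blending puts them on a comparable scale. A minimal sketch (the helper name is mine):

```python
import numpy as np

def group_normalize(values, eps=1e-8):
    """Zero-mean, unit-variance normalization within one group.
    eps guards against zero variance (e.g., all-identical scores)."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / (v.std() + eps)

# Binary episode scores vs. much larger accumulated step rewards
# across a group of 4 rollouts:
episode_scores = [0.0, 1.0, 1.0, 0.0]
step_sums = [12.0, 35.0, 40.0, 9.0]
# After normalization both live on a comparable scale:
norm_ep, norm_st = group_normalize(episode_scores), group_normalize(step_sums)
```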
Usage
Use this principle after trajectory collection and before advantage estimation in agentic RL pipelines. The mode is determined by the advantage estimator configuration (gigpo, step_reinforce, or standard).
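Selecting the mode from the estimator configuration can be sketched as below; the config key and function name are hypothetical, chosen to mirror the three mode names in the text:

```python
def select_estimator(config):
    """Pick the reward-computation mode from the trainer config.
    'adv_estimator' is an assumed key name, not necessarily ROLL's."""
    mode = config.get("adv_estimator", "standard")
    if mode not in {"gigpo", "step_reinforce", "standard"}:
        raise ValueError(f"unknown advantage estimator: {mode}")
    return mode
```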
Theoretical Basis
GiGPO Reward Decomposition
The combined per-step signal blends the two levels with configurable weights $\omega_E$ and $\omega_S$:

$$A_t = \omega_E A^E + \omega_S A^S_t$$

where rewards are normalized within groups — episode scores $R_i$ over the rollout group, step returns $R^S_t$ over the state group:

$$A^E_i = \frac{R_i - \mathrm{mean}_j(R_j)}{\mathrm{std}_j(R_j)}, \qquad A^S_t = \frac{R^S_t - \mathrm{mean}(R^S)}{\mathrm{std}(R^S)}$$

Step rewards use discounted returns:

$$R^S_t = \sum_{k=0}^{T-t} \gamma^k \, r_{t+k}$$
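The distinctive GiGPO step, normalizing discounted step returns within groups of steps that share the same state, can be sketched as follows (the grouping key and function name are illustrative assumptions):

```python
from collections import defaultdict
import numpy as np

def state_group_advantages(states, returns, eps=1e-8):
    """Normalize discounted step returns within groups of steps that
    share the same (hashable) state, yielding step-level advantages."""
    groups = defaultdict(list)
    for i, s in enumerate(states):
        groups[s].append(i)
    r = np.asarray(returns, dtype=float)
    adv = np.zeros(len(r))
    for idx in groups.values():
        vals = r[idx]
        # zero-mean, unit-variance within the state group
        adv[idx] = (vals - vals.mean()) / (vals.std() + eps)
    return adv
```

Steps that revisit the same state across episodes in a group are compared directly against each other, which is what makes the step-level signal relative progress rather than an absolute return.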
Related Pages
Implemented By
Related Heuristics
The following heuristics inform this principle: