Principle: Alibaba ROLL Agentic Reward Computation
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Agentic_AI |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A multi-level reward computation principle that combines episode-level outcome rewards with step-level progress rewards for agentic RL training.
Description
Agentic Reward Computation addresses a core challenge of multi-turn RL training: reward signals arrive at multiple levels:
- Episode-level rewards: Binary success/failure or score for the entire trajectory (e.g., solved the puzzle)
- Step-level rewards: Intermediate progress signals at each interaction step (e.g., moved closer to goal)
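A minimal sketch of the two reward levels and how they can be blended into one per-step signal. The variable names and blend weights are illustrative assumptions, not ROLL's actual API:

```python
# Illustrative only: one episode-level outcome plus one step-level
# progress reward per interaction turn.
episode_reward = 1.0                  # e.g., puzzle solved
step_rewards = [0.0, 0.2, 0.1, 0.5]   # per-step progress signals

# A simple weighted per-step total (weights w_ep, w_step are assumptions):
w_ep, w_step = 0.7, 0.3
per_step = [w_ep * episode_reward + w_step * r for r in step_rewards]
```

In practice the two levels are normalized before blending (see below); this sketch only shows that every step carries both signals.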
The principle supports three reward computation modes:
- GiGPO mode: Combines episode and step rewards with configurable weights, normalizes step rewards within state groups (same initial state across episodes in a group)
- Step-Reinforce mode: Uses only discounted step-level rewards
- Standard mode: Uses cumulative episode scores with group normalization
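The three modes can be sketched as a single dispatch that produces one raw training signal per step for a single episode. This is a sketch under assumptions (function names, weight parameters, and the deferral of group normalization are mine, not ROLL's code):

```python
import numpy as np

def discounted_returns(step_rewards, gamma):
    """Backward accumulation: R_t = r_t + gamma * R_{t+1}."""
    out, acc = np.zeros(len(step_rewards)), 0.0
    for t in reversed(range(len(step_rewards))):
        acc = step_rewards[t] + gamma * acc
        out[t] = acc
    return out

def raw_step_signal(mode, step_rewards, episode_score,
                    w_ep=1.0, w_step=1.0, gamma=0.95):
    """One raw signal per step; group normalization (across the
    rollout/state group) is applied afterward and omitted here."""
    if mode == "standard":
        # cumulative episode score, broadcast to every step
        return np.full(len(step_rewards), float(episode_score))
    if mode == "step_reinforce":
        # discounted step-level rewards only
        return discounted_returns(step_rewards, gamma)
    if mode == "gigpo":
        # weighted blend of episode outcome and step progress
        return w_ep * episode_score + w_step * discounted_returns(step_rewards, gamma)
    raise ValueError(f"unknown mode: {mode}")
```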
Multi-level reward normalization prevents reward scale mismatch between episode and step signals.
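The scale mismatch arises because episode scores (often 0/1) and accumulated step rewards can differ by orders of magnitude; normalizing each level within its group before blending puts them on a comparable scale. A minimal sketch (the helper name is mine):

```python
import numpy as np

def group_normalize(values, eps=1e-8):
    """Zero-mean, unit-variance normalization within one group.
    eps guards against zero variance (e.g., all-identical scores)."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / (v.std() + eps)

# Binary episode scores vs. much larger accumulated step rewards
# across a group of 4 rollouts:
episode_scores = [0.0, 1.0, 1.0, 0.0]
step_sums = [12.0, 35.0, 40.0, 9.0]
# After normalization both live on a comparable scale:
norm_ep, norm_st = group_normalize(episode_scores), group_normalize(step_sums)
```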
Usage
Use this principle after trajectory collection and before advantage estimation in agentic RL pipelines. The mode is determined by the advantage estimator configuration (gigpo, step_reinforce, or standard).
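Selecting the mode from the estimator configuration can be sketched as below; the config key and function name are hypothetical, chosen to mirror the three mode names in the text:

```python
def select_estimator(config):
    """Pick the reward-computation mode from the trainer config.
    'adv_estimator' is an assumed key name, not necessarily ROLL's."""
    mode = config.get("adv_estimator", "standard")
    if mode not in {"gigpo", "step_reinforce", "standard"}:
        raise ValueError(f"unknown advantage estimator: {mode}")
    return mode
```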
Theoretical Basis
GiGPO Reward Decomposition
The combined per-step signal blends the two levels with configurable weights $\omega_E$ and $\omega_S$:

$$A_t = \omega_E A^E + \omega_S A^S_t$$

where rewards are normalized within groups — episode scores $R_i$ over the rollout group, step returns $R^S_t$ over the state group:

$$A^E_i = \frac{R_i - \mathrm{mean}_j(R_j)}{\mathrm{std}_j(R_j)}, \qquad A^S_t = \frac{R^S_t - \mathrm{mean}(R^S)}{\mathrm{std}(R^S)}$$

Step rewards use discounted returns:

$$R^S_t = \sum_{k=0}^{T-t} \gamma^k \, r_{t+k}$$
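The distinctive GiGPO step, normalizing discounted step returns within groups of steps that share the same state, can be sketched as follows (the grouping key and function name are illustrative assumptions):

```python
from collections import defaultdict
import numpy as np

def state_group_advantages(states, returns, eps=1e-8):
    """Normalize discounted step returns within groups of steps that
    share the same (hashable) state, yielding step-level advantages."""
    groups = defaultdict(list)
    for i, s in enumerate(states):
        groups[s].append(i)
    r = np.asarray(returns, dtype=float)
    adv = np.zeros(len(r))
    for idx in groups.values():
        vals = r[idx]
        # zero-mean, unit-variance within the state group
        adv[idx] = (vals - vals.mean()) / (vals.std() + eps)
    return adv
```

Steps that revisit the same state across episodes in a group are compared directly against each other, which is what makes the step-level signal relative progress rather than an absolute return.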
Related Pages
Implemented By
Related Heuristics
The following heuristics inform this principle: