
Principle:Alibaba ROLL Agentic Reward Computation

From Leeroopedia


Knowledge Sources
Domains: Reinforcement_Learning, Agentic_AI
Last Updated: 2026-02-07 20:00 GMT

Overview

A multi-level reward computation principle that combines episode-level outcome rewards with step-level progress rewards for agentic RL training.

Description

Agentic Reward Computation addresses a challenge specific to multi-turn RL training: reward signals arrive at multiple levels:

  • Episode-level rewards: Binary success/failure or score for the entire trajectory (e.g., solved the puzzle)
  • Step-level rewards: Intermediate progress signals at each interaction step (e.g., moved closer to goal)
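
The two reward levels above can be carried together on each rollout. A minimal sketch of such a container; the class and field names are illustrative, not ROLL's actual data structures:

```python
from dataclasses import dataclass

# Hypothetical container for one rollout's rewards at both levels.
# Names are illustrative assumptions, not ROLL's API.
@dataclass
class Trajectory:
    step_rewards: list[float]    # one intermediate progress signal per step
    episode_reward: float = 0.0  # outcome score for the whole trajectory

# e.g. three steps of partial progress, plus a success bonus at the end
traj = Trajectory(step_rewards=[0.1, 0.0, 0.3], episode_reward=1.0)
```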

The principle supports three reward computation modes:

  1. GiGPO mode: Combines episode and step rewards with configurable weights, normalizes step rewards within state groups (same initial state across episodes in a group)
  2. Step-Reinforce mode: Uses only discounted step-level rewards
  3. Standard mode: Uses cumulative episode scores with group normalization
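
A minimal sketch of how the three modes might branch. The function signature and weights are assumptions for illustration, not ROLL's actual API, and the GiGPO branch normalizes step rewards over the whole group as a simplified stand-in for state-group normalization:

```python
import statistics

def group_normalize(xs):
    """Zero-mean / unit-std normalization within one group."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs) or 1.0  # guard against zero variance
    return [(x - mu) / sigma for x in xs]

def discounted(rewards, gamma):
    """Discounted return G_t for each step, computed right to left."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def compute_rewards(mode, episode_scores, step_rewards,
                    w_ep=1.0, w_step=1.0, gamma=0.99):
    """Illustrative dispatch over the three modes (not ROLL's API)."""
    if mode == "gigpo":
        # Weighted combination of normalized episode and step signals.
        # Simplification: step rewards normalized over the whole group,
        # standing in for GiGPO's state-group normalization.
        ep_norm = group_normalize(episode_scores)
        flat = [r for steps in step_rewards for r in steps]
        mu = statistics.mean(flat)
        sigma = statistics.pstdev(flat) or 1.0
        return [[w_ep * e + w_step * (r - mu) / sigma for r in steps]
                for e, steps in zip(ep_norm, step_rewards)]
    if mode == "step_reinforce":
        # Only discounted step-level returns.
        return [discounted(steps, gamma) for steps in step_rewards]
    if mode == "standard":
        # Cumulative episode scores with group normalization.
        return group_normalize(episode_scores)
    raise ValueError(f"unknown mode: {mode}")
```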

Multi-level reward normalization prevents reward scale mismatch between episode and step signals.

Usage

Use this principle after trajectory collection and before advantage estimation in agentic RL pipelines. The mode is determined by the advantage estimator configuration (gigpo, step_reinforce, or standard).
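
A hypothetical configuration fragment showing mode selection via the advantage estimator setting; the key names here are illustrative assumptions, not ROLL's actual schema:

```python
# Hypothetical config fragment; key names are illustrative, not ROLL's schema.
config = {
    "advantage_estimator": "gigpo",   # or "step_reinforce" / "standard"
    "episode_reward_weight": 1.0,     # w_episode in GiGPO mode
    "step_reward_weight": 1.0,        # w_step in GiGPO mode
    "gamma": 0.99,                    # discount for step-level returns
}
```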

Theoretical Basis

GiGPO Reward Decomposition

$$r_i = w_{\text{episode}} \, \hat{r}_{\text{episode},i} + w_{\text{step}} \, \hat{r}_{\text{step},i}$$

Where episode rewards are normalized within groups:

$$\hat{r}_{\text{episode},i} = \frac{r_{\text{episode},i} - \mu_{\text{group}}}{\sigma_{\text{group}}}$$

Step rewards use discounted returns:

$$G_t = \sum_{k=0}^{T-t} \gamma^k \, r_{t+k}$$

Related Pages

Implemented By

Related Heuristics

