
Implementation:Alibaba ROLL Compute Response Level Rewards

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Agentic_AI
Last Updated 2026-02-07 20:00 GMT

Overview

Concrete multi-level reward-computation functions for agentic RL training, provided by the Alibaba ROLL library.

Description

The compute_response_level_rewards and compute_discounted_returns functions implement multi-level reward computation for agentic training. compute_discounted_returns converts step scores into discounted returns per trajectory. compute_response_level_rewards combines episode and step rewards with configurable weights and normalization, supporting GiGPO, step-reinforce, and standard modes.
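The discounted-return conversion follows the standard recurrence G_t = r_t + gamma * G_{t+1}, applied backwards over each trajectory's step scores. A minimal sketch of that recurrence (plain Python lists for illustration; ROLL's compute_discounted_returns operates on batched DataProto tensors):

```python
def discounted_returns(step_scores, gamma=1.0):
    """Compute G_t = r_t + gamma * G_{t+1} for one trajectory.

    Illustrative only: the documented function handles batching and
    the gigpo / step_reinforce estimator gating on top of this.
    """
    returns = [0.0] * len(step_scores)
    running = 0.0
    # Walk backwards so each step accumulates the discounted sum of its successors.
    for t in reversed(range(len(step_scores))):
        running = step_scores[t] + gamma * running
        returns[t] = running
    return returns

# With gamma = 0.5, scores [1, 0, 2] yield returns [1.5, 1.0, 2.0].
```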

Usage

Called by the agentic pipeline after trajectory collection and before advantage estimation.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/pipeline/agentic/utils.py
  • Lines: L59-233

Signature

@torch.no_grad()
def compute_discounted_returns(
    batch: DataProto,
    adv_estimator: str,
    gamma: float = 1.0
) -> DataProto:
    """
    Compute discounted returns for each trajectory.

    Args:
        batch: DataProto with step_scores per trajectory
        adv_estimator: Only "gigpo" or "step_reinforce" triggers computation
        gamma: Discount factor (default 1.0)

    Returns:
        DataProto with step_rewards (discounted returns per step)
    """

@torch.no_grad()
def compute_response_level_rewards(
    batch: DataProto,
    pipeline_config: AgenticConfig
) -> Tuple[DataProto, Dict]:
    """
    Compute response-level rewards with multi-level normalization.

    Args:
        batch: DataProto with scores, step_rewards, episode_scores
        pipeline_config: AgenticConfig with reward weights and normalization

    Returns:
        (DataProto with response_level_rewards, metrics dict)
    """

Import

from roll.pipeline.agentic.utils import compute_discounted_returns, compute_response_level_rewards

I/O Contract

Inputs

  • batch (DataProto, required): trajectories with scores, step_scores, episode_scores, traj_group_id
  • pipeline_config (AgenticConfig, required): config with episode_reward_weight, step_reward_weight, step_reward_gamma
  • adv_estimator (str, required): advantage estimator type (gigpo/step_reinforce/other)

Outputs

  • response_level_rewards (torch.Tensor): combined and normalized per-sample rewards
  • step_rewards (torch.Tensor): discounted returns per step (for GiGPO/step_reinforce)
  • metrics (Dict): reward statistics and normalization metrics
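The combination of episode- and step-level rewards under the config weights can be sketched as follows. The parameter names episode_reward_weight and step_reward_weight come from the inputs above; the batch-standardization step shown is an assumed normalization scheme for illustration, not taken verbatim from ROLL's source:

```python
from statistics import mean, pstdev

def combine_rewards(episode_scores, step_returns,
                    episode_weight=1.0, step_weight=1.0, normalize=True):
    """Blend episode-level and step-level rewards into one per-sample list.

    Illustrative sketch only: ROLL's compute_response_level_rewards works
    on DataProto tensors and additionally handles GiGPO / step-reinforce
    modes and metric reporting; the standardization here is assumed.
    """
    rewards = [episode_weight * e + step_weight * s
               for e, s in zip(episode_scores, step_returns)]
    if normalize:
        # Standardize across the batch (assumed scheme).
        mu, sigma = mean(rewards), pstdev(rewards)
        rewards = [(r - mu) / (sigma + 1e-8) for r in rewards]
    return rewards
```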

Usage Examples

from roll.pipeline.agentic.utils import compute_discounted_returns, compute_response_level_rewards

# Step 1: Compute discounted returns from step scores
batch = compute_discounted_returns(batch, adv_estimator="gigpo", gamma=0.99)

# Step 2: Compute combined response-level rewards
batch, reward_metrics = compute_response_level_rewards(batch, agentic_config)

# Access rewards
rewards = batch.batch["response_level_rewards"]
