# Implementation: Alibaba ROLL Compute Response Level Rewards

| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Agentic_AI |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview

Concrete multi-level reward computation functions for agentic RL training, provided by the Alibaba ROLL library.

## Description

The `compute_response_level_rewards` and `compute_discounted_returns` functions implement multi-level reward computation for agentic training. `compute_discounted_returns` converts per-step scores into discounted returns for each trajectory. `compute_response_level_rewards` combines episode-level and step-level rewards using configurable weights and normalization, and supports GiGPO, step-reinforce, and standard modes.

## Usage

Called by the agentic pipeline after trajectory collection and before advantage estimation.
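The weighted combination of episode and step rewards can be sketched as follows. This is a minimal illustration under stated assumptions: `combine_rewards_sketch` and its argument names are hypothetical, not ROLL's actual internals.

```python
import torch

def combine_rewards_sketch(
    episode_scores: torch.Tensor,   # per-sample episode-level scores
    step_rewards: torch.Tensor,     # per-sample step-level rewards (e.g. discounted returns)
    episode_weight: float = 1.0,    # hypothetical analogue of episode_reward_weight
    step_weight: float = 0.0,       # hypothetical analogue of step_reward_weight
) -> torch.Tensor:
    """Weighted sum of episode- and step-level rewards (sketch only)."""
    return episode_weight * episode_scores + step_weight * step_rewards

# Example: blend two reward levels with an 80/20 weighting
rewards = combine_rewards_sketch(
    torch.tensor([1.0, 0.0]), torch.tensor([0.5, 0.2]),
    episode_weight=0.8, step_weight=0.2,
)
```

In the real function the weights come from the `AgenticConfig`, and normalization is applied after the combination.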
## Code Reference

### Source Location

- Repository: Alibaba ROLL
- File: `roll/pipeline/agentic/utils.py`
- Lines: L59-233

### Signature
```python
@torch.no_grad()
def compute_discounted_returns(
    batch: DataProto,
    adv_estimator: str,
    gamma: float = 1.0
) -> DataProto:
    """
    Compute discounted returns for each trajectory.

    Args:
        batch: DataProto with step_scores per trajectory.
        adv_estimator: Only "gigpo" or "step_reinforce" triggers computation.
        gamma: Discount factor (default 1.0).

    Returns:
        DataProto with step_rewards (discounted returns per step).
    """
```
```python
@torch.no_grad()
def compute_response_level_rewards(
    batch: DataProto,
    pipeline_config: AgenticConfig
) -> Tuple[DataProto, Dict]:
    """
    Compute response-level rewards with multi-level normalization.

    Args:
        batch: DataProto with scores, step_rewards, episode_scores.
        pipeline_config: AgenticConfig with reward weights and normalization.

    Returns:
        (DataProto with response_level_rewards, metrics dict)
    """
```
### Import

```python
from roll.pipeline.agentic.utils import compute_discounted_returns, compute_response_level_rewards
```
## I/O Contract

### Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| batch | DataProto | Yes | Trajectories with scores, step_scores, episode_scores, traj_group_id |
| pipeline_config | AgenticConfig | Yes | Config with episode_reward_weight, step_reward_weight, step_reward_gamma |
| adv_estimator | str | Yes | Advantage estimator type (gigpo/step_reinforce/other) |

### Outputs

| Name | Type | Description |
|---|---|---|
| response_level_rewards | torch.Tensor | Combined and normalized per-sample rewards |
| step_rewards | torch.Tensor | Discounted returns per step (for GiGPO/step_reinforce) |
| metrics | Dict | Reward statistics and normalization metrics |
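Since trajectories carry a `traj_group_id`, the normalization step can be illustrated with a per-group standardization in the GRPO style. This is a hedged sketch; the exact normalization ROLL applies is configurable and may differ.

```python
import torch

def group_normalize(rewards: torch.Tensor, group_ids: list, eps: float = 1e-6) -> torch.Tensor:
    """Standardize rewards within each trajectory group (zero mean, unit std).

    Sketch only: `group_ids` stands in for traj_group_id; ROLL's actual
    normalization is driven by its pipeline config.
    """
    out = torch.empty_like(rewards)
    for g in set(group_ids):
        # Boolean mask selecting all samples belonging to group g
        mask = torch.tensor([gid == g for gid in group_ids])
        vals = rewards[mask]
        out[mask] = (vals - vals.mean()) / (vals.std(unbiased=False) + eps)
    return out

# Two groups of two samples each; each group is standardized independently
normed = group_normalize(torch.tensor([1.0, 0.0, 2.0, 0.0]),
                         ["a", "a", "b", "b"])
```

Per-group standardization makes rewards comparable across groups whose raw score scales differ, which is the usual motivation for group-relative baselines in agentic RL.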
## Usage Examples

```python
from roll.pipeline.agentic.utils import compute_discounted_returns, compute_response_level_rewards

# Step 1: Compute discounted returns from step scores
batch = compute_discounted_returns(batch, adv_estimator="gigpo", gamma=0.99)

# Step 2: Compute combined response-level rewards
batch, reward_metrics = compute_response_level_rewards(batch, agentic_config)

# Access rewards
rewards = batch.batch["response_level_rewards"]
```
## Related Pages

- Implements Principle
- Requires Environment

## Environment Dependencies

This implementation requires the following environment constraints:

## Heuristics Applied

This implementation uses the following heuristics: