# Implementation: Alibaba ROLL Compute Response Level Rewards

| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Agentic_AI |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview

Concrete multi-level reward computation functions for agentic RL training, provided by the Alibaba ROLL library.

## Description

The `compute_response_level_rewards` and `compute_discounted_returns` functions implement multi-level reward computation for agentic training. `compute_discounted_returns` converts per-step scores into discounted returns for each trajectory. `compute_response_level_rewards` combines episode-level and step-level rewards using configurable weights and normalization, and supports GiGPO, step-reinforce, and standard modes.

## Usage

Called by the agentic pipeline after trajectory collection and before advantage estimation.
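The weighted combination of episode and step rewards can be sketched as follows. This is a minimal illustration under stated assumptions: `combine_rewards_sketch` and its argument names are hypothetical, not ROLL's actual internals.

```python
import torch

def combine_rewards_sketch(
    episode_scores: torch.Tensor,   # per-sample episode-level scores
    step_rewards: torch.Tensor,     # per-sample step-level rewards (e.g. discounted returns)
    episode_weight: float = 1.0,    # hypothetical analogue of episode_reward_weight
    step_weight: float = 0.0,       # hypothetical analogue of step_reward_weight
) -> torch.Tensor:
    """Weighted sum of episode- and step-level rewards (sketch only)."""
    return episode_weight * episode_scores + step_weight * step_rewards

# Example: blend two reward levels with an 80/20 weighting
rewards = combine_rewards_sketch(
    torch.tensor([1.0, 0.0]), torch.tensor([0.5, 0.2]),
    episode_weight=0.8, step_weight=0.2,
)
```

In the real function the weights come from the `AgenticConfig`, and normalization is applied after the combination.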
## Code Reference

### Source Location

- Repository: Alibaba ROLL
- File: `roll/pipeline/agentic/utils.py`
- Lines: L59-233

### Signature
```python
@torch.no_grad()
def compute_discounted_returns(
    batch: DataProto,
    adv_estimator: str,
    gamma: float = 1.0
) -> DataProto:
    """
    Compute discounted returns for each trajectory.

    Args:
        batch: DataProto with step_scores per trajectory.
        adv_estimator: Only "gigpo" or "step_reinforce" triggers computation.
        gamma: Discount factor (default 1.0).

    Returns:
        DataProto with step_rewards (discounted returns per step).
    """
```
```python
@torch.no_grad()
def compute_response_level_rewards(
    batch: DataProto,
    pipeline_config: AgenticConfig
) -> Tuple[DataProto, Dict]:
    """
    Compute response-level rewards with multi-level normalization.

    Args:
        batch: DataProto with scores, step_rewards, episode_scores.
        pipeline_config: AgenticConfig with reward weights and normalization.

    Returns:
        (DataProto with response_level_rewards, metrics dict)
    """
```
### Import

```python
from roll.pipeline.agentic.utils import compute_discounted_returns, compute_response_level_rewards
```
## I/O Contract

### Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| batch | DataProto | Yes | Trajectories with scores, step_scores, episode_scores, traj_group_id |
| pipeline_config | AgenticConfig | Yes | Config with episode_reward_weight, step_reward_weight, step_reward_gamma |
| adv_estimator | str | Yes | Advantage estimator type (gigpo/step_reinforce/other) |

### Outputs

| Name | Type | Description |
|---|---|---|
| response_level_rewards | torch.Tensor | Combined and normalized per-sample rewards |
| step_rewards | torch.Tensor | Discounted returns per step (for GiGPO/step_reinforce) |
| metrics | Dict | Reward statistics and normalization metrics |
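Since trajectories carry a `traj_group_id`, the normalization step can be illustrated with a per-group standardization in the GRPO style. This is a hedged sketch; the exact normalization ROLL applies is configurable and may differ.

```python
import torch

def group_normalize(rewards: torch.Tensor, group_ids: list, eps: float = 1e-6) -> torch.Tensor:
    """Standardize rewards within each trajectory group (zero mean, unit std).

    Sketch only: `group_ids` stands in for traj_group_id; ROLL's actual
    normalization is driven by its pipeline config.
    """
    out = torch.empty_like(rewards)
    for g in set(group_ids):
        # Boolean mask selecting all samples belonging to group g
        mask = torch.tensor([gid == g for gid in group_ids])
        vals = rewards[mask]
        out[mask] = (vals - vals.mean()) / (vals.std(unbiased=False) + eps)
    return out

# Two groups of two samples each; each group is standardized independently
normed = group_normalize(torch.tensor([1.0, 0.0, 2.0, 0.0]),
                         ["a", "a", "b", "b"])
```

Per-group standardization makes rewards comparable across groups whose raw score scales differ, which is the usual motivation for group-relative baselines in agentic RL.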
## Usage Examples

```python
from roll.pipeline.agentic.utils import compute_discounted_returns, compute_response_level_rewards

# Step 1: Compute discounted returns from step scores
batch = compute_discounted_returns(batch, adv_estimator="gigpo", gamma=0.99)

# Step 2: Compute combined response-level rewards
batch, reward_metrics = compute_response_level_rewards(batch, agentic_config)

# Access rewards
rewards = batch.batch["response_level_rewards"]
```
## Related Pages

- Implements Principle
- Requires Environment

## Environment Dependencies

This implementation requires the following environment constraints:

## Heuristics Applied

This implementation uses the following heuristics: