Implementation:Volcengine Verl RewardManager Compute Reward
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Reward_Computation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Concrete tools for computing rule-based rewards using pluggable reward manager classes and custom scoring functions, provided by the verl library.
Description
The compute_reward function is a thin orchestration wrapper that invokes an AbstractRewardManager instance on a batch of data. It first calls the reward manager with return_dict=True to obtain both a reward tensor and optional extra info; if that call raises an error, it falls back to calling the manager without return_dict, which returns only the tensor. The load_reward_manager function is the factory that constructs the appropriate reward manager from the configuration. It loads a custom reward function (if one is configured), resolves the reward manager class through the registry or importlib, optionally configures sandbox fusion for code-execution scoring, and returns an initialized manager instance.
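The try-then-fall-back behavior described above can be sketched without verl installed. This is a minimal illustration, not the library's code: ToyRewardManager, LegacyRewardManager, and compute_reward_sketch are hypothetical names, the dictionary keys "reward_tensor" and "reward_extra_info" are assumptions, and plain Python lists stand in for tensors and DataProto.

```python
from typing import Any


class ToyRewardManager:
    """Hypothetical stand-in for an AbstractRewardManager implementation."""

    def __call__(self, data: list[str], return_dict: bool = False):
        # One scalar per sample here; a real manager returns a per-token tensor.
        rewards = [float(len(text) > 0) for text in data]
        if return_dict:
            # Key names are assumptions made for this sketch.
            return {
                "reward_tensor": rewards,
                "reward_extra_info": {"nonempty": rewards},
            }
        return rewards


class LegacyRewardManager:
    """Hypothetical manager that only supports the plain return value."""

    def __call__(self, data: list[str]):
        return [0.5 for _ in data]


def compute_reward_sketch(data: Any, reward_fn: Any) -> tuple[Any, dict[str, Any]]:
    """Try the dict-returning call first, then fall back to the plain call."""
    try:
        result = reward_fn(data, return_dict=True)
        reward = result["reward_tensor"]
        extra = result.get("reward_extra_info", {})
    except Exception:
        # The manager does not accept return_dict (or failed while using it).
        reward = reward_fn(data)
        extra = {}
    return reward, extra
```

Calling compute_reward_sketch with LegacyRewardManager takes the fallback path, because passing return_dict=True to its __call__ raises a TypeError; the caller still receives a two-element tuple either way.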
Usage
These functions are used in the PPO training loop to compute rewards for generated responses. load_reward_manager is called once during trainer initialization, while compute_reward is called at every training step after rollout generation.
Code Reference
Source Location
- Repository: verl
- File: verl/trainer/ppo/reward.py
- Lines: 99-197
Signature
def load_reward_manager(
    config: DictConfig,
    tokenizer: Any,
    num_examine: int,
    **reward_kwargs: Any,
) -> AbstractRewardManager:
    """
    Load and initialize a reward manager based on the configuration.

    Args:
        config: PPO trainer configuration object containing reward_model fields.
        tokenizer: Tokenizer object used for processing text.
        num_examine: Number of samples to examine.
        **reward_kwargs: Additional keyword arguments for the reward manager.

    Returns:
        An instance of the specified reward manager class.
    """

@tqbridge(put_data=False)
def compute_reward(
    data: DataProto,
    reward_fn: AbstractRewardManager,
) -> tuple[torch.Tensor, dict[str, Any]]:
    """
    Compute reward for a batch of data.

    Args:
        data: DataProto object containing the input data.
        reward_fn: Reward function (AbstractRewardManager instance) to compute the reward.

    Returns:
        Tuple of reward tensor and extra info dictionary.
    """
Import
from verl.trainer.ppo.reward import compute_reward, load_reward_manager
I/O Contract
Inputs (load_reward_manager)
| Name | Type | Required | Description |
|---|---|---|---|
| config | DictConfig | Yes | PPO trainer configuration containing reward_manager and reward_model settings |
| tokenizer | Any | Yes | Tokenizer for text processing |
| num_examine | int | Yes | Number of samples to examine/log for debugging |
| **reward_kwargs | Any | No | Additional keyword arguments passed to the reward manager constructor |
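To make the factory role of these parameters concrete, here is a minimal sketch of registry-first, importlib-second class resolution, as the Description outlines. It is an illustration under stated assumptions, not verl's implementation: REWARD_MANAGER_REGISTRY, register, ToyManager, resolve_manager_class, and load_manager_sketch are hypothetical names, and a plain dict with a "reward_manager" key stands in for the DictConfig.

```python
import importlib
from typing import Any, Callable

# Hypothetical registry mapping short names to manager classes.
REWARD_MANAGER_REGISTRY: dict[str, type] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator that records a manager class under a short name."""
    def decorator(cls: type) -> type:
        REWARD_MANAGER_REGISTRY[name] = cls
        return cls
    return decorator


@register("toy")
class ToyManager:
    """Hypothetical manager; it only stores its constructor arguments."""

    def __init__(self, tokenizer: Any, num_examine: int, **reward_kwargs: Any):
        self.tokenizer = tokenizer
        self.num_examine = num_examine
        self.reward_kwargs = reward_kwargs


def resolve_manager_class(name: str) -> type:
    """Look the name up in the registry, else treat it as 'module.ClassName'."""
    if name in REWARD_MANAGER_REGISTRY:
        return REWARD_MANAGER_REGISTRY[name]
    module_path, _, class_name = name.rpartition(".")
    if not module_path:
        raise ValueError(f"Unknown reward manager: {name!r}")
    return getattr(importlib.import_module(module_path), class_name)


def load_manager_sketch(config: dict, tokenizer: Any, num_examine: int, **reward_kwargs: Any):
    """Resolve the class named in the config and construct it."""
    cls = resolve_manager_class(config["reward_manager"])
    return cls(tokenizer=tokenizer, num_examine=num_examine, **reward_kwargs)
```

Resolving by short name keeps configuration files terse, while the dotted-path fallback lets a project supply its own manager class without registering it first.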
Inputs (compute_reward)
| Name | Type | Required | Description |
|---|---|---|---|
| data | DataProto | Yes | Batch data containing prompts, responses, and metadata |
| reward_fn | AbstractRewardManager | Yes | Initialized reward manager instance |
Outputs (compute_reward)
| Name | Type | Description |
|---|---|---|
| reward_tensor | torch.Tensor | Per-token reward tensor of shape (batch_size, sequence_length), where sequence_length is the length of the response |
| reward_extra_infos_dict | dict[str, Any] | Dictionary of extra reward information (e.g., per-metric scores) |
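A caller typically reduces these two outputs before logging them. The sketch below is only an illustration: it uses nested Python lists in place of a torch.Tensor, and it assumes that each extra-info entry is a list with one number per sample. sequence_scores and summarize_extra_info are hypothetical helper names.

```python
from statistics import mean


def sequence_scores(reward_rows: list[list[float]]) -> list[float]:
    """Collapse per-token rewards into one score per sample by summing each row."""
    return [sum(row) for row in reward_rows]


def summarize_extra_info(extra: dict[str, list[float]]) -> dict[str, float]:
    """Average each non-empty per-sample metric list for logging."""
    return {name: mean(values) for name, values in extra.items() if values}
```

With a real tensor, the row-wise sum would be written as reward_tensor.sum(dim=-1).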
Usage Examples
from omegaconf import OmegaConf
from verl.trainer.ppo.reward import compute_reward, load_reward_manager

# During trainer initialization
config = OmegaConf.load("config.yaml")
tokenizer = ...  # HuggingFace tokenizer
reward_manager = load_reward_manager(
    config=config,
    tokenizer=tokenizer,
    num_examine=3,
)

# During each training step, after rollout generation
# data is a DataProto containing generated responses
data = ...  # DataProto with batch of prompts + responses
reward_tensor, extra_info = compute_reward(
    data=data,
    reward_fn=reward_manager,
)
# reward_tensor shape: (batch_size, sequence_length)
# extra_info may contain per-metric breakdowns
Custom Reward Function Configuration
# In YAML config:
# custom_reward_function:
#   path: /path/to/my_reward.py
#   name: compute_score
#   reward_kwargs:
#     strict_format: true

# The custom function signature should be:
def compute_score(data_source, solution_str, ground_truth, extra_info=None, **kwargs):
    """Custom reward scoring function."""
    # Placeholder rule (exact string match); replace with task-specific logic.
    score = 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0
    # Return a float score
    return score
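As a fuller illustration of the signature above, the following sketch shows one way a scoring function might honor a strict_format option. It assumes that entries under reward_kwargs are forwarded to the function as keyword arguments; the "Answer:" convention and the regular expression are hypothetical choices made for this example, not something verl prescribes.

```python
import re

# Hypothetical convention: the final answer follows the literal text "Answer:".
_ANSWER_PATTERN = re.compile(r"Answer:\s*(.+)")


def compute_score(data_source, solution_str, ground_truth, extra_info=None,
                  strict_format=False, **kwargs):
    """Return 1.0 for a correct answer and 0.0 otherwise."""
    match = _ANSWER_PATTERN.search(solution_str)
    if match is None:
        if strict_format:
            # Strict mode rejects responses that lack the "Answer:" marker.
            return 0.0
        # Lenient mode treats the whole response as the candidate answer.
        candidate = solution_str.strip()
    else:
        candidate = match.group(1).strip()
    return 1.0 if candidate == str(ground_truth).strip() else 0.0
```

Accepting **kwargs alongside the named option keeps the function tolerant of extra configuration keys it does not use.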