Implementation:Volcengine Verl RewardManager Compute Reward

Knowledge Sources	verl
Domains	Reinforcement_Learning, Reward_Computation
Last Updated	2026-02-07 14:00 GMT

Overview

Concrete tools for computing rule-based rewards using pluggable reward manager classes and custom scoring functions, provided by the verl library.

Description

The compute_reward function is a thin orchestration wrapper that invokes an AbstractRewardManager instance on a batch of data. It first calls the reward manager with return_dict=True to obtain both a reward tensor and optional extra info. If that call raises an error, it falls back to calling the manager for a plain tensor return.

The load_reward_manager function is the factory that constructs the appropriate reward manager from the configuration. It loads a custom reward function (if one is configured), resolves the reward manager class via registration or importlib, optionally configures sandbox fusion for code execution scoring, and returns an initialized manager instance.
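The fallback behavior described above can be illustrated with a minimal sketch. It is not the verl source: the two toy manager classes, the plain-list "tensor", and the dictionary keys "reward_tensor" and "reward_extra_info" are assumptions made for illustration only.

```python
from typing import Any


class ToyRewardManager:
    """Hypothetical stand-in for an AbstractRewardManager that supports return_dict."""

    def __call__(self, data: dict[str, Any], return_dict: bool = False):
        # Toy rule: the reward is the response length in characters.
        scores = [float(len(r)) for r in data["responses"]]
        if return_dict:
            # Key names are assumed for this sketch.
            return {"reward_tensor": scores, "reward_extra_info": {"length": scores}}
        return scores


class LegacyRewardManager:
    """Hypothetical manager that only supports the plain return form."""

    def __call__(self, data: dict[str, Any]):
        return [0.0 for _ in data["responses"]]


def compute_reward_sketch(data: dict[str, Any], reward_fn) -> tuple[list[float], dict[str, Any]]:
    """Try the dict-returning call first, then fall back to the plain call."""
    try:
        result = reward_fn(data, return_dict=True)
        reward = result["reward_tensor"]
        extra = result.get("reward_extra_info", {})
    except Exception:
        # Fallback path: the manager only yields the reward itself.
        reward = reward_fn(data)
        extra = {}
    return reward, extra
```

With LegacyRewardManager, the first call raises a TypeError because return_dict is not accepted, so the sketch takes the fallback path and returns an empty extra-info dictionary.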

Usage

These functions are used in the PPO training loop to compute rewards for generated responses. load_reward_manager is called once during trainer initialization, while compute_reward is called at every training step after rollout generation.

Code Reference

Source Location

Repository: verl
File: verl/trainer/ppo/reward.py
Lines: 99-197

Signature

def load_reward_manager(
    config: DictConfig,
    tokenizer: Any,
    num_examine: int,
    **reward_kwargs: Any,
) -> AbstractRewardManager:
    """
    Load and initialize a reward manager based on the configuration.

    Args:
        config: PPO trainer configuration object containing reward_model fields.
        tokenizer: Tokenizer object used for processing text.
        num_examine: Number of samples to examine.
        **reward_kwargs: Additional keyword arguments for the reward manager.

    Returns:
        An instance of the specified reward manager class.
    """


@tqbridge(put_data=False)
def compute_reward(
    data: DataProto,
    reward_fn: AbstractRewardManager,
) -> tuple[torch.Tensor, dict[str, Any]]:
    """
    Compute reward for a batch of data.

    Args:
        data: DataProto object containing the input data.
        reward_fn: Reward function (AbstractRewardManager instance) to compute the reward.

    Returns:
        Tuple of reward tensor and extra info dictionary.
    """

Import

from verl.trainer.ppo.reward import compute_reward, load_reward_manager

I/O Contract

Inputs (load_reward_manager)

Name	Type	Required	Description
config	DictConfig	Yes	PPO trainer configuration containing reward_manager and reward_model settings
tokenizer	Any	Yes	Tokenizer for text processing
num_examine	int	Yes	Number of samples to examine/log for debugging
**reward_kwargs	Any	No	Additional keyword arguments passed to the reward manager constructor
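Resolving a manager class "via registration or importlib" can be sketched roughly as follows. This is a generic illustration of the pattern, not verl's code: REGISTRY, register, NaiveManagerStub, and resolve_manager_class are hypothetical names, and the rule "registered name first, dotted import path second" is an assumption.

```python
import importlib

# Hypothetical registry mapping short names to manager classes.
REGISTRY: dict[str, type] = {}


def register(name: str):
    """Decorator that records a class under a short name."""
    def decorator(cls: type) -> type:
        REGISTRY[name] = cls
        return cls
    return decorator


@register("naive")
class NaiveManagerStub:
    """Hypothetical manager; the constructor mirrors the inputs listed above."""

    def __init__(self, tokenizer, num_examine, **reward_kwargs):
        self.tokenizer = tokenizer
        self.num_examine = num_examine
        self.reward_kwargs = reward_kwargs


def resolve_manager_class(spec: str) -> type:
    """Look up a registered name, else treat spec as 'package.module.ClassName'."""
    if spec in REGISTRY:
        return REGISTRY[spec]
    module_name, _, class_name = spec.rpartition(".")
    if not module_name:
        raise KeyError(f"unknown reward manager: {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# The factory then instantiates whatever class was resolved.
manager = resolve_manager_class("naive")(tokenizer=None, num_examine=3, strict=True)
```

Checking the registry first keeps built-in names short, while the importlib path lets a user point at a class the library has never seen.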

Inputs (compute_reward)

Name	Type	Required	Description
data	DataProto	Yes	Batch data containing prompts, responses, and metadata
reward_fn	AbstractRewardManager	Yes	Initialized reward manager instance

Outputs (compute_reward)

Name	Type	Description
reward_tensor	torch.Tensor	Per-token reward tensor of shape (batch_size, response_length)
reward_extra_infos_dict	dict[str, Any]	Dictionary of extra reward information (e.g., per-metric scores)
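A scoring function returns one float per sample, while the output is a per-token tensor. One common convention in rule-based reward managers is to write the scalar score at the last valid response token and leave every other position at zero. The sketch below shows that convention with plain Python lists in place of torch tensors; the function name and its parameters are hypothetical, and whether a given manager follows this layout is an assumption to verify against its source.

```python
def place_scalar_rewards(
    scores: list[float],
    valid_lengths: list[int],
    response_length: int,
) -> list[list[float]]:
    """Build a (batch_size, response_length) grid with each score at the last valid token."""
    rewards = [[0.0] * response_length for _ in scores]
    for row, score, n_valid in zip(rewards, scores, valid_lengths):
        if n_valid > 0:
            # Index of the final non-padding token in this response.
            row[n_valid - 1] = float(score)
    return rewards


grid = place_scalar_rewards(scores=[1.0, 0.5], valid_lengths=[2, 3], response_length=4)
# grid == [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.5, 0.0]]
```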

Usage Examples

from omegaconf import OmegaConf
from verl.trainer.ppo.reward import compute_reward, load_reward_manager

# During trainer initialization
config = OmegaConf.load("config.yaml")
tokenizer = ...  # HuggingFace tokenizer

reward_manager = load_reward_manager(
    config=config,
    tokenizer=tokenizer,
    num_examine=3,
)

# During each training step, after rollout generation
# data is a DataProto containing generated responses
data = ...  # DataProto with batch of prompts + responses

reward_tensor, extra_info = compute_reward(
    data=data,
    reward_fn=reward_manager,
)

# reward_tensor shape: (batch_size, response_length)
# extra_info may contain per-metric breakdowns

Custom Reward Function Configuration

# In YAML config:
# custom_reward_function:
#   path: /path/to/my_reward.py
#   name: compute_score
#   reward_kwargs:
#     strict_format: true

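# The custom function signature should be:
def compute_score(data_source, solution_str, ground_truth, extra_info=None, **kwargs):
    """Custom reward scoring function."""
    # Illustrative placeholder rule (an assumption): exact match after trimming whitespace.
    score = 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0
    # Return a float score
    return score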
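Loading a scoring function from a file path and a function name, as the configuration above implies, can be done with the standard library alone. The following is a minimal sketch of that idea, not the loader verl ships: load_function_from_path and the module name "custom_reward_module" are hypothetical, and binding reward_kwargs with functools.partial is an assumed design.

```python
import importlib.util
from functools import partial
from typing import Any, Callable


def load_function_from_path(path: str, name: str, **bound_kwargs: Any) -> Callable:
    """Import the file at `path` as a module and return its attribute `name`."""
    spec = importlib.util.spec_from_file_location("custom_reward_module", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load a module from {path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    if not hasattr(module, name):
        raise AttributeError(f"{path!r} defines no function named {name!r}")
    fn = getattr(module, name)
    # Pre-bind configured keyword arguments (e.g. strict_format=True).
    return partial(fn, **bound_kwargs) if bound_kwargs else fn
```

A caller would pass the path and name values from the custom_reward_function section, plus any reward_kwargs, and receive a callable with the compute_score signature shown above.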

Related Pages

Implements Principle

Principle:Volcengine_Verl_Rule_Based_Reward_Computation
