Implementation:Volcengine Verl RewardManager Compute Reward

From Leeroopedia
Revision as of 17:07, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Volcengine_Verl_RewardManager_Compute_Reward.md)


Knowledge Sources
Domains Reinforcement_Learning, Reward_Computation
Last Updated 2026-02-07 14:00 GMT

Overview

Concrete tools for computing rule-based rewards using pluggable reward manager classes and custom scoring functions, provided by the verl library.

Description

The compute_reward function is a thin orchestration wrapper that invokes an AbstractRewardManager instance on a batch of data. It first calls the reward manager with return_dict=True to obtain both a reward tensor and optional extra info. If that call raises an error, it falls back to calling the manager without return_dict and uses the plain tensor it returns.

The load_reward_manager function is the factory that constructs the appropriate reward manager from the configuration. It loads a custom reward function (if configured), resolves the reward manager class via registration or importlib, optionally configures sandbox fusion for code execution scoring, and returns an initialized manager instance.
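The call-and-fallback behaviour described above can be sketched without verl installed. This is a minimal illustration, not the library's source: ToyRewardManager, LegacyRewardManager, and compute_reward_sketch are hypothetical names, plain lists stand in for tensors, and the dictionary keys reward_tensor and reward_extra_info are assumptions about the manager's return format.

```python
from typing import Any


class ToyRewardManager:
    """Hypothetical stand-in for an AbstractRewardManager subclass."""

    def __call__(self, data: list[str], return_dict: bool = False) -> Any:
        # One row per sample; the score sits on the last position.
        rewards = [[0.0, 0.0, float(len(text))] for text in data]
        if return_dict:
            # Key names are assumed for illustration.
            return {
                "reward_tensor": rewards,
                "reward_extra_info": {"length": [len(t) for t in data]},
            }
        return rewards


class LegacyRewardManager:
    """Hypothetical manager that does not accept return_dict."""

    def __call__(self, data: list[str]) -> list[list[float]]:
        return [[0.0, 0.0, 1.0] for _ in data]


def compute_reward_sketch(data, reward_fn):
    """Try the dict-returning call first, then fall back to a plain call."""
    try:
        result = reward_fn(data, return_dict=True)
        reward_tensor = result["reward_tensor"]
        extra_info = result.get("reward_extra_info", {})
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        print(f"Error in reward_fn: {exc}")
        reward_tensor = reward_fn(data)
        extra_info = {}
    return reward_tensor, extra_info
```

With LegacyRewardManager, the first call raises a TypeError because of the unexpected keyword argument, so the sketch takes the fallback path and returns an empty extra-info dictionary.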

Usage

These functions are used in the PPO training loop to compute rewards for generated responses. load_reward_manager is called once during trainer initialization, while compute_reward is called at every training step after rollout generation.

Code Reference

Source Location

  • Repository: verl
  • File: verl/trainer/ppo/reward.py
  • Lines: 99-197

Signature

def load_reward_manager(
    config: DictConfig,
    tokenizer: Any,
    num_examine: int,
    **reward_kwargs: Any,
) -> AbstractRewardManager:
    """
    Load and initialize a reward manager based on the configuration.

    Args:
        config: PPO trainer configuration object containing reward_model fields.
        tokenizer: Tokenizer object used for processing text.
        num_examine: Number of samples to examine.
        **reward_kwargs: Additional keyword arguments for the reward manager.

    Returns:
        An instance of the specified reward manager class.
    """


@tqbridge(put_data=False)
def compute_reward(
    data: DataProto,
    reward_fn: AbstractRewardManager,
) -> tuple[torch.Tensor, dict[str, Any]]:
    """
    Compute reward for a batch of data.

    Args:
        data: DataProto object containing the input data.
        reward_fn: Reward function (AbstractRewardManager instance) to compute the reward.

    Returns:
        Tuple of reward tensor and extra info dictionary.
    """

Import

from verl.trainer.ppo.reward import compute_reward, load_reward_manager

I/O Contract

Inputs (load_reward_manager)

| Name | Type | Required | Description |
|------|------|----------|-------------|
| config | DictConfig | Yes | PPO trainer configuration containing reward_manager and reward_model settings |
| tokenizer | Any | Yes | Tokenizer for text processing |
| num_examine | int | Yes | Number of samples to examine/log for debugging |
| **reward_kwargs | Any | No | Additional keyword arguments passed to the reward manager constructor |
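One step of the factory, loading a custom reward function from a file path and binding configured keyword arguments to it, might look roughly like the sketch below. It uses only the standard library. The helper load_function_from_path, the module name custom_reward_module, and the scale argument are hypothetical and chosen for illustration. verl's actual loader may differ.

```python
import importlib.util
import tempfile
from functools import partial
from pathlib import Path


def load_function_from_path(path: str, name: str):
    """Import the module at `path` and return its attribute `name`."""
    spec = importlib.util.spec_from_file_location("custom_reward_module", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load a module from {path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    if not hasattr(module, name):
        raise AttributeError(f"{path!r} does not define {name!r}")
    return getattr(module, name)


# Write a tiny reward file so the example is self-contained.
source = (
    "def compute_score(data_source, solution_str, ground_truth, extra_info=None, scale=1.0):\n"
    "    return scale * float(solution_str.strip() == ground_truth)\n"
)
with tempfile.TemporaryDirectory() as tmp:
    reward_path = Path(tmp) / "my_reward.py"
    reward_path.write_text(source)
    raw_fn = load_function_from_path(str(reward_path), "compute_score")

# Bind configured keyword arguments once, similar in spirit to reward_kwargs.
score_fn = partial(raw_fn, scale=0.5)
print(score_fn("toy", " 42 ", "42"))
```

Binding the extra arguments with functools.partial keeps the call site uniform: the reward manager can invoke every scoring function with the same positional arguments regardless of which options were configured.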

Inputs (compute_reward)

| Name | Type | Required | Description |
|------|------|----------|-------------|
| data | DataProto | Yes | Batch data containing prompts, responses, and metadata |
| reward_fn | AbstractRewardManager | Yes | Initialized reward manager instance |

Outputs (compute_reward)

| Name | Type | Description |
|------|------|-------------|
| reward_tensor | torch.Tensor | Token-level reward tensor of shape (batch_size, response_length) |
| reward_extra_infos_dict | dict[str, Any] | Dictionary of extra reward information (e.g., per-metric scores) |
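To make the per-token reward tensor concrete, the sketch below shows one common convention for rule-based rewards: the sequence-level score is written onto the last valid response token and every other position stays at zero. This convention is an assumption for illustration and is not stated on this page. The function scores_to_token_rewards is hypothetical, and nested lists stand in for a torch.Tensor.

```python
def scores_to_token_rewards(scores, valid_lengths, max_length):
    """Place each scalar score on the last valid token of its row."""
    rewards = [[0.0] * max_length for _ in scores]
    for row, (score, length) in enumerate(zip(scores, valid_lengths)):
        if length > 0:  # skip empty responses
            rewards[row][length - 1] = float(score)
    return rewards


# Three responses with 3, 4, and 2 valid tokens, padded to length 4.
token_rewards = scores_to_token_rewards([1.0, 0.0, 0.5], [3, 4, 2], max_length=4)

# Summing over the token axis recovers the sequence-level scores.
sequence_rewards = [sum(row) for row in token_rewards]
```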

Usage Examples

from omegaconf import OmegaConf
from verl.trainer.ppo.reward import compute_reward, load_reward_manager

# During trainer initialization
config = OmegaConf.load("config.yaml")
tokenizer = ...  # HuggingFace tokenizer

reward_manager = load_reward_manager(
    config=config,
    tokenizer=tokenizer,
    num_examine=3,
)

# During each training step, after rollout generation
# data is a DataProto containing generated responses
data = ...  # DataProto with batch of prompts + responses

reward_tensor, extra_info = compute_reward(
    data=data,
    reward_fn=reward_manager,
)

# reward_tensor shape: (batch_size, response_length)
# extra_info may contain per-metric breakdowns

Custom Reward Function Configuration

# In YAML config:
# custom_reward_function:
#   path: /path/to/my_reward.py
#   name: compute_score
#   reward_kwargs:
#     strict_format: true

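# The custom function should have this signature and return a float score:
def compute_score(data_source, solution_str, ground_truth, extra_info=None, **kwargs):
    """Custom reward scoring function."""
    # Placeholder logic: exact match against the ground truth.
    score = 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0
    return score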
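A filled-in version of the skeleton above might look like the following. It is only a sketch: the "Answer: ..." response format is invented for illustration, and treating strict_format as a keyword argument assumes that entries under reward_kwargs in the config are forwarded to the function as keyword arguments.

```python
import re


def compute_score(data_source, solution_str, ground_truth, extra_info=None,
                  strict_format=False, **kwargs):
    """Return 1.0 when the response contains the expected answer, else 0.0."""
    target = str(ground_truth).strip()
    # Look for a final "Answer: <value>" at the end of the response.
    match = re.search(r"Answer:\s*(.+?)\s*$", solution_str.strip())
    if match is not None:
        return 1.0 if match.group(1) == target else 0.0
    if strict_format:
        # No trailing "Answer: ..." line: strict mode rejects the response.
        return 0.0
    # Lenient mode: accept the target anywhere in the text.
    return 1.0 if target in solution_str else 0.0
```

Returning a plain float keeps the function easy to unit-test in isolation before it is wired into a training run.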

Related Pages

Implements Principle
