Implementation:Volcengine Verl RewardModelWorker Compute Reward


Knowledge Sources
Domains: Reinforcement_Learning, Reward_Modeling
Type: API Doc
Last Updated: 2026-02-07 14:00 GMT

Overview

API documentation for the learned reward model scoring subsystem in verl, which uses a transformer-based reward model to score generated responses.

Description

The reward model worker subsystem provides neural reward scoring as an alternative (or complement) to rule-based reward functions. When enabled, a separate transformer model (typically a sequence classifier fine-tuned on preference data) scores each generated response. The model reads the concatenated prompt and response and produces a scalar score, which is written at the EOS token position of the reward tensor; every other position is zero. Optionally, a KL divergence penalty between the actor policy and a reference policy, scaled by a KL coefficient, is subtracted from the reward signal to keep the actor from drifting too far from the reference model.
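The KL-regularized reward described above can be sketched in a few lines. This is a minimal illustration, not verl's implementation: the function name kl_penalized_rewards is hypothetical, plain Python lists stand in for tensors, and the per-token KL estimate (actor log-probability minus reference log-probability) is an assumption chosen for simplicity.

```python
from typing import List


def kl_penalized_rewards(
    rm_rewards: List[float],
    actor_logprobs: List[float],
    ref_logprobs: List[float],
    kl_coef: float,
) -> List[float]:
    """Subtract a scaled per-token KL estimate from the reward-model signal."""
    # Assumption: KL is estimated per token as log pi_actor - log pi_ref.
    kl = [a - r for a, r in zip(actor_logprobs, ref_logprobs)]
    return [rm - kl_coef * k for rm, k in zip(rm_rewards, kl)]


# One response of four tokens; the reward model scored it 1.5 at the EOS position.
rm_rewards = [0.0, 0.0, 0.0, 1.5]
actor_logprobs = [-1.0, -0.5, -2.0, -0.1]
ref_logprobs = [-1.2, -0.5, -1.5, -0.3]

total = kl_penalized_rewards(rm_rewards, actor_logprobs, ref_logprobs, kl_coef=0.02)
```

With a small coefficient such as 0.02, the EOS reward still dominates the signal, while tokens where the actor is more confident than the reference receive a slight penalty.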

The subsystem is built around the BasePPORewardModel abstract class, which defines the compute_reward interface. Concrete implementations include MegatronRewardModel for Megatron-based backends. The reward model is configured via RewardModelConfig and uses HFModelConfig for model architecture settings.

Usage

Enable the reward model by setting reward_model.enable=True in the configuration and provide the model path via reward_model.model.path. To apply KL regularization, set algorithm.use_kl_in_reward=True and set algorithm.kl_ctrl.kl_coef to the desired coefficient. The reward model worker is assigned to a resource pool: it can be colocated with other workers or, with reward_model.enable_resource_pool=True, placed in a separate pool.

Code Reference

Source Location

  • Repository: verl
  • File (base class): verl/workers/reward_model/base.py
  • File (config): verl/workers/config/reward_model.py
  • File (Megatron impl): verl/workers/reward_model/megatron/reward_model.py

Signature

class BasePPORewardModel(ABC):
    """Base class for reward model."""

    def __init__(
        self,
        config: RewardModelConfig,
        model_config: HFModelConfig,
        device_mesh: DeviceMesh,
    ):
        self.config = config
        self.model_config = model_config
        self.device_mesh = device_mesh

    @abstractmethod
    def compute_reward(self, data: DataProto) -> DataProto:
        """
        Compute reward given input_ids.

        Args:
            data: Must contain keys "input_ids", "attention_mask", "position_ids".
                - input_ids: [batch_size, sequence_length]
                - attention_mask: [batch_size, sequence_length]
                - position_ids: [batch_size, sequence_length]

        Returns:
            DataProto containing "reward" key.
                - reward: [batch_size, sequence_length]
                  Only the [EOS] position contains the reward score.
        """
        pass


@dataclass
class RewardModelConfig(BaseConfig):
    enable: bool = False
    enable_resource_pool: bool = False
    n_gpus_per_node: int = 0
    nnodes: int = 0
    model_path: Optional[str] = None
    inference: RolloutConfig = field(default_factory=RolloutConfig)
    model: HFModelConfig = field(default_factory=HFModelConfig)
    sandbox_fusion: SandboxFusionConfig = field(default_factory=SandboxFusionConfig)
    use_reward_loop: bool = True
    num_workers: int = 8

Import

from verl.workers.reward_model.base import BasePPORewardModel
from verl.workers.config.reward_model import RewardModelConfig

I/O Contract

Inputs (compute_reward)

Name | Type | Required | Description
data | DataProto | Yes | Batch data containing input_ids, attention_mask, and position_ids tensors
data.batch["input_ids"] | torch.Tensor | Yes | Token IDs of shape (batch_size, sequence_length)
data.batch["attention_mask"] | torch.Tensor | Yes | Attention mask of shape (batch_size, sequence_length)
data.batch["position_ids"] | torch.Tensor | Yes | Position IDs of shape (batch_size, sequence_length)

Outputs (compute_reward)

Name | Type | Description
result | DataProto | DataProto containing the "reward" key
result.batch["reward"] | torch.Tensor | Reward scores of shape (batch_size, sequence_length); non-zero only at EOS positions
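To make the output contract concrete, the sketch below builds a reward grid that is zero everywhere except at the last attended position of each row. It is an illustration only: place_scores_at_eos is a hypothetical helper, nested Python lists stand in for torch.Tensor, and treating the last attended token as EOS is an assumption.

```python
from typing import List


def place_scores_at_eos(
    scores: List[float], attention_mask: List[List[int]]
) -> List[List[float]]:
    """Return a [batch_size, sequence_length] grid holding each score
    at the last attended token of its row and zero elsewhere."""
    rewards = []
    for score, mask in zip(scores, attention_mask):
        row = [0.0] * len(mask)
        # Indices whose mask value is 1; the last one is assumed to be EOS.
        attended = [i for i, m in enumerate(mask) if m == 1]
        if attended:
            row[attended[-1]] = score
        rewards.append(row)
    return rewards


# Padding may appear on either side of the attended span.
attention_mask = [
    [0, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
]
rewards = place_scores_at_eos([0.7, -1.2], attention_mask)
```

Keeping the reward in a full (batch_size, sequence_length) grid, rather than one scalar per sequence, lets it be combined position by position with other per-token quantities such as a KL penalty.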

Configuration Keys

Config Key | Type | Description
reward_model.enable | bool | Enable learned reward model scoring (default: False)
reward_model.model.path | str | HuggingFace model ID or local path for the reward model
reward_model.enable_resource_pool | bool | Whether to use a separate resource pool for the reward model
algorithm.use_kl_in_reward | bool | Apply the KL penalty to the reward signal (default: False)
algorithm.kl_ctrl.kl_coef | float | KL penalty coefficient
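The dotted keys in this table correspond to paths in the nested configuration. As a rough illustration of that mapping, the sketch below flattens a nested dictionary into key=value strings. The helper to_dotted_overrides is hypothetical and not part of verl; the values are taken from the example configuration on this page.

```python
from typing import Any, Dict, List


def to_dotted_overrides(config: Dict[str, Any], prefix: str = "") -> List[str]:
    """Flatten a nested config dict into 'a.b.c=value' strings."""
    overrides: List[str] = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted path.
            overrides.extend(to_dotted_overrides(value, path))
        else:
            overrides.append(f"{path}={value}")
    return overrides


config = {
    "reward_model": {
        "enable": True,
        "model": {"path": "Skywork/Skywork-Reward-Llama-3.1-8B"},
    },
    "algorithm": {
        "use_kl_in_reward": True,
        "kl_ctrl": {"kl_coef": 0.02},
    },
}
overrides = to_dotted_overrides(config)
```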

Usage Examples

# Configuration (YAML)
# reward_model:
#   enable: True
#   model:
#     path: Skywork/Skywork-Reward-Llama-3.1-8B
#     trust_remote_code: False
#   inference:
#     tensor_model_parallel_size: 2
#     gpu_memory_utilization: 0.5
#
# algorithm:
#   use_kl_in_reward: True
#   kl_ctrl:
#     kl_coef: 0.02

# In the training loop, the reward model is invoked automatically:
# 1. Rollout generates responses
# 2. If reward_model.enable is True, the trainer calls:
#    reward_model_output = reward_model_wg.compute_reward(batch_data)
# 3. The reward tensor is merged into the training batch
# 4. If use_kl_in_reward is True, a KL penalty is computed and subtracted:
#    total_reward = rm_reward - kl_coef * kl_divergence

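# Programmatic usage of the base class
import torch

from verl import DataProto
from verl.workers.reward_model.base import BasePPORewardModel


class MyRewardModel(BasePPORewardModel):
    def compute_reward(self, data: DataProto) -> DataProto:
        input_ids = data.batch["input_ids"]
        attention_mask = data.batch["attention_mask"]

        # Run the transformer forward pass. The base class does not create
        # self.model; this sketch assumes the subclass has loaded a sequence
        # classifier that returns one logit per sequence.
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        scores = outputs.logits.squeeze(-1)  # assumed shape: [batch_size]

        # Write each score at the last attended position (taken to be EOS)
        # and leave every other position at zero.
        positions = torch.arange(attention_mask.size(1), device=attention_mask.device)
        eos_idx = (attention_mask * positions).argmax(dim=-1)
        reward = torch.zeros(attention_mask.shape, dtype=scores.dtype, device=scores.device)
        reward[torch.arange(reward.size(0), device=reward.device), eos_idx] = scores
        return DataProto.from_dict({"reward": reward})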

Related Pages

Implements Principle
