Implementation:Volcengine Verl RewardModelWorker Compute Reward
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Reward_Modeling |
| Type | API Doc |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
API documentation for the learned reward model scoring subsystem in verl, which uses a transformer-based reward model to score generated responses.
Description
The reward model worker subsystem provides neural reward scoring as an alternative or complement to rule-based reward functions. When enabled, a separate transformer model (typically a sequence classifier fine-tuned on preference data) scores each generated response: it reads the concatenated prompt and response, and the scalar score is taken at the EOS token position. Optionally, a KL divergence penalty between the actor policy and a reference policy is subtracted from the reward, scaled by a KL coefficient, to keep the actor from drifting too far from the reference model.
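As a rough illustration of the EOS-position convention described above, the sketch below scatters one scalar score per sequence onto a zero-filled token-level row. It uses plain Python lists instead of tensors so it runs without dependencies. The helper name `place_reward_at_eos` is hypothetical, and treating the last attended token as the EOS position is an assumption, not something taken from the verl source.

```python
from typing import List


def place_reward_at_eos(
    scores: List[float], attention_mask: List[List[int]]
) -> List[List[float]]:
    """Scatter one scalar score per sequence onto its last attended token."""
    rewards = []
    for score, mask in zip(scores, attention_mask):
        row = [0.0] * len(mask)
        # Assumption: the last position with mask == 1 is the EOS token.
        attended = [i for i, m in enumerate(mask) if m == 1]
        if attended:
            row[attended[-1]] = score
        rewards.append(row)
    return rewards


# Two sequences of length 5; the first is right-padded by two tokens.
mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
token_rewards = place_reward_at_eos([0.7, -1.2], mask)
```

Each row stays zero everywhere except at one index, which matches the `[batch_size, sequence_length]` reward shape documented for `compute_reward`.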
The subsystem is built around the BasePPORewardModel abstract class, which defines the compute_reward interface. Concrete implementations include MegatronRewardModel for Megatron-based backends. The reward model is configured via RewardModelConfig and uses HFModelConfig for model architecture settings.
Usage
Enable the reward model by setting reward_model.enable=True in the configuration. Provide the model path via reward_model.model.path. When KL regularization is desired, set algorithm.use_kl_in_reward=True and algorithm.kl_ctrl.kl_coef to the desired coefficient. The reward model worker is managed by the resource pool and can be colocated with other workers.
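The dotted keys above name positions in a nested configuration tree. The sketch below only illustrates that mapping; `nest_overrides` is a hypothetical helper written for this page, not verl's actual configuration loader, and the model path is the one used in the usage example further down.

```python
from typing import Any, Dict


def nest_overrides(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Expand dotted keys such as 'a.b.c' into nested dictionaries."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


config = nest_overrides(
    {
        "reward_model.enable": True,
        "reward_model.model.path": "Skywork/Skywork-Reward-Llama-3.1-8B",
        "algorithm.use_kl_in_reward": True,
        "algorithm.kl_ctrl.kl_coef": 0.02,
    }
)
```

The resulting dictionary has the same nesting as the YAML form of these settings: `reward_model` and `algorithm` are top-level sections, with `model` and `kl_ctrl` as subsections.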
Code Reference
Source Location
- Repository: verl
- File (base class): verl/workers/reward_model/base.py
- File (config): verl/workers/config/reward_model.py
- File (Megatron impl): verl/workers/reward_model/megatron/reward_model.py
Signature
class BasePPORewardModel(ABC):
    """Base class for reward model."""

    def __init__(
        self,
        config: RewardModelConfig,
        model_config: HFModelConfig,
        device_mesh: DeviceMesh,
    ):
        self.config = config
        self.model_config = model_config
        self.device_mesh = device_mesh

    @abstractmethod
    def compute_reward(self, data: DataProto) -> DataProto:
        """
        Compute reward given input_ids.

        Args:
            data: Must contain keys "input_ids", "attention_mask", "position_ids".
                - input_ids: [batch_size, sequence_length]
                - attention_mask: [batch_size, sequence_length]
                - position_ids: [batch_size, sequence_length]

        Returns:
            DataProto containing "reward" key.
                - reward: [batch_size, sequence_length]
                  Only the [EOS] position contains the reward score.
        """
        pass


@dataclass
class RewardModelConfig(BaseConfig):
    enable: bool = False
    enable_resource_pool: bool = False
    n_gpus_per_node: int = 0
    nnodes: int = 0
    model_path: Optional[str] = None
    inference: RolloutConfig = field(default_factory=RolloutConfig)
    model: HFModelConfig = field(default_factory=HFModelConfig)
    sandbox_fusion: SandboxFusionConfig = field(default_factory=SandboxFusionConfig)
    use_reward_loop: bool = True
    num_workers: int = 8
Import
from verl.workers.reward_model.base import BasePPORewardModel
from verl.workers.config.reward_model import RewardModelConfig
I/O Contract
Inputs (compute_reward)
| Name | Type | Required | Description |
|---|---|---|---|
| data | DataProto | Yes | Batch data containing input_ids, attention_mask, and position_ids tensors |
| data.batch["input_ids"] | torch.Tensor | Yes | Token IDs of shape (batch_size, sequence_length) |
| data.batch["attention_mask"] | torch.Tensor | Yes | Attention mask of shape (batch_size, sequence_length) |
| data.batch["position_ids"] | torch.Tensor | Yes | Position IDs of shape (batch_size, sequence_length) |
Outputs (compute_reward)
| Name | Type | Description |
|---|---|---|
| result | DataProto | DataProto containing the "reward" key |
| result.batch["reward"] | torch.Tensor | Reward scores of shape (batch_size, sequence_length); non-zero only at EOS positions |
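Because the reward is non-zero only at the EOS position, a consumer can recover one scalar score per sequence from the token-level output. The sketch below checks that contract on plain Python lists and returns the per-sequence scores. `sequence_scores` is a hypothetical helper, and it assumes the EOS position is the last attended token in `attention_mask`.

```python
from typing import List


def sequence_scores(
    reward: List[List[float]], attention_mask: List[List[int]]
) -> List[float]:
    """Check the output contract and return one scalar score per sequence."""
    if len(reward) != len(attention_mask):
        raise ValueError("reward and attention_mask must share batch_size")
    scores = []
    for row, mask in zip(reward, attention_mask):
        if len(row) != len(mask):
            raise ValueError("reward and attention_mask must share sequence_length")
        attended = [i for i, m in enumerate(mask) if m == 1]
        last = attended[-1] if attended else None
        # Every position other than the last attended one must be zero.
        if any(value != 0.0 for i, value in enumerate(row) if i != last):
            raise ValueError("non-zero reward outside the EOS position")
        scores.append(row[last] if last is not None else 0.0)
    return scores


reward = [[0.0, 0.0, 0.7, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, -1.2]]
mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
scores = sequence_scores(reward, mask)
```

A check like this can be useful when writing a custom `compute_reward` implementation, since it fails loudly if the reward lands on the wrong token.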
Configuration Keys
| Config Key | Type | Description |
|---|---|---|
| reward_model.enable | bool | Enable learned reward model scoring (default: False) |
| reward_model.model.path | str | HuggingFace model ID or local path for the reward model |
| reward_model.enable_resource_pool | bool | Whether to use a separate resource pool for the reward model |
| algorithm.use_kl_in_reward | bool | Add KL penalty to the reward signal (default: False) |
| algorithm.kl_ctrl.kl_coef | float | KL penalty coefficient |
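To make the effect of `algorithm.kl_ctrl.kl_coef` concrete, the sketch below applies the relation `total_reward = rm_reward - kl_coef * kl_divergence` token by token on plain Python lists. The per-token form, the `response_mask` argument, and the function name are assumptions made for illustration; how the KL term is estimated is outside the scope of this sketch.

```python
from typing import List


def apply_kl_penalty(
    rm_reward: List[List[float]],
    kl_divergence: List[List[float]],
    response_mask: List[List[int]],
    kl_coef: float,
) -> List[List[float]]:
    """Subtract a scaled per-token KL term from the reward model output."""
    total = []
    for r_row, k_row, m_row in zip(rm_reward, kl_divergence, response_mask):
        # Masked (padding) positions are forced to zero.
        total.append(
            [(r - kl_coef * k) * m for r, k, m in zip(r_row, k_row, m_row)]
        )
    return total


rm_reward = [[0.0, 0.0, 1.0, 0.0]]      # score sits at the EOS position
kl_divergence = [[0.5, 0.25, 0.0, 0.3]]  # hypothetical per-token KL estimates
response_mask = [[1, 1, 1, 0]]           # last position is padding
total_reward = apply_kl_penalty(rm_reward, kl_divergence, response_mask, kl_coef=0.02)
```

With a larger `kl_coef`, tokens where the actor diverges from the reference policy are penalized more heavily relative to the reward model score.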
Usage Examples
# Configuration (YAML)
# reward_model:
#   enable: True
#   model:
#     path: Skywork/Skywork-Reward-Llama-3.1-8B
#     trust_remote_code: False
#   inference:
#     tensor_model_parallel_size: 2
#     gpu_memory_utilization: 0.5
#
# algorithm:
#   use_kl_in_reward: True
#   kl_ctrl:
#     kl_coef: 0.02
# In the training loop, the reward model is invoked automatically:
# 1. Rollout generates responses
# 2. If reward_model.enable is True, the trainer calls:
#      reward_model_output = reward_model_wg.compute_reward(batch_data)
# 3. The reward tensor is merged into the training batch
# 4. If use_kl_in_reward is True, the KL penalty is computed and subtracted:
#      total_reward = rm_reward - kl_coef * kl_divergence
# Programmatic usage of the base class
from verl.workers.reward_model.base import BasePPORewardModel
from verl import DataProto


class MyRewardModel(BasePPORewardModel):
    def compute_reward(self, data: DataProto) -> DataProto:
        input_ids = data.batch["input_ids"]
        attention_mask = data.batch["attention_mask"]
        # Run the transformer forward pass.
        # self.model is a placeholder: the base class does not create it,
        # so a subclass must load the model itself (for example in __init__).
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        # Extract the reward at EOS positions.
        # extract_eos_reward is a placeholder helper that is not defined here.
        reward = extract_eos_reward(outputs, attention_mask)
        return DataProto.from_dict({"reward": reward})