Principle:Volcengine Verl Reward Model Scoring

Knowledge Sources	Training Language Models to Follow Instructions with Human Feedback; Learning to Summarize from Human Feedback
Domains	Reinforcement_Learning, RLHF, Reward_Modeling
Last Updated	2026-02-07 14:00 GMT

Overview

A reward computation strategy that uses a trained neural network to score generated responses, providing learned preference signals for policy optimization.

Description

Reward Model Scoring uses a separately trained reward model to evaluate generated responses. The reward model is typically a transformer that has been fine-tuned on human preference data (e.g., chosen/rejected pairs) to predict which responses humans would prefer.

In the RLHF pipeline, the reward model acts as a proxy for human judgment:

It scores each generated completion with a scalar value
Higher scores indicate responses more aligned with human preferences
The scores are used as rewards in the RL training loop
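The three steps above can be sketched in plain Python. This is only an illustration, not verl's implementation: `score_completions`, `to_token_level_rewards`, and `toy_reward_fn` are hypothetical names, the toy scorer stands in for a trained reward model, whitespace-separated words stand in for tokens, and placing the scalar on the final response token is one common convention assumed here.

```python
from typing import Callable, List


def score_completions(
    prompts: List[str],
    responses: List[str],
    reward_fn: Callable[[str, str], float],
) -> List[float]:
    """Return one scalar score per (prompt, response) pair."""
    if len(prompts) != len(responses):
        raise ValueError("prompts and responses must have the same length")
    return [float(reward_fn(p, r)) for p, r in zip(prompts, responses)]


def to_token_level_rewards(score: float, response_length: int) -> List[float]:
    """Spread a sequence-level score over tokens.

    Assumption: the reward is sparse, with the whole score placed on the
    last response token and zeros everywhere else.
    """
    if response_length <= 0:
        raise ValueError("response_length must be positive")
    rewards = [0.0] * response_length
    rewards[-1] = score
    return rewards


def toy_reward_fn(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a trained reward model (word count / 10)."""
    return len(response.split()) / 10.0


prompts = ["What is KL divergence?", "Is water wet?"]
responses = ["A measure of how two distributions differ.", "Yes."]

# Step 1: one scalar per completion.
scores = score_completions(prompts, responses, toy_reward_fn)

# Step 3: turn each scalar into a per-token reward sequence for the RL loop.
token_rewards = [
    to_token_level_rewards(s, len(r.split())) for s, r in zip(scores, responses)
]
```

In a real pipeline `reward_fn` would wrap a forward pass of the reward model, and the response length would come from the tokenizer rather than from whitespace splitting.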

verl supports running the reward model as a separate distributed worker. An optional KL penalty can be subtracted from the reward to keep the policy from drifting too far from the reference model (typically a frozen copy of the initial policy).

Usage

Use reward model scoring when:

Training for alignment tasks (helpfulness, harmlessness) where answers are subjective
Human preference data is available for training a reward model
Task quality cannot be verified with simple rules

Configure via reward_model.enable=True and reward_model.model.path=<model_id>.
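As a small illustration, the two settings can be collected and rendered as dotted `key=value` overrides. The helper `to_overrides` is hypothetical, `<model_id>` is left as the placeholder used above, and how the overrides are passed to the trainer (command line or config file) is assumed to follow your verl setup and is not shown here.

```python
from typing import Dict, List


def to_overrides(settings: Dict[str, object]) -> List[str]:
    """Render settings as dotted `key=value` override strings."""
    return [f"{key}={value}" for key, value in settings.items()]


# Only the two keys named in the text are used; the path is a placeholder.
reward_model_settings = {
    "reward_model.enable": True,
    "reward_model.model.path": "<model_id>",
}

overrides = to_overrides(reward_model_settings)
print(" ".join(overrides))
# prints: reward_model.enable=True reward_model.model.path=<model_id>
```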

Theoretical Basis

The reward model is trained on preference pairs:

$L_{\mathrm{RM}} = - 𝔼_{(x, y_{w}, y_{l})} [\log σ (r_{ϕ} (x, y_{w}) - r_{ϕ} (x, y_{l}))]$

Where:

$x$ is the prompt
$r_{ϕ}$ is the reward model with parameters $ϕ$
$y_{w}$ is the preferred (chosen) response
$y_{l}$ is the rejected response
$σ$ is the sigmoid function
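To make the loss concrete, the following sketch evaluates it on already-computed scalar scores. It assumes each pair is given as `(r_w, r_l)`, the reward model's scores for the chosen and rejected response. The function names are illustrative and the expectation is approximated by a batch mean.

```python
import math
from typing import List, Tuple


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_rm_loss(pairs: List[Tuple[float, float]]) -> float:
    """Mean of -log sigmoid(r_w - r_l) over (chosen, rejected) score pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)


# Equal scores: the model cannot tell the pair apart, so the loss is log(2).
loss_tied = pairwise_rm_loss([(0.0, 0.0)])

# Chosen response scored higher than rejected: small loss.
loss_correct = pairwise_rm_loss([(2.0, 0.0)])

# Rejected response scored higher than chosen: large loss.
loss_wrong = pairwise_rm_loss([(0.0, 2.0)])
```

The loss depends only on the score difference $r_{ϕ} (x, y_{w}) - r_{ϕ} (x, y_{l})$, so adding a constant to every score leaves it unchanged. Only the relative ordering and margin of the scores are learned.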

During RL training, the total reward may include a KL penalty:

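$R_{\mathrm{total}} (x, y) = r_{ϕ} (x, y) - β \cdot D_{\mathrm{KL}} (π_{θ} (y | x) \| π_{\mathrm{ref}} (y | x))$

Here $β$ is the KL penalty coefficient, $π_{θ}$ is the policy being trained, and $π_{\mathrm{ref}}$ is the reference model.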

Pseudo-code:

# Abstract reward model scoring
reward_scores = reward_model.forward(prompt + response)  # scalar per response
if use_kl_penalty:
    kl = compute_kl(policy_log_probs, ref_log_probs)
    reward_scores = reward_scores - kl_coef * kl
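A runnable version of the pseudo-code, under stated assumptions: the KL term is approximated from the single sampled response as the sum of per-token log-probability differences between policy and reference, and the penalty is applied at the sequence level. The function names are illustrative; some implementations apply the penalty token by token or use other KL estimators.

```python
from typing import List


def kl_estimate(policy_log_probs: List[float], ref_log_probs: List[float]) -> float:
    """Single-sample KL estimate: sum over tokens of log pi_theta - log pi_ref."""
    if len(policy_log_probs) != len(ref_log_probs):
        raise ValueError("log-prob sequences must have the same length")
    return sum(p - r for p, r in zip(policy_log_probs, ref_log_probs))


def penalized_reward(
    score: float,
    policy_log_probs: List[float],
    ref_log_probs: List[float],
    kl_coef: float,
    use_kl_penalty: bool = True,
) -> float:
    """Return score - kl_coef * KL estimate, or the raw score if disabled."""
    if not use_kl_penalty:
        return score
    return score - kl_coef * kl_estimate(policy_log_probs, ref_log_probs)


# Toy numbers: a two-token response whose tokens the policy finds
# more likely than the reference does.
policy_lp = [-1.0, -0.5]
ref_lp = [-1.2, -0.9]

total = penalized_reward(1.0, policy_lp, ref_lp, kl_coef=0.1)  # about 0.94
raw = penalized_reward(1.0, policy_lp, ref_lp, kl_coef=0.1, use_kl_penalty=False)
```

With a positive `kl_coef`, responses that the policy makes much more likely than the reference are penalized, which discourages the policy from exploiting weaknesses in the reward model.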

Related Pages

Implemented By

Implementation:Volcengine_Verl_RewardModelWorker_Compute_Reward
