Principle:Volcengine Verl Reward Model Scoring
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, RLHF, Reward_Modeling |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
A reward computation strategy that uses a trained neural network to score generated responses, providing learned preference signals for policy optimization.
Description
Reward Model Scoring uses a separately trained reward model to evaluate generated responses. The reward model is typically a transformer that has been fine-tuned on human preference data (e.g., chosen/rejected pairs) to predict which responses humans would prefer.
In the RLHF pipeline, the reward model acts as a proxy for human judgment:
- It scores each generated completion with a scalar value
- Higher scores indicate responses more aligned with human preferences
- The scores are used as rewards in the RL training loop
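To make the "one scalar per completion" idea concrete, the sketch below expands sequence-level scores into per-token rewards. It assumes a common convention in which the scalar is credited to the last valid response token and every other position receives zero. The function name `place_sequence_rewards` and the mask layout are hypothetical and are not taken from verl's source.

```python
from typing import List


def place_sequence_rewards(
    scores: List[float], response_masks: List[List[int]]
) -> List[List[float]]:
    """Expand one scalar score per response into per-token rewards.

    Assumption: the score is placed on the last valid token (mask == 1);
    all other positions, including padding, get 0.0.
    """
    token_rewards = []
    for score, mask in zip(scores, response_masks):
        row = [0.0] * len(mask)
        valid_positions = [i for i, m in enumerate(mask) if m == 1]
        if valid_positions:  # leave an all-padding row as zeros
            row[valid_positions[-1]] = score
        token_rewards.append(row)
    return token_rewards


# Two responses padded to length 4; 1 marks a real token, 0 marks padding.
scores = [0.8, -1.2]
masks = [[1, 1, 1, 0], [1, 1, 0, 0]]
token_rewards = place_sequence_rewards(scores, masks)
```

Placing the scalar at the end of the response keeps the reward sparse, so per-token terms such as a KL penalty can be added position by position.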
verl supports running the reward model as a separate distributed worker, with an optional KL penalty that keeps the policy from diverging too far from the reference model.
Usage
Use reward model scoring when:
- Training for alignment tasks (helpfulness, harmlessness) where answers are subjective
- Human preference data is available for training a reward model
- Task quality cannot be verified with simple rules
To enable it, set reward_model.enable=True and point reward_model.model.path=<model_id> at the reward model to load.
Theoretical Basis
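The reward model is trained on preference pairs. For a prompt $x$, it minimizes the pairwise (Bradley-Terry) loss:
$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]$$
Where:
- $r_\theta$ is the reward model with parameters $\theta$
- $y_w$ is the preferred (chosen) response
- $y_l$ is the rejected response
- $\sigma$ is the sigmoid function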
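A minimal sketch of the pairwise objective, assuming the standard Bradley-Terry form in which each pair contributes -log(sigmoid(chosen score - rejected score)). It uses only the Python standard library, and the function names are illustrative. A real training loop would compute the same quantity on tensors so that gradients can flow back to the reward model.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_preference_loss(
    chosen_scores: Sequence[float], rejected_scores: Sequence[float]
) -> float:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over all pairs."""
    if len(chosen_scores) != len(rejected_scores) or not chosen_scores:
        raise ValueError("need equally sized, non-empty score lists")
    losses = [
        -log_sigmoid(c - r) for c, r in zip(chosen_scores, rejected_scores)
    ]
    return sum(losses) / len(losses)


# The loss shrinks as the chosen response is scored further above the rejected one.
well_ranked = pairwise_preference_loss([2.0, 1.5], [-1.0, 0.0])
mis_ranked = pairwise_preference_loss([-1.0, 0.0], [2.0, 1.5])
```

When both responses receive the same score, the loss equals log 2, which is the value for a model that cannot distinguish the pair.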
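During RL training, the total reward may include a KL penalty:
$$r_{\text{total}}(x, y) = r_\theta(x, y) - \beta \, D_{\mathrm{KL}}\left( \pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \right)$$
Here $r_\theta(x, y)$ is the reward model score, $\beta$ is the KL coefficient (kl_coef in the pseudo-code), $\pi$ is the current policy, and $\pi_{\text{ref}}$ is the reference policy.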
Pseudo-code:
# Abstract reward model scoring
reward_scores = reward_model.forward(prompt + response)  # scalar per response
if use_kl_penalty:
    kl = compute_kl(policy_log_probs, ref_log_probs)
    reward_scores = reward_scores - kl_coef * kl
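The pseudo-code above can be turned into a small runnable sketch. It assumes a per-token KL estimate equal to the difference between the policy and reference log-probabilities of the sampled token, which is one simple estimator among several. The function name `kl_penalized_rewards` and the sample numbers are made up for illustration.

```python
from typing import List, Sequence


def kl_penalized_rewards(
    token_scores: Sequence[float],
    policy_log_probs: Sequence[float],
    ref_log_probs: Sequence[float],
    kl_coef: float,
) -> List[float]:
    """Subtract kl_coef * (log pi - log pi_ref) from each token's score."""
    if not (len(token_scores) == len(policy_log_probs) == len(ref_log_probs)):
        raise ValueError("all inputs must have the same length")
    return [
        score - kl_coef * (lp - lr)
        for score, lp, lr in zip(token_scores, policy_log_probs, ref_log_probs)
    ]


# One response of three tokens; the reward model score sits on the last token.
token_scores = [0.0, 0.0, 1.0]
policy_log_probs = [-1.0, -2.0, -0.5]
ref_log_probs = [-1.5, -2.0, -0.25]
shaped = kl_penalized_rewards(
    token_scores, policy_log_probs, ref_log_probs, kl_coef=0.1
)
```

Tokens to which the policy assigns more probability than the reference are penalized, and tokens where it assigns less receive a small bonus, so the penalty pulls the policy back toward the reference in both directions.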