Principle:Volcengine Verl Reward Model Scoring

From Leeroopedia


Knowledge Sources
Domains: Reinforcement_Learning, RLHF, Reward_Modeling
Last Updated: 2026-02-07 14:00 GMT

Overview

A reward computation strategy that uses a trained neural network to score generated responses, providing learned preference signals for policy optimization.

Description

Reward Model Scoring uses a separately trained reward model to evaluate generated responses. The reward model is typically a transformer that has been fine-tuned on human preference data (e.g., chosen/rejected pairs) to predict which responses humans would prefer.

In the RLHF pipeline, the reward model acts as a proxy for human judgment:

  • It scores each generated completion with a scalar value
  • Higher scores indicate responses more aligned with human preferences
  • The scores are used as rewards in the RL training loop

verl supports running the reward model as a separate distributed worker, with an optional KL penalty that keeps the policy from diverging too far from the reference model.

Usage

Use reward model scoring when:

  • Training for alignment tasks (helpfulness, harmlessness) where answers are subjective
  • Human preference data is available for training a reward model
  • Task quality cannot be verified with simple rules

Configure via reward_model.enable=True and reward_model.model.path=<model_id>.

Theoretical Basis

The reward model is trained on preference pairs:

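$$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

Where:

  • $r_\phi$ is the reward model with parameters $\phi$
  • $y_w$ is the preferred (chosen) response
  • $y_l$ is the rejected response
  • $\sigma$ is the sigmoid function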
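
The pairwise loss above can be illustrated with a minimal Python sketch. It uses plain floats as stand-ins for the reward model's scalar outputs; the function names `log_sigmoid` and `pairwise_rm_loss` are hypothetical and are not part of verl.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_rm_loss(chosen_scores: Sequence[float],
                     rejected_scores: Sequence[float]) -> float:
    """Batch mean of -log sigmoid(r_w - r_l) over preference pairs.

    chosen_scores[i] and rejected_scores[i] are assumed to be the reward
    model's scores for the chosen and rejected response to the same prompt.
    """
    if len(chosen_scores) != len(rejected_scores) or not chosen_scores:
        raise ValueError("need equally sized, non-empty score lists")
    losses = [-log_sigmoid(r_w - r_l)
              for r_w, r_l in zip(chosen_scores, rejected_scores)]
    return sum(losses) / len(losses)
```

When the two scores are equal the loss is log 2 (about 0.693), and it shrinks toward zero as the chosen response is scored increasingly higher than the rejected one, which is what pushes the model to rank preferred responses above rejected ones.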

During RL training, the total reward may include a KL penalty:

$$R_{total}(x, y) = r_\phi(x, y) - \beta \, D_{KL}\left(\pi_\theta(y|x) \,\|\, \pi_{ref}(y|x)\right)$$
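
The KL term is an expectation over responses drawn from the policy, so in practice it is often estimated from the log-probabilities of the sampled response itself. The identity below shows why a difference of log-probabilities can stand in for the divergence; it assumes an autoregressive policy that factorizes over tokens $y_t$, a notation this page does not otherwise introduce.

```latex
D_{KL}\left(\pi_\theta(\cdot|x) \,\|\, \pi_{ref}(\cdot|x)\right)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot|x)}
      \left[ \log \pi_\theta(y|x) - \log \pi_{ref}(y|x) \right]
  = \mathbb{E}_{y \sim \pi_\theta(\cdot|x)}
      \left[ \sum_{t} \left( \log \pi_\theta(y_t|x, y_{<t})
                           - \log \pi_{ref}(y_t|x, y_{<t}) \right) \right]
```

For a single sampled response, the bracketed sum is an unbiased one-sample estimate of the divergence.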

Pseudo-code:

# Abstract reward model scoring
reward_scores = reward_model.forward(prompt + response)  # scalar per response
if use_kl_penalty:
    kl = compute_kl(policy_log_probs, ref_log_probs)
    reward_scores = reward_scores - kl_coef * kl
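
As a runnable counterpart to the pseudo-code, here is a minimal Python sketch. It is not verl's implementation: the names `sequence_kl_estimate` and `total_reward` are hypothetical, and it assumes the KL term is estimated for one sampled response as the sum of per-token log-probability differences between the policy and the reference model.

```python
from typing import Sequence


def sequence_kl_estimate(policy_log_probs: Sequence[float],
                         ref_log_probs: Sequence[float]) -> float:
    """One-sample KL estimate: sum over tokens of log pi_theta - log pi_ref."""
    if len(policy_log_probs) != len(ref_log_probs):
        raise ValueError("log-prob sequences must have the same length")
    return sum(p - r for p, r in zip(policy_log_probs, ref_log_probs))


def total_reward(rm_score: float,
                 policy_log_probs: Sequence[float],
                 ref_log_probs: Sequence[float],
                 kl_coef: float = 0.0) -> float:
    """Reward-model score minus kl_coef times the estimated KL.

    kl_coef plays the role of beta; kl_coef = 0.0 disables the penalty.
    """
    if kl_coef == 0.0:
        return rm_score
    kl = sequence_kl_estimate(policy_log_probs, ref_log_probs)
    return rm_score - kl_coef * kl


# Hypothetical numbers: a two-token response scored 1.0 by the reward model.
penalized = total_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], kl_coef=0.1)
```

In this example the policy assigns each token a log-probability 0.5 higher than the reference does, so the estimated KL is 1.0 and the penalized reward is about 0.9. A larger `kl_coef` trades reward-model score for staying closer to the reference model.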

Related Pages

Implemented By
