Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Volcengine Verl Reward Configuration Schema

From Leeroopedia
Revision as of 17:38, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Volcengine_Verl_Reward_Configuration_Schema.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Data_Engineering, Reward_Engineering, Configuration
Last Updated 2026-02-07 14:00 GMT

Overview

A standardized dictionary schema embedded in each data row that configures how rewards are computed during training, specifying either rule-based matching or learned reward model scoring.

Description

Reward Configuration Schema defines the reward_model field in each data row. This field tells the training pipeline how to compute rewards for generated responses. Two primary styles:

  • Rule-based (style="rule"): Uses deterministic functions to compare generated answers against ground_truth
  • Model-based (style="model"): Uses a learned reward model to score responses

Additional fields may include:

  • eval: Evaluation method (e.g., "multiple_choice" for loglikelihood evaluation)
  • choices: List of valid choices for multiple-choice tasks

Usage

Reward configuration is set during data preprocessing and consumed by the reward manager during training. The data_source field determines which reward function is used.

Theoretical Basis

Reward configuration is a simple schema pattern:

# Rule-based reward config
reward_config_rule = {
    "style": "rule",
    "ground_truth": "42"  # Expected answer
}

# Model-based reward config
reward_config_model = {
    "style": "model",
    "ground_truth": ""  # Not needed for learned RM
}

# Multiple-choice reward config
reward_config_mc = {
    "style": "model",
    "eval": "multiple_choice",
    "ground_truth": 2,   # Correct choice index
    "choices": ["A", "B", "C", "D"]
}

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment