Principle:Volcengine Verl Reward Configuration Schema

Knowledge Sources	verl
Domains	Data_Engineering, Reward_Engineering, Configuration
Last Updated	2026-02-07 14:00 GMT

Overview

A standardized dictionary schema embedded in each data row that configures how rewards are computed during training, specifying either rule-based matching or learned reward model scoring.

Description

Reward Configuration Schema defines the reward_model field in each data row. This field tells the training pipeline how to compute rewards for generated responses. Two primary styles:

Rule-based (style="rule"): Uses deterministic functions to compare generated answers against ground_truth
Model-based (style="model"): Uses a learned reward model to score responses

Additional fields may include:

eval: Evaluation method (e.g., "multiple_choice" for loglikelihood evaluation)
choices: List of valid choices for multiple-choice tasks

Usage

Reward configuration is set during data preprocessing and consumed by the reward manager during training. The data_source field determines which reward function is used.

Theoretical Basis

Reward configuration is a simple schema pattern:

# Rule-based reward config
reward_config_rule = {
    "style": "rule",
    "ground_truth": "42"  # Expected answer
}

# Model-based reward config
reward_config_model = {
    "style": "model",
    "ground_truth": ""  # Not needed for learned RM
}

# Multiple-choice reward config
reward_config_mc = {
    "style": "model",
    "eval": "multiple_choice",
    "ground_truth": 2,   # Correct choice index
    "choices": ["A", "B", "C", "D"]
}

Related Pages

Implemented By

Implementation:Volcengine_Verl_Reward_Config_Dict

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment