Principle:Volcengine Verl Reward Configuration Schema
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Reward_Engineering, Configuration |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
A standardized dictionary schema embedded in each data row that configures how rewards are computed during training, specifying either rule-based matching or learned reward model scoring.
Description
The Reward Configuration Schema defines the reward_model field in each data row. This field tells the training pipeline how to compute rewards for generated responses. There are two primary styles:
- Rule-based (style="rule"): Uses deterministic functions to compare generated answers against ground_truth
- Model-based (style="model"): Uses a learned reward model to score responses
Additional fields may include:
- eval: Evaluation method (e.g., "multiple_choice" for log-likelihood evaluation)
- choices: List of valid choices for multiple-choice tasks
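The fields listed above can be written down as a typed dictionary. The sketch below is illustrative only: the class names `_RequiredFields` and `RewardModelConfig` are hypothetical, and treating style and ground_truth as required while eval and choices are optional is an assumption drawn from the examples on this page, not a definition taken from the library.

```python
from typing import Any, List, Literal, TypedDict


class _RequiredFields(TypedDict):
    # Assumed to be present in every reward_model dictionary.
    style: Literal["rule", "model"]
    ground_truth: Any  # a string answer, a choice index, or an empty placeholder


class RewardModelConfig(_RequiredFields, total=False):
    # Assumed to be optional; only the multiple-choice example uses them.
    eval: str
    choices: List[str]


# A multiple-choice configuration annotated with the hypothetical type.
example: RewardModelConfig = {
    "style": "model",
    "eval": "multiple_choice",
    "ground_truth": 2,
    "choices": ["A", "B", "C", "D"],
}
```

A static type checker such as mypy can then flag a misspelled key or an unsupported style value before the data is written out. At runtime a TypedDict is an ordinary dict, so this adds no overhead to preprocessing.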
Usage
The reward configuration is written into each row during data preprocessing and consumed by the reward manager during training. The row's data_source field determines which reward function is applied.
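The flow described above, a row built during preprocessing and a reward function later selected by data_source, can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: `build_row`, `exact_match`, `REWARD_FUNCTIONS`, `compute_reward`, the `"toy/arithmetic"` data source name, and the `prompt` field are all hypothetical.

```python
from typing import Any, Callable, Dict


def build_row(question: str, answer: str, data_source: str) -> Dict[str, Any]:
    """Preprocessing step: attach a rule-based reward config to one example."""
    return {
        "data_source": data_source,
        "prompt": question,  # illustrative field, not part of the schema described here
        "reward_model": {"style": "rule", "ground_truth": answer},
    }


def exact_match(response: str, ground_truth: Any) -> float:
    """Toy rule-based reward: 1.0 on a whitespace-insensitive exact match."""
    return 1.0 if response.strip() == str(ground_truth).strip() else 0.0


# Hypothetical registry mapping a data_source value to its reward function.
REWARD_FUNCTIONS: Dict[str, Callable[[str, Any], float]] = {
    "toy/arithmetic": exact_match,
}


def compute_reward(row: Dict[str, Any], response: str) -> float:
    """Training step: pick the reward function by data_source and score a response."""
    reward_fn = REWARD_FUNCTIONS.get(row["data_source"])
    if reward_fn is None:
        raise NotImplementedError(f"no reward function for {row['data_source']!r}")
    return reward_fn(response, row["reward_model"]["ground_truth"])
```

In this sketch, a row built with `build_row("What is 6 * 7?", "42", "toy/arithmetic")` scores 1.0 for the response `"42"` and 0.0 for `"41"`. Keeping the lookup keyed on data_source means a new dataset only needs a new registry entry, while the reward_model dictionary stays the same shape.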
Theoretical Basis
Reward configuration is a simple schema pattern:
# Rule-based reward config
reward_config_rule = {
    "style": "rule",
    "ground_truth": "42",  # Expected answer
}

# Model-based reward config
reward_config_model = {
    "style": "model",
    "ground_truth": "",  # Not needed for learned RM
}

# Multiple-choice reward config
reward_config_mc = {
    "style": "model",
    "eval": "multiple_choice",
    "ground_truth": 2,  # Correct choice index
    "choices": ["A", "B", "C", "D"],
}
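Because the schema is a plain dictionary, a small check at preprocessing time can catch malformed rows before training starts. The function below is a hedged sketch: `validate_reward_config` and `VALID_STYLES` are hypothetical names, and the rules it enforces (two allowed styles, a required ground_truth key, an in-range integer index for multiple-choice rows) are assumptions inferred from the three configurations above.

```python
from typing import Any, Dict

# Assumed set of allowed styles, taken from the two styles described on this page.
VALID_STYLES = {"rule", "model"}


def validate_reward_config(cfg: Dict[str, Any]) -> None:
    """Raise ValueError if cfg does not match the assumed schema."""
    if cfg.get("style") not in VALID_STYLES:
        raise ValueError(f"style must be one of {sorted(VALID_STYLES)}")
    if "ground_truth" not in cfg:
        raise ValueError("ground_truth key is missing")
    if cfg.get("eval") == "multiple_choice":
        choices = cfg.get("choices")
        if not isinstance(choices, list) or not choices:
            raise ValueError("multiple_choice requires a non-empty choices list")
        index = cfg["ground_truth"]
        # ground_truth is assumed to be an index into choices for this eval type.
        if not isinstance(index, int) or not 0 <= index < len(choices):
            raise ValueError("ground_truth must be a valid index into choices")
```

All three example configurations pass this check, while a config with an unknown style or a choice index outside the list raises ValueError.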