Implementation:Volcengine Verl Reward Config Dict
| Field | Value |
|---|---|
| Knowledge Sources | verl source code, data preprocessing examples |
| Domains | Reward Configuration, Data Schema, Training Pipeline |
| Last Updated | 2026-02-07 |
Overview
Description
The reward_model configuration dict is a standardized data schema embedded in every training example that tells the verl training pipeline how to compute rewards for that example. It is stored as the "reward_model" field in each parquet row and is consumed by the reward computation subsystem during RL training.
The dict has a required "style" key that selects the reward computation strategy, and additional keys that vary by style:
- Rule-based ("style": "rule") -- Used for tasks with deterministic correct answers (math problems, code evaluation). The "ground_truth" field contains the expected answer string. A custom reward function compares the model's output against this ground truth.
- Model-based ("style": "model") -- Used for open-ended tasks (alignment, instruction following). A learned reward model scores the model's output. The "ground_truth" field may contain a reference response but is not directly compared.
- Multiple-choice evaluation ("style": "model", "eval": "multiple_choice") -- Used for benchmarks like HellaSwag. The "choices" field contains a list of candidate completions, and "ground_truth" is the index of the correct choice. Evaluation uses log-likelihood comparison.
Usage
This dict is constructed during data preprocessing and stored in parquet format. The training pipeline reads it to determine the appropriate reward function and ground truth for each example.
Code Reference
| Field | Value |
|---|---|
| Rule-based Source | examples/data_preprocess/gsm8k.py, Lines 72-73 |
| Model-based Source | examples/data_preprocess/full_hh_rlhf.py, Lines 112-115 |
| Multiple-choice Source | examples/data_preprocess/hellaswag.py, Lines 73-78 |
| Pattern Type | Pure Python dict construction (no special import needed) |
I/O Contract
Inputs
| Field | Type | Required | Description |
|---|---|---|---|
| style | str | Yes | Reward computation strategy: "rule" or "model". |
| ground_truth | str or int | Yes | The expected answer. A string for rule-based, an index for multiple-choice, or a reference response for model-based. |
| eval | str | No | Evaluation method. Set to "multiple_choice" for log-likelihood evaluation. |
| choices | list[str] | No | Candidate completions for multiple-choice evaluation. |
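The contract in the table above can be checked mechanically before rows are written out. The following is a minimal sketch, not part of verl: the function name `validate_reward_config` and the exact error messages are hypothetical, and the rule that "choices" and an integer "ground_truth" must accompany "eval": "multiple_choice" is an assumption drawn from the multiple-choice description on this page.

```python
from typing import Any

# The two style values listed in the Inputs table.
ALLOWED_STYLES = {"rule", "model"}


def validate_reward_config(cfg: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the dict looks valid.

    Hypothetical helper for illustration only; verl does not ship it.
    """
    problems: list[str] = []

    style = cfg.get("style")
    if style not in ALLOWED_STYLES:
        problems.append(f"style must be one of {sorted(ALLOWED_STYLES)}, got {style!r}")

    if "ground_truth" not in cfg:
        problems.append("ground_truth is required")

    # Assumption: multiple-choice rows need a list of strings plus an index into it.
    if cfg.get("eval") == "multiple_choice":
        choices = cfg.get("choices")
        gold = cfg.get("ground_truth")
        if not isinstance(choices, list) or not all(isinstance(c, str) for c in choices):
            problems.append("choices must be a list of strings for multiple_choice")
        elif not isinstance(gold, int) or isinstance(gold, bool) or not 0 <= gold < len(choices):
            problems.append("ground_truth must be a valid index into choices")

    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a malformed row at once during preprocessing.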
Outputs
The reward config dict is not a function; it is a data structure consumed by the reward computation pipeline. It determines:
| Outcome | Description |
|---|---|
| Reward function selection | "rule" triggers rule-based comparison; "model" triggers reward model inference. |
| Ground truth availability | Provides the reference answer or response for comparison. |
| Evaluation method | Determines whether to use generation-based or log-likelihood-based evaluation. |
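To make the outcomes table concrete, the sketch below maps a reward config dict to the two decisions it drives. It only mirrors the table; the function name `describe_reward_route` and the returned labels are hypothetical and do not correspond to identifiers in the verl code base.

```python
from typing import Any


def describe_reward_route(cfg: dict[str, Any]) -> dict[str, str]:
    """Summarize which scorer and evaluation method a config dict selects.

    Hypothetical helper: the label strings are illustrative, not verl names.
    """
    style = cfg["style"]
    if style == "rule":
        scorer = "rule_based_comparison"
    elif style == "model":
        scorer = "reward_model_inference"
    else:
        raise ValueError(f"unknown reward style: {style!r}")

    # "eval": "multiple_choice" switches from generation to log-likelihood evaluation.
    evaluation = "log_likelihood" if cfg.get("eval") == "multiple_choice" else "generation"
    return {"scorer": scorer, "evaluation": evaluation}
```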
Usage Examples
Rule-based reward (GSM8K math):
# From examples/data_preprocess/gsm8k.py, Lines 68-77
data = {
    "data_source": "openai/gsm8k",
    "prompt": [{"role": "user", "content": question}],
    "ability": "math",
    "reward_model": {
        "style": "rule",
        "ground_truth": "42",  # extracted numeric answer
    },
    "extra_info": {"split": "train", "index": 0},
}
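GSM8K reference solutions end with a line of the form `#### <answer>`, which is where a value such as "42" comes from. The sketch below shows one way to extract that value and score a model answer by exact match. The function names `extract_final_answer` and `exact_match_reward` are hypothetical, and the regular expression is an assumption for illustration rather than a copy of what gsm8k.py or verl's scoring code does.

```python
import re


def extract_final_answer(solution_text: str) -> str:
    """Pull the numeric answer that follows the '####' marker (hypothetical helper)."""
    match = re.search(r"####\s*(-?[0-9.,]+)", solution_text)
    if match is None:
        raise ValueError("no '#### <answer>' marker found")
    # Drop thousands separators so "1,200" and "1200" compare equal.
    return match.group(1).replace(",", "")


def exact_match_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 when the answers match after trimming whitespace, else 0.0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0


reward_model = {
    "style": "rule",
    "ground_truth": extract_final_answer("She sold 6 * 7 = 42 clips.\n#### 42"),
}
score = exact_match_reward(" 42 ", reward_model["ground_truth"])
```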
Model-based reward (HH-RLHF alignment):
# From examples/data_preprocess/full_hh_rlhf.py, Lines 108-117
data = {
    "data_source": "Dahoas/full-hh-rlhf",
    "prompt": [{"role": "user", "content": prompt}],
    "ability": "alignment",
    "reward_model": {
        "style": "model",
        "ground_truth": response,  # reference response (not directly compared)
    },
    "extra_info": {"split": "train", "index": idx},
}
Multiple-choice reward (HellaSwag):
# From examples/data_preprocess/hellaswag.py, Lines 69-80
choices = [preprocess(ending) for ending in doc["endings"]]
gold = int(doc["label"])
data = {
    "data_source": "Rowan/hellaswag",
    "prompt": [{"role": "user", "content": query}],
    "ability": "nlp",
    "reward_model": {
        "style": "model",
        "eval": "multiple_choice",  # use log-likelihood evaluation
        "ground_truth": gold,  # index of the correct choice
        "choices": choices,  # list of candidate completions
    },
    "extra_info": {"split": "train", "index": idx},
}
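Log-likelihood comparison for a row like the one above can be sketched as follows: given one log-likelihood per candidate completion, pick the highest-scoring candidate and compare its index with "ground_truth". This is an illustrative sketch under stated assumptions. The function name `multiple_choice_reward` is hypothetical, the log-likelihoods are assumed to be computed elsewhere by the policy model, and no length normalization is applied.

```python
from typing import Any


def multiple_choice_reward(choice_logprobs: list[float], reward_model: dict[str, Any]) -> float:
    """Return 1.0 if the most likely choice is the gold index, else 0.0.

    Hypothetical helper; choice_logprobs[i] is assumed to be the model's
    log-likelihood of reward_model["choices"][i] given the prompt.
    """
    choices = reward_model["choices"]
    if len(choice_logprobs) != len(choices):
        raise ValueError("need exactly one log-likelihood per choice")
    # Index of the highest log-likelihood (first one wins on ties).
    predicted = max(range(len(choices)), key=lambda i: choice_logprobs[i])
    return 1.0 if predicted == reward_model["ground_truth"] else 0.0


example_config = {
    "style": "model",
    "eval": "multiple_choice",
    "ground_truth": 2,
    "choices": ["ending a", "ending b", "ending c", "ending d"],
}
example_score = multiple_choice_reward([-12.3, -9.8, -7.1, -15.0], example_config)
```

Longer completions tend to receive lower total log-likelihood, so a real evaluation may divide each score by the completion length before comparing; that choice is left out here to keep the sketch small.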
MATH dataset rule-based reward:
# From examples/data_preprocess/math_dataset.py, Lines 71-76
data = {
    "data_source": "DigitalLearningGmbH/MATH-lighteval",
    "prompt": [{"role": "user", "content": question}],
    "ability": "math",
    "reward_model": {
        "style": "rule",
        "ground_truth": solution,  # extracted from \boxed{} notation
    },
    "extra_info": {"split": split, "index": idx},
}
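Extracting the answer from \boxed{} notation needs brace matching, because the boxed content often contains nested braces such as \frac{1}{2}. The sketch below shows one way to do that. The function name `last_boxed_content` is hypothetical and is not claimed to match the helper that math_dataset.py actually uses.

```python
from typing import Optional


def last_boxed_content(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None if absent.

    Hypothetical helper: walks forward from the opening brace and tracks
    nesting depth so inner braces stay part of the answer.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None

    depth = 1
    collected = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(collected)
        collected.append(ch)
    return None  # braces never balanced


solution = last_boxed_content(r"So the probability is $\boxed{\frac{1}{2}}$.")
reward_model = {"style": "rule", "ground_truth": solution}
```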