Implementation:Volcengine Verl Reward Config Dict

Field	Value
Knowledge Sources	verl source code, data preprocessing examples
Domains	Reward Configuration, Data Schema, Training Pipeline
Last Updated	2026-02-07

Overview

Description

The reward_model configuration dict is a standardized data schema embedded in every training example that tells the verl training pipeline how to compute rewards for that example. It is stored as the "reward_model" field in each parquet row and is consumed by the reward computation subsystem during RL training.

The dict has a required "style" key that selects the reward computation strategy, and additional keys that vary by style:

Rule-based ("style": "rule") -- Used for tasks with deterministic correct answers (math problems, code evaluation). The "ground_truth" field contains the expected answer string. A custom reward function compares the model's output against this ground truth.

Model-based ("style": "model") -- Used for open-ended tasks (alignment, instruction following). A learned reward model scores the model's output. The "ground_truth" field may contain a reference response but is not directly compared.

Multiple-choice evaluation ("style": "model", "eval": "multiple_choice") -- Used for benchmarks like HellaSwag. The "choices" field contains a list of candidate completions, and "ground_truth" is the index of the correct choice. Evaluation uses log-likelihood comparison.
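
The key rules for the three variants above can be checked mechanically before a row is written. This is a minimal sketch and not part of verl: `validate_reward_config` is a hypothetical helper, and the rule that `"choices"` must accompany `"eval": "multiple_choice"` is inferred from the description above rather than taken from the source.

```python
def validate_reward_config(cfg: dict) -> None:
    """Raise ValueError if cfg breaks the schema described above (hypothetical helper)."""
    style = cfg.get("style")
    if style not in ("rule", "model"):
        raise ValueError(f"unknown style: {style!r}")
    if "ground_truth" not in cfg:
        raise ValueError("missing required key 'ground_truth'")

    if cfg.get("eval") == "multiple_choice":
        # Assumption: ground_truth must be a valid index into a non-empty choices list.
        choices = cfg.get("choices")
        if not isinstance(choices, list) or not choices:
            raise ValueError("multiple-choice configs need a non-empty 'choices' list")
        gold = cfg["ground_truth"]
        if isinstance(gold, bool) or not isinstance(gold, int):
            raise ValueError("'ground_truth' must be an integer index")
        if not 0 <= gold < len(choices):
            raise ValueError("'ground_truth' is out of range for 'choices'")
    elif style == "rule" and not isinstance(cfg["ground_truth"], str):
        raise ValueError("rule-based configs expect a string 'ground_truth'")
```

Running a check like this during preprocessing surfaces a malformed row before training starts, when it is cheapest to fix.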

Usage

This dict is constructed during data preprocessing and stored in parquet format. The training pipeline reads it to determine the appropriate reward function and ground truth for each example.
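
The preprocessing step can be sketched as a per-example mapper that attaches the reward config to each row. The names below (`make_map_fn`, the `question` and `answer` input fields, and the `demo/arithmetic` data source) are illustrative assumptions, and the parquet write itself is left out.

```python
def make_map_fn(split: str, data_source: str):
    """Return a mapper that turns one raw example into one training row."""
    def process_fn(example: dict, idx: int) -> dict:
        return {
            "data_source": data_source,
            "prompt": [{"role": "user", "content": example["question"]}],
            # The reward config travels with the row it describes.
            "reward_model": {"style": "rule", "ground_truth": example["answer"]},
            "extra_info": {"split": split, "index": idx},
        }
    return process_fn


# Hypothetical raw data; a real script would load a dataset here.
raw = [{"question": "What is 6 * 7?", "answer": "42"}]
mapper = make_map_fn("train", "demo/arithmetic")
rows = [mapper(example, idx) for idx, example in enumerate(raw)]
```

Each resulting row can then be written out with whatever parquet writer the preprocessing script already uses.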

Code Reference

Field	Value
Rule-based Source	`examples/data_preprocess/gsm8k.py`, Lines 72-73
Model-based Source	`examples/data_preprocess/full_hh_rlhf.py`, Lines 112-115
Multiple-choice Source	`examples/data_preprocess/hellaswag.py`, Lines 73-78
Pattern Type	Pure Python dict construction (no special import needed)

I/O Contract

Inputs

Field	Type	Required	Description
`style`	`str`	Yes	Reward computation strategy: `"rule"` or `"model"`.
`ground_truth`	`str` or `int`	Yes	The expected answer: an answer string for rule-based rewards, an integer index into `choices` for multiple-choice evaluation, or a reference response for model-based rewards.
`eval`	`str`	No	Evaluation method. Set to `"multiple_choice"` to select log-likelihood evaluation.
`choices`	`list[str]`	No	Candidate completions for multiple-choice evaluation. Expected whenever `eval` is `"multiple_choice"`, because `ground_truth` indexes into this list.
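
The same contract can be written down as a type so that static checkers flag missing keys. This is a minimal sketch using `typing.TypedDict`; the class names are hypothetical and do not appear in verl.

```python
from typing import List, TypedDict, Union


class _RequiredRewardKeys(TypedDict):
    style: str                     # "rule" or "model"
    ground_truth: Union[str, int]  # answer string, choice index, or reference response


class RewardModelConfig(_RequiredRewardKeys, total=False):
    # Optional keys, used only for multiple-choice evaluation.
    eval: str           # "multiple_choice"
    choices: List[str]  # candidate completions


rule_cfg: RewardModelConfig = {"style": "rule", "ground_truth": "42"}
mc_cfg: RewardModelConfig = {
    "style": "model",
    "eval": "multiple_choice",
    "ground_truth": 2,
    "choices": ["ending A", "ending B", "ending C"],
}
```

At runtime these are ordinary dicts, so the type adds no overhead and does not change how rows are stored.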

Outputs

The reward config dict is not a function; it is a data structure consumed by the reward computation pipeline. It determines:

Outcome	Description
Reward function selection	`"rule"` triggers rule-based comparison; `"model"` triggers reward model inference.
Ground truth availability	Provides the reference answer or response for comparison.
Evaluation method	Selects between generation-based and log-likelihood-based evaluation; `"eval": "multiple_choice"` selects the latter.
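
The three outcomes in the table can be read off a config in a few lines. The sketch below is illustrative only: `plan_reward` is a hypothetical helper, and the string labels it returns are descriptive names chosen for this example, not verl identifiers.

```python
def plan_reward(cfg: dict) -> dict:
    """Summarize how a reward config would be handled (hypothetical helper)."""
    if cfg["style"] == "rule":
        scorer = "rule_based_comparison"
    elif cfg["style"] == "model":
        scorer = "reward_model_inference"
    else:
        raise ValueError(f"unknown style: {cfg['style']!r}")

    # Assumption: any config without eval == "multiple_choice" is generation-based.
    if cfg.get("eval") == "multiple_choice":
        evaluation = "log_likelihood"
    else:
        evaluation = "generation"

    return {
        "scorer": scorer,
        "evaluation": evaluation,
        "ground_truth": cfg["ground_truth"],
    }
```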

Usage Examples

Rule-based reward (GSM8K math):

# From examples/data_preprocess/gsm8k.py, Lines 68-77
data = {
    "data_source": "openai/gsm8k",
    "prompt": [{"role": "user", "content": question}],
    "ability": "math",
    "reward_model": {
        "style": "rule",
        "ground_truth": "42",  # extracted numeric answer
    },
    "extra_info": {"split": "train", "index": 0},
}
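
The "extracted numeric answer" in the example above has to come from somewhere. A minimal sketch, assuming the GSM8K convention that each reference answer ends with a `#### <number>` line; `extract_final_answer` is a hypothetical name and the regular expression is one reasonable choice, not necessarily the one used in `gsm8k.py`.

```python
import re


def extract_final_answer(answer_text: str) -> str:
    """Return the number after the '#### ' marker, without thousands separators."""
    match = re.search(r"#### (-?[0-9.,]+)", answer_text)
    if match is None:
        raise ValueError("no '#### <number>' marker found")
    return match.group(1).replace(",", "")


raw_answer = "Each box holds 6 apples and there are 7 boxes, so 6 * 7 = 42.\n#### 42"
reward_model = {"style": "rule", "ground_truth": extract_final_answer(raw_answer)}
```

Keeping the ground truth as a string, as the schema does, avoids losing formatting such as leading zeros or decimals when the rule-based function compares answers.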

Model-based reward (HH-RLHF alignment):

# From examples/data_preprocess/full_hh_rlhf.py, Lines 108-117
data = {
    "data_source": "Dahoas/full-hh-rlhf",
    "prompt": [{"role": "user", "content": prompt}],
    "ability": "alignment",
    "reward_model": {
        "style": "model",
        "ground_truth": response,  # reference response (not directly compared)
    },
    "extra_info": {"split": "train", "index": idx},
}

Multiple-choice reward (HellaSwag):

# From examples/data_preprocess/hellaswag.py, Lines 69-80
choices = [preprocess(ending) for ending in doc["endings"]]
gold = int(doc["label"])

data = {
    "data_source": "Rowan/hellaswag",
    "prompt": [{"role": "user", "content": query}],
    "ability": "nlp",
    "reward_model": {
        "style": "model",
        "eval": "multiple_choice",   # use log-likelihood evaluation
        "ground_truth": gold,         # index of the correct choice
        "choices": choices,           # list of candidate completions
    },
    "extra_info": {"split": "train", "index": idx},
}
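
Log-likelihood comparison for a config like the one above can be sketched as follows: score every candidate, pick the most likely one, and compare its index with `ground_truth`. This is an illustration under stated assumptions, not verl's implementation. `multiple_choice_reward` is a hypothetical helper, the log-likelihood values are made up, and how they are computed (including any length normalization) is out of scope.

```python
from typing import Sequence


def multiple_choice_reward(choice_logprobs: Sequence[float], reward_model: dict) -> float:
    """Return 1.0 if the highest-likelihood choice is the gold index, else 0.0."""
    if len(choice_logprobs) != len(reward_model["choices"]):
        raise ValueError("need exactly one log-likelihood per choice")
    # Index of the maximum log-likelihood (first one wins on ties).
    predicted = max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])
    return 1.0 if predicted == reward_model["ground_truth"] else 0.0


example_cfg = {
    "style": "model",
    "eval": "multiple_choice",
    "ground_truth": 1,
    "choices": ["ending A", "ending B", "ending C", "ending D"],
}
# Hypothetical per-choice log-likelihoods from a language model.
score = multiple_choice_reward([-12.3, -4.1, -9.8, -15.0], example_cfg)
```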

MATH dataset rule-based reward:

# From examples/data_preprocess/math_dataset.py, Lines 71-76
data = {
    "data_source": "DigitalLearningGmbH/MATH-lighteval",
    "prompt": [{"role": "user", "content": question}],
    "ability": "math",
    "reward_model": {
        "style": "rule",
        "ground_truth": solution,  # extracted from \boxed{} notation
    },
    "extra_info": {"split": split, "index": idx},
}
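
Pulling the answer out of `\boxed{}` notation needs brace matching, because the boxed content can itself contain braces (for example `\frac{1}{2}`). A minimal sketch follows; `extract_last_boxed` is a hypothetical helper written for this page, not the function used in `math_dataset.py`.

```python
def extract_last_boxed(text: str) -> str:
    """Return the contents of the last \\boxed{...} in text, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        raise ValueError("no \\boxed{...} found")

    begin = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    raise ValueError("unbalanced braces after \\boxed{")


solution_text = r"Adding the fractions gives $\boxed{\frac{1}{2}}$."
reward_model = {"style": "rule", "ground_truth": extract_last_boxed(solution_text)}
```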

Related Pages

Principle:Volcengine_Verl_Reward_Configuration_Schema
