Implementation:Volcengine Verl Reward Config Dict

From Leeroopedia


Field | Value
Knowledge Sources | verl source code, data preprocessing examples
Domains | Reward Configuration, Data Schema, Training Pipeline
Last Updated | 2026-02-07

Overview

Description

The reward_model configuration dict is a standardized data schema embedded in every training example that tells the verl training pipeline how to compute rewards for that example. It is stored as the "reward_model" field in each parquet row and is consumed by the reward computation subsystem during RL training.

The dict has a required "style" key that selects the reward computation strategy, and additional keys that vary by style:

  • Rule-based ("style": "rule") -- Used for tasks with deterministic correct answers (math problems, code evaluation). The "ground_truth" field contains the expected answer string. A custom reward function compares the model's output against this ground truth.
  • Model-based ("style": "model") -- Used for open-ended tasks (alignment, instruction following). A learned reward model scores the model's output. The "ground_truth" field may contain a reference response but is not directly compared.
  • Multiple-choice evaluation ("style": "model", "eval": "multiple_choice") -- Used for benchmarks like HellaSwag. The "choices" field contains a list of candidate completions, and "ground_truth" is the index of the correct choice. Evaluation uses log-likelihood comparison.
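The three variants above are distinguished only by the "style" key and the optional "eval" key. A minimal sketch of that distinction follows; the function name `classify_reward_config` is hypothetical and is not part of verl.

```python
def classify_reward_config(cfg: dict) -> str:
    """Name the variant a reward_model dict describes.

    Hypothetical helper for illustration; verl does not ship this function.
    """
    style = cfg["style"]
    if style == "rule":
        return "rule"
    if style == "model":
        # The multiple-choice variant is a "model" style with an extra "eval" key.
        if cfg.get("eval") == "multiple_choice":
            return "multiple_choice"
        return "model"
    raise ValueError(f"unknown style: {style!r}")


print(classify_reward_config({"style": "rule", "ground_truth": "42"}))
```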

Usage

This dict is constructed during data preprocessing and stored in parquet format. The training pipeline reads it to determine the appropriate reward function and ground truth for each example.
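A preprocessing step of this kind can be sketched as a per-example mapping function. The sketch below uses only the standard library and a toy in-memory record. It assumes GSM8K-style records whose "answer" field ends with "#### <answer>". In a real script, a function of this shape would typically be passed to the Hugging Face `datasets` method `Dataset.map(..., with_indices=True)` and the result written with `Dataset.to_parquet(...)`; that wiring is an assumption here and is omitted.

```python
def make_map_fn(split: str):
    """Return a function that turns one raw example into one training row."""

    def process_fn(example: dict, idx: int) -> dict:
        question = example["question"]
        # Assumption: the reference solution ends with "#### <answer>".
        answer = example["answer"].split("####")[-1].strip()
        return {
            "data_source": "openai/gsm8k",
            "prompt": [{"role": "user", "content": question}],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": answer},
            "extra_info": {"split": split, "index": idx},
        }

    return process_fn


# Toy record standing in for a dataset row.
raw = [{"question": "What is 6 * 7?", "answer": "6 * 7 = 42\n#### 42"}]
rows = [make_map_fn("train")(ex, i) for i, ex in enumerate(raw)]
print(rows[0]["reward_model"])
```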

Code Reference

Field | Value
Rule-based Source | examples/data_preprocess/gsm8k.py, Lines 72-73
Model-based Source | examples/data_preprocess/full_hh_rlhf.py, Lines 112-115
Multiple-choice Source | examples/data_preprocess/hellaswag.py, Lines 73-78
Pattern Type | Pure Python dict construction (no special import needed)

I/O Contract

Inputs

Field | Type | Required | Description
style | str | Yes | Reward computation strategy: "rule" or "model".
ground_truth | str or int | Yes | The expected answer. A string for rule-based, an index for multiple-choice, or a reference response for model-based.
eval | str | No | Evaluation method. Set to "multiple_choice" for log-likelihood evaluation.
choices | list[str] | No | Candidate completions for multiple-choice evaluation.
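The field table above can be turned into a small check that runs during preprocessing, before rows are written. This is a sketch under the assumption that the table is the complete contract; `validate_reward_config` is a hypothetical name, and verl itself may not enforce these rules.

```python
def validate_reward_config(cfg: dict) -> None:
    """Raise ValueError if cfg does not match the documented fields.

    Hypothetical helper; the rules mirror the field table, nothing more.
    """
    style = cfg.get("style")
    if style not in ("rule", "model"):
        raise ValueError(f"style must be 'rule' or 'model', got {style!r}")
    if "ground_truth" not in cfg:
        raise ValueError("ground_truth is required")

    gold = cfg["ground_truth"]
    if cfg.get("eval") == "multiple_choice":
        choices = cfg.get("choices")
        if not isinstance(choices, list) or not all(isinstance(c, str) for c in choices):
            raise ValueError("choices must be a list of strings")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(gold, bool) or not isinstance(gold, int):
            raise ValueError("ground_truth must be an integer index")
        if not 0 <= gold < len(choices):
            raise ValueError("ground_truth must index into choices")
    elif not isinstance(gold, str):
        raise ValueError("ground_truth must be a string for this style")


validate_reward_config({"style": "rule", "ground_truth": "42"})  # passes silently
```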

Outputs

The reward config dict is not a function; it is a data structure consumed by the reward computation pipeline. It determines:

Outcome | Description
Reward function selection | "rule" triggers rule-based comparison; "model" triggers reward model inference.
Ground truth availability | Provides the reference answer or response for comparison.
Evaluation method | Determines whether to use generation-based or log-likelihood-based evaluation.

Usage Examples

Rule-based reward (GSM8K math):

# From examples/data_preprocess/gsm8k.py, Lines 68-77
data = {
    "data_source": "openai/gsm8k",
    "prompt": [{"role": "user", "content": question}],
    "ability": "math",
    "reward_model": {
        "style": "rule",
        "ground_truth": "42",  # extracted numeric answer
    },
    "extra_info": {"split": "train", "index": 0},
}
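To illustrate how a rule-based comparison against "ground_truth" might look, here is a minimal sketch. It assumes the model is prompted to end its response with "#### <number>", and it scores 1.0 on an exact string match. The name `score_rule` and the regular expression are illustrative choices, not verl's actual reward function.

```python
import re

# Assumption: the final answer follows a "####" marker in the response.
_FINAL_ANSWER = re.compile(r"####\s*(-?[0-9.,]+)")


def score_rule(response: str, reward_model: dict) -> float:
    """Return 1.0 if the last '#### <number>' in response equals ground_truth."""
    if reward_model["style"] != "rule":
        raise ValueError("score_rule only handles style == 'rule'")
    matches = _FINAL_ANSWER.findall(response)
    if not matches:
        return 0.0  # no parsable final answer
    # Normalize thousands separators and a trailing sentence period.
    predicted = matches[-1].replace(",", "").rstrip(".")
    return 1.0 if predicted == reward_model["ground_truth"] else 0.0


cfg = {"style": "rule", "ground_truth": "42"}
print(score_rule("6 * 7 = 42\n#### 42", cfg))
```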

Model-based reward (HH-RLHF alignment):

# From examples/data_preprocess/full_hh_rlhf.py, Lines 108-117
data = {
    "data_source": "Dahoas/full-hh-rlhf",
    "prompt": [{"role": "user", "content": prompt}],
    "ability": "alignment",
    "reward_model": {
        "style": "model",
        "ground_truth": response,  # reference response (not directly compared)
    },
    "extra_info": {"split": "train", "index": idx},
}

Multiple-choice reward (HellaSwag):

# From examples/data_preprocess/hellaswag.py, Lines 69-80
choices = [preprocess(ending) for ending in doc["endings"]]
gold = int(doc["label"])

data = {
    "data_source": "Rowan/hellaswag",
    "prompt": [{"role": "user", "content": query}],
    "ability": "nlp",
    "reward_model": {
        "style": "model",
        "eval": "multiple_choice",   # use log-likelihood evaluation
        "ground_truth": gold,         # index of the correct choice
        "choices": choices,           # list of candidate completions
    },
    "extra_info": {"split": "train", "index": idx},
}
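Log-likelihood evaluation for a dict like this can be sketched as follows: the choice with the highest log-likelihood is taken as the prediction and compared with the "ground_truth" index. The per-choice scores are assumed to come from a language model that scores each candidate completion given the prompt; that step is not shown, and `score_multiple_choice` is a hypothetical name.

```python
def score_multiple_choice(choice_logprobs: list, reward_model: dict) -> float:
    """Return 1.0 if the most likely choice is the gold index, else 0.0.

    choice_logprobs[i] is the (assumed precomputed) log-likelihood of
    reward_model["choices"][i] given the prompt.
    """
    choices = reward_model["choices"]
    if len(choice_logprobs) != len(choices):
        raise ValueError("need exactly one log-likelihood per choice")
    # Index of the highest-scoring candidate.
    predicted = max(range(len(choices)), key=choice_logprobs.__getitem__)
    return 1.0 if predicted == reward_model["ground_truth"] else 0.0


mc_cfg = {
    "style": "model",
    "eval": "multiple_choice",
    "ground_truth": 1,
    "choices": ["sits down.", "catches the ball.", "flies away."],
}
print(score_multiple_choice([-12.3, -4.1, -9.8], mc_cfg))
```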

Rule-based reward (MATH dataset):

# From examples/data_preprocess/math_dataset.py, Lines 71-76
data = {
    "data_source": "DigitalLearningGmbH/MATH-lighteval",
    "prompt": [{"role": "user", "content": question}],
    "ability": "math",
    "reward_model": {
        "style": "rule",
        "ground_truth": solution,  # extracted from \boxed{} notation
    },
    "extra_info": {"split": split, "index": idx},
}
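The `\boxed{}` extraction mentioned in the comment can be sketched with simple brace matching. This is an illustrative stand-in, not verl's own helper: `last_boxed` is a hypothetical name, and it returns the contents of the last `\boxed{...}` in the text while handling nested braces such as `\frac{1}{2}`.

```python
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for pos in range(begin, len(text)):
        if text[pos] == "{":
            depth += 1
        elif text[pos] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:pos]
    return None  # unbalanced braces


print(last_boxed(r"The answer is \boxed{\frac{1}{2}}."))
```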
