Principle:Volcengine Verl Rule Based Reward Computation

Knowledge Sources	DeepSeekMath: Pushing the Limits of Mathematical Reasoning; verl Reward Documentation
Domains	Reinforcement_Learning, Reward_Engineering, NLP
Last Updated	2026-02-07 14:00 GMT

Overview

A reward computation strategy that uses deterministic functions (regex matching, string comparison) rather than learned models to score generated responses against ground truth answers.

Description

Rule-Based Reward Computation provides reward signals for RL training by comparing generated responses to known ground truth using deterministic rules. This approach is used when tasks have verifiable answers (e.g., math problems, multiple-choice questions) and a learned reward model is unnecessary.

Advantages over model-based rewards:

Deterministic: Same input always produces the same reward (no noise from a reward model)
No additional GPU memory: Does not require loading a separate reward model
Faster: Simple string operations vs. neural network inference
Interpretable: Easy to debug why a response received a particular score

The reward manager in verl supports a plugin architecture: custom reward functions can be registered per dataset and are selected by the data_source field of each data sample.
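The per-dataset dispatch described above can be sketched as a small registry keyed by data_source. This is a minimal illustration, not verl's actual API: the names `register_reward`, `score_sample`, the `"toy/arithmetic"` data source, and the sample field layout are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a data_source string to a scoring function.
RewardFn = Callable[[str, str], float]
_REWARD_FNS: Dict[str, RewardFn] = {}


def register_reward(data_source: str) -> Callable[[RewardFn], RewardFn]:
    """Decorator that registers a reward function for one dataset."""
    def decorator(fn: RewardFn) -> RewardFn:
        _REWARD_FNS[data_source] = fn
        return fn
    return decorator


@register_reward("toy/arithmetic")  # illustrative data_source value
def arithmetic_reward(response: str, ground_truth: str) -> float:
    # Score 1.0 when the last whitespace-separated token equals the answer.
    tokens = response.split()
    return 1.0 if tokens and tokens[-1] == ground_truth else 0.0


def score_sample(sample: dict) -> float:
    """Look up the reward function by the sample's data_source and apply it."""
    source = sample["data_source"]
    fn = _REWARD_FNS.get(source)
    if fn is None:
        raise KeyError(f"no reward function registered for {source!r}")
    return fn(sample["response"], sample["ground_truth"])
```

Raising on an unknown data_source, rather than silently returning 0.0, makes a missing registration visible before it distorts training.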

Usage

Use rule-based rewards when:

The task has objectively verifiable answers (math, code execution, factual QA)
Ground truth is available in the training data
A reward model would be overkill or introduce unwanted noise

Configure via reward_model.style="rule" in the training config.

Theoretical Basis

Rule-based reward computation is a function:

$R(y, y^{*}) = f_{\text{rule}}(y, y^{*})$

Where:

$y$ is the generated response
$y^{*}$ is the ground truth answer
$f_{\text{rule}}$ is a deterministic function that returns a scalar reward

Common reward functions:

Exact match: $R = \mathbb{1}[\text{extract}(y) = y^{*}]$
Format bonus: Additional reward for following expected output format
Partial credit: Graded reward based on proximity to correct answer
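The three reward types listed above might look like the following in Python. This is a sketch under stated assumptions: the `<answer>...</answer>` output convention, the 0.1 bonus value, and the relative-error decay used for partial credit are illustrative choices, not taken from verl.

```python
import re

# Hypothetical output convention: the final answer is wrapped in <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def exact_match(response: str, ground_truth: str) -> float:
    """Indicator reward: 1.0 if the extracted answer equals the ground truth."""
    m = ANSWER_RE.search(response)
    return 1.0 if m and m.group(1) == ground_truth else 0.0


def format_bonus(response: str, bonus: float = 0.1) -> float:
    """Small bonus for using the expected tags, regardless of correctness."""
    return bonus if ANSWER_RE.search(response) else 0.0


def partial_credit(response: str, ground_truth: str) -> float:
    """Graded reward for numeric answers: decays with relative error, in [0, 1]."""
    m = ANSWER_RE.search(response)
    if not m:
        return 0.0
    try:
        predicted, target = float(m.group(1)), float(ground_truth)
    except ValueError:
        return 0.0  # non-numeric answers earn no partial credit
    rel_err = abs(predicted - target) / max(abs(target), 1e-8)
    return max(0.0, 1.0 - rel_err)
```

All three are pure functions of the response and the ground truth, so repeated calls on the same input return the same score.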

Pseudo-code:

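# Abstract rule-based reward (extract_answer is a dataset-specific helper, not defined here)
def compute_reward(response, ground_truth, data_source):
    # Extract the answer from the response using a dataset-specific regex
    extracted = extract_answer(response, data_source)
    # A failed extraction (None) scores 0.0; otherwise compare with the ground truth
    if extracted is None:
        return 0.0
    return 1.0 if extracted == ground_truth else 0.0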
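One way to make the pseudo-code runnable is to back `extract_answer` with a table of regexes keyed by data source. The keys `"hash_marker"` and `"boxed"` and both patterns are assumptions chosen for illustration; real datasets will need their own patterns and answer normalization.

```python
import re
from typing import Optional

# Illustrative patterns only; the keys and regexes are assumptions.
ANSWER_PATTERNS = {
    "hash_marker": re.compile(r"####\s*(-?[\d,]*\.?\d+)"),  # e.g. "#### 1,234"
    "boxed": re.compile(r"\\boxed\{([^{}]*)\}"),            # e.g. "\boxed{42}"
}


def extract_answer(response: str, data_source: str) -> Optional[str]:
    """Return the last answer matching the dataset's pattern, or None."""
    matches = ANSWER_PATTERNS[data_source].findall(response)
    if not matches:
        return None
    # Normalize: drop thousands separators and surrounding whitespace.
    return matches[-1].replace(",", "").strip()


def compute_reward(response: str, ground_truth: str, data_source: str) -> float:
    """Binary reward: 1.0 on an exact match of the extracted answer, else 0.0."""
    extracted = extract_answer(response, data_source)
    return 1.0 if extracted is not None and extracted == ground_truth else 0.0


print(compute_reward("3 + 4 = 7 apples.\n#### 7", "7", "hash_marker"))
print(compute_reward(r"So the result is \boxed{41}.", "42", "boxed"))
```

Taking the last match rather than the first is a design choice: responses that show intermediate results before the final answer are then scored on the final one.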

Related Pages

Implemented By

Implementation:Volcengine_Verl_RewardManager_Compute_Reward
