Principle:Volcengine Verl Rule Based Reward Computation
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Reward_Engineering, NLP |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
A reward computation strategy that uses deterministic functions (regex matching, string comparison) rather than learned models to score generated responses against ground truth answers.
Description
Rule-Based Reward Computation provides reward signals for RL training by comparing generated responses to known ground truth using deterministic rules. This approach is used when tasks have verifiable answers (e.g., math problems, multiple-choice questions) and a learned reward model is unnecessary.
Advantages over model-based rewards:
- Deterministic: Same input always produces the same reward (no noise from a reward model)
- No additional GPU memory: Does not require loading a separate reward model
- Faster: Simple string operations vs. neural network inference
- Interpretable: Easy to debug why a response received a particular score
The reward manager in verl supports a plugin architecture where custom reward functions can be registered per dataset via the data_source field in the data.
Usage
Use rule-based rewards when:
- The task has objectively verifiable answers (math, code execution, factual QA)
- Ground truth is available in the training data
- A reward model would be overkill or introduce unwanted noise
Configure via reward_model.style="rule" in the training config.
Theoretical Basis
Rule-based reward computation is a function:
Where:
- is the generated response
- is the ground truth answer
- is a deterministic function that returns a scalar reward
Common reward functions:
- Exact match:
- Format bonus: Additional reward for following expected output format
- Partial credit: Graded reward based on proximity to correct answer
Pseudo-code:
# Abstract rule-based reward
def compute_reward(response, ground_truth, data_source):
# Extract answer from response using dataset-specific regex
extracted = extract_answer(response, data_source)
# Compare with ground truth
if extracted == ground_truth:
return 1.0
else:
return 0.0