Principle:OpenRLHF OpenRLHF Rule Based Reward Functions
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Reward_Modeling |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A reward computation pattern that uses deterministic rules (answer matching, format checking) instead of learned reward models to score RL generations.
Description
Rule-Based Reward Functions provide verifiable, deterministic reward signals for domains where correctness can be checked programmatically. For mathematical reasoning, rewards are typically binary: the answer is extracted from the model output, compared with the ground truth, and scored as correct or incorrect. This avoids training a reward model and mitigates reward hacking, since the signal is grounded in the reference answer rather than in a learned approximation of it.
The pattern requires: (1) answer extraction from model output, (2) answer normalization, (3) comparison with ground truth, and (4) reward assignment.
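The four steps can be sketched in a few lines of Python. This is a minimal illustration, not OpenRLHF's actual reward code: the function names are hypothetical, the assumption that the final answer appears inside `\boxed{...}` is a common convention rather than something the source states, and the normalization is deliberately simplistic.

```python
from typing import Optional


def extract_answer(text: str) -> Optional[str]:
    """Step 1: return the contents of the last \\boxed{...}, or None.

    Assumes (hypothetically) that the model writes its final answer in \\boxed{}.
    Braces are matched by depth so nested LaTeX such as \\frac{1}{2} survives.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for c in text[start + len(marker):]:
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(c)
    return None  # unbalanced braces: treat as no answer


def normalize_answer(answer: str) -> str:
    """Step 2: crude normalization (whitespace, '$', trailing period, case)."""
    cleaned = "".join(answer.split()).replace("$", "")
    return cleaned.rstrip(".").lower()


def rule_based_reward(output: str, ground_truth: str) -> float:
    """Steps 3 and 4: compare with the ground truth and assign a binary reward."""
    extracted = extract_answer(output)
    if extracted is None:
        return 0.0
    matches = normalize_answer(extracted) == normalize_answer(ground_truth)
    return 1.0 if matches else 0.0
```

String equality after normalization will miss mathematically equivalent forms (for example `0.5` versus `\frac{1}{2}`); a production checker would likely need a symbolic or numeric comparison at step 3.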
Usage
Use for domains with verifiable answers (math, coding, logic). This is the reward mechanism for Math-GRPO training in OpenRLHF.
Theoretical Basis
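Rule-based reward for math: given a model output y and a ground-truth answer a*, the reward is r(y, a*) = 1 if the normalized answer extracted from y equals the normalized a*, and r(y, a*) = 0 otherwise.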
This creates a sparse binary reward signal. Because GRPO has no critic and estimates advantages from the rewards of a group of sampled responses, this reward is the only source of the policy-gradient learning signal.
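To make the interaction with GRPO concrete, the sketch below turns a group of binary rewards for one prompt into group-relative advantages by subtracting the group mean and dividing by the group standard deviation. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; implementations differ on these details, and this is not taken from OpenRLHF's code.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of sampled responses.

    eps (an assumed constant) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed group: correct responses get positive advantage, incorrect ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform group: every advantage is zero, so this prompt contributes no gradient.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The uniform case shows why sparsity matters: when every sampled response for a prompt is wrong (or every one is right), the binary reward carries no relative information for that group.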