Principle:OpenRLHF OpenRLHF Rule Based Reward Functions

Knowledge Sources	DeepSeek-R1, Let's Verify Step by Step
Domains	Reinforcement_Learning, Reward_Modeling
Last Updated	2026-02-07 00:00 GMT

Overview

A reward computation pattern that uses deterministic rules (answer matching, format checking) instead of learned reward models to score RL generations.

Description

Rule-based reward functions provide verifiable, deterministic reward signals for domains where correctness can be checked programmatically. For mathematical reasoning, the reward is typically binary: the final answer is extracted from the model output and compared with the ground truth, and the response is scored as correct or incorrect. This avoids training a reward model and mitigates reward hacking, since the signal is grounded in the reference answer rather than in a learned approximation of it.

The pattern requires: (1) answer extraction from model output, (2) answer normalization, (3) comparison with ground truth, and (4) reward assignment.
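The four steps above can be sketched as follows. This is a minimal illustration, not the OpenRLHF implementation: the function names (`extract_answer`, `normalize_answer`, `rule_based_reward`) are hypothetical, and the sketch assumes the model is prompted to put its final answer inside `\boxed{...}`, which is a common but not universal convention.

```python
import re
from typing import Optional


def extract_answer(output: str) -> Optional[str]:
    """Step 1: take the last \\boxed{...} span in the output (assumed answer format)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", output)
    return matches[-1] if matches else None


def normalize_answer(answer: str) -> str:
    """Step 2: remove cosmetic differences before comparing.

    Assumption: answers are plain numbers or short strings. Removing commas
    would be wrong for tuple-valued answers such as "(1, 2)".
    """
    answer = answer.strip().rstrip(".")
    answer = answer.replace(",", "").replace(" ", "")
    return answer.lower()


def rule_based_reward(output: str, ground_truth: str) -> float:
    """Steps 3 and 4: compare with the ground truth and assign a binary reward."""
    extracted = extract_answer(output)
    if extracted is None:
        # No parsable answer is scored the same as a wrong answer.
        return 0.0
    is_correct = normalize_answer(extracted) == normalize_answer(ground_truth)
    return 1.0 if is_correct else 0.0
```

A production checker would usually need stronger normalization (for example, treating `1/2` and `0.5` as equal), since an overly strict comparison turns correct responses into zero-reward samples.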

Usage

Use this pattern for domains whose answers can be verified programmatically (math, coding, logic). It is the reward mechanism for Math-GRPO training in OpenRLHF.

Theoretical Basis

Rule-based reward for math: $R(x, y) = \begin{cases} 1.0 & \text{if } \mathrm{extract}(y) = \mathrm{answer}(x) \\ 0.0 & \text{otherwise} \end{cases}$

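This creates a sparse binary reward signal. GRPO uses no critic, so there is no learned value function to supply a baseline: advantages are estimated by comparing the rewards of several responses sampled for the same prompt, and the policy gradient is driven by these rewards alone.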
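To illustrate how a binary reward turns into a learning signal without a critic, the sketch below normalizes rewards within a group of responses to one prompt, as in GRPO's group-relative advantage. The function name `group_relative_advantages` and the `eps` value are hypothetical, and the use of the population standard deviation is an assumption; implementations differ on this detail.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed group: correct responses get positive advantages, incorrect ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform group: every response is wrong, so all advantages are zero
# and this prompt contributes no gradient.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The uniform case shows the practical cost of sparsity: a prompt on which all sampled responses are wrong (or all are right) yields no advantage signal under this normalization.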

Related Pages

Implemented By

Implementation:OpenRLHF_OpenRLHF_Math_reward_func
