Implementation:OpenRLHF OpenRLHF Math reward func

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Reward_Modeling
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool, provided by OpenRLHF, for computing rule-based math rewards from model-generated solutions.

Description

The math_reward_func function (or its equivalent in the math reward example script) extracts an answer from each model-generated solution, compares it to the ground-truth label, and returns a binary reward: 1.0 for a match, 0.0 otherwise. It supports answer extraction from the LaTeX \boxed{...} format and from various number formats.

This is a Pattern Doc: users implement their own reward functions following this interface.
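The extraction step described above can be sketched as follows. This is a minimal illustration, not OpenRLHF code: the helper names extract_boxed and extract_answer are hypothetical, and the fallback to the last number in the text is an assumed heuristic. Brace matching is used instead of a regular expression so that nested LaTeX such as \boxed{\frac{1}{2}} is captured whole.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # braces never balanced, treat as no answer


def extract_answer(text: str) -> Optional[str]:
    """Hypothetical helper: prefer a boxed answer, else the last number."""
    boxed = extract_boxed(text)
    if boxed is not None:
        return boxed
    # Assumed fallback: integers or decimals, optionally with thousands commas.
    numbers = re.findall(r"-?\d+(?:,\d{3})*(?:\.\d+)?", text)
    return numbers[-1] if numbers else None
```

Taking the last boxed expression rather than the first is a design choice: solutions often box intermediate results before the final one.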

Usage

Used as the reward function in Math-GRPO training. Users define their own function matching this interface.

Code Reference

Source Location

  • Repository: OpenRLHF
  • File: examples/scripts/train_ppo_llama_ray_math.sh (reference)

Interface Specification

def math_reward_func(
    queries: list[str],       # Input prompts
    responses: list[str],     # Generated responses
    labels: list[str],        # Ground truth answers
) -> list[float]:
    """
    Compute rewards for math problem solutions.

    Args:
        queries: List of math problem prompts (not used by the scoring below)
        responses: List of model-generated solutions
        labels: List of correct answers

    Returns:
        List of float rewards (typically 0.0 or 1.0), one per response
    """
    # extract_answer and normalize are placeholders for user-supplied helpers;
    # extract_answer is assumed to return None when no answer is found.
    rewards = []
    for response, label in zip(responses, labels):
        answer = extract_answer(response)  # Extract from \boxed{...} etc.
        # Normalize both sides so formatting differences do not cause mismatches.
        if answer is not None and normalize(answer) == normalize(label):
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards
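The interface above leaves normalize undefined. One possible sketch, under stated assumptions, canonicalises numeric answers with Python's fractions.Fraction so that "0.50", "1/2" and \frac{1}{2} compare equal. The function body is hypothetical and not taken from OpenRLHF. It assumes commas are thousands separators, so it is unsuitable for tuple-style answers such as (1,2).

```python
import re
from fractions import Fraction


def normalize(answer: str) -> str:
    """Hypothetical helper: map equivalent numeric strings to one form."""
    s = answer.strip().strip("$").strip()
    # Assumption: commas are thousands separators, spaces are insignificant.
    s = s.replace(",", "").replace(" ", "")
    # Rewrite \frac{a}{b} or \dfrac{a}{b} with integer parts as a/b.
    s = re.sub(r"\\d?frac\{(-?\d+)\}\{(\d+)\}", r"\1/\2", s)
    try:
        # Fraction parses integers, decimals and a/b, then reduces them.
        return str(Fraction(s))
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers (e.g. "x+1") are compared as cleaned strings.
        return s
```

With this helper, normalize("0.50") and normalize(r"\frac{1}{2}") both yield "1/2", while symbolic answers fall back to plain string comparison.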

I/O Contract

Inputs

Name       Type       Required  Description
queries    List[str]  Yes       Math problem prompts
responses  List[str]  Yes       Generated solutions
labels     List[str]  Yes       Ground truth answers

Outputs

Name     Type         Description
rewards  List[float]  One reward per response, typically binary (0.0 or 1.0)
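The contract above implies that all three inputs have the same length and that one reward comes back per response. Because zip silently truncates to the shortest input, a mismatch would otherwise go unnoticed. The wrapper below is a hedged sketch of how that contract could be enforced; the name checked and the RewardFn alias are illustrative and not part of OpenRLHF.

```python
from typing import Callable, List

RewardFn = Callable[[List[str], List[str], List[str]], List[float]]


def checked(reward_fn: RewardFn) -> RewardFn:
    """Wrap a reward function with input and output length checks."""

    def wrapper(queries: List[str], responses: List[str],
                labels: List[str]) -> List[float]:
        if not (len(queries) == len(responses) == len(labels)):
            raise ValueError("queries, responses and labels must have equal length")
        rewards = reward_fn(queries, responses, labels)
        if len(rewards) != len(responses):
            raise ValueError("reward function must return one reward per response")
        # Coerce to plain floats so ints or bools still satisfy List[float].
        return [float(r) for r in rewards]

    return wrapper
```

A user-defined function can then be wrapped once, for example safe_reward = checked(my_reward), and used in place of the original.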

Usage Examples

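# User-defined math reward function
import re

def my_math_reward(queries, responses, labels):
    # queries is part of the interface but is not needed for exact matching.
    rewards = []
    for resp, label in zip(responses, labels):
        # Extract answer from the first \boxed{...}.
        # Note: the non-greedy pattern stops at the first '}', so nested
        # braces such as \boxed{\frac{1}{2}} are truncated and will not match.
        match = re.search(r'\\boxed\{(.+?)\}', resp)
        if match and match.group(1).strip() == label.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards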
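To see how the example behaves, the sketch below repeats its logic in compact form and calls it on four made-up samples. The prompts, responses and labels are invented for illustration. The last sample shows a limitation of the simple pattern: a nested-brace answer is truncated at the first closing brace and therefore scores 0.0 even though it is correct.

```python
import re


def my_math_reward(queries, responses, labels):
    rewards = []
    for resp, label in zip(responses, labels):
        match = re.search(r'\\boxed\{(.+?)\}', resp)
        ok = bool(match) and match.group(1).strip() == label.strip()
        rewards.append(1.0 if ok else 0.0)
    return rewards


# Made-up samples for illustration only.
queries = ["What is 2 + 2?", "What is 3 + 5?", "What is 4 + 5?", "What is 1/4 + 1/4?"]
responses = [
    r"The sum is \boxed{4}",          # correct and boxed
    r"I think it is \boxed{ 7 }",     # boxed but wrong
    "The answer is 9",                # correct but not boxed
    r"That gives \boxed{\frac{1}{2}}",  # correct, but nested braces
]
labels = ["4", "8", "9", r"\frac{1}{2}"]

rewards = my_math_reward(queries, responses, labels)
print(rewards)  # [1.0, 0.0, 0.0, 0.0]
```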

Related Pages

Implements Principle
