Implementation:Huggingface Open r1 Get Reward Funcs

From Leeroopedia


Metadata

Field Value
Source Repo (https://github.com/huggingface/open-r1)
Domains Reinforcement_Learning, NLP
Last Updated 2026-02-08 00:00 GMT

Overview

Concrete tool for resolving and configuring reward functions from a named registry provided by Open-R1.

Description

The get_reward_funcs function uses the REWARD_FUNCS_REGISTRY dictionary to resolve reward function names (strings) to callables. The registry contains 13 reward functions:

Name | Type | Description
accuracy | Direct function | Mathematical correctness verification via symbolic parsing
format | Direct function | Checks proper use of think/answer XML tags
reasoning_steps | Direct function | Scores presence of step-by-step reasoning structure
cosine | Factory-generated | Cosine-scaled length reward; shorter correct answers score higher
repetition_penalty | Factory-generated | N-gram diversity penalty; penalizes repetitive outputs
length | Direct function | Raw length-based reward
code | Partial application | Execution-based code correctness scoring
binary_code | Partial application | Binary pass/fail code execution scoring
ioi_code | Partial application | IOI-style competitive programming code scoring
cf_code | Partial application | Codeforces-style competitive programming code scoring
code_format | Factory-generated | Checks code output formatting conventions
tag_count | Direct function | Counts and scores proper tag usage
soft_overlong_punishment | Factory-generated | Soft penalty for outputs exceeding length threshold (DAPO)

Factory-generated functions (cosine, repetition_penalty, code_format, soft_overlong_punishment) use parameters from GRPOScriptArguments to configure their behavior at initialization time. Reward functions accept completions: list[list[dict]] and optional kwargs from dataset columns, returning list[float|None]. A return value of None signals that the sample should be skipped.
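
The three registration styles and the call contract described above can be illustrated with a small self-contained sketch. All names here (toy_format_reward, make_length_cap_reward, toy_keyword_reward) are hypothetical stand-ins, not Open-R1 functions; the sketch only mirrors the shape of a direct function, a factory-generated closure, and a partial application, each taking completions plus keyword arguments and returning one score (or None) per sample.

```python
from functools import partial
from typing import Callable, Optional

Completions = list[list[dict]]
RewardFn = Callable[..., list[Optional[float]]]


def toy_format_reward(completions: Completions, **kwargs) -> list[Optional[float]]:
    """Direct function: 1.0 if the reply contains both tags, else 0.0."""
    texts = [c[0]["content"] for c in completions]
    return [1.0 if "<think>" in t and "<answer>" in t else 0.0 for t in texts]


def make_length_cap_reward(max_chars: int) -> RewardFn:
    """Factory: captures max_chars in a closure once, at setup time."""

    def length_cap_reward(completions: Completions, **kwargs) -> list[Optional[float]]:
        return [0.0 if len(c[0]["content"]) <= max_chars else -1.0 for c in completions]

    return length_cap_reward


def toy_keyword_reward(completions, keyword, solution=None, **kwargs):
    """Meant for partial application: keyword is bound once up front."""
    scores = []
    for i, c in enumerate(completions):
        if solution is None or solution[i] is None:
            scores.append(None)  # no reference for this sample: signal "skip"
        else:
            scores.append(1.0 if keyword in c[0]["content"] else 0.0)
    return scores


completions = [
    [{"role": "assistant", "content": "<think>2+2</think><answer>4</answer>"}],
    [{"role": "assistant", "content": "no tags here"}],
]

direct = toy_format_reward
factory_made = make_length_cap_reward(max_chars=20)
partially_applied = partial(toy_keyword_reward, keyword="answer")

format_scores = direct(completions=completions)
length_scores = factory_made(completions=completions)
# Extra keyword arguments (here: solution) stand in for dataset columns.
keyword_scores = partially_applied(completions=completions, solution=["4", None])
```

After configuration, all three objects are called the same way, which is what lets a trainer treat the resolved list uniformly.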

Usage

Import get_reward_funcs when setting up GRPO training to resolve the reward function names in the config into callable objects. The function reads the reward_funcs list from script_args and returns the corresponding configured callables, in the same order, ready to pass to the GRPO trainer.

Code Reference

Source

Field Value
Repository open-r1
File src/open_r1/rewards.py
Lines L646-706

Signature

def get_reward_funcs(script_args) -> list[Callable]:
    REWARD_FUNCS_REGISTRY = {
        "accuracy": accuracy_reward,
        "format": format_reward,
        "reasoning_steps": reasoning_steps_reward,
        "cosine": get_cosine_scaled_reward(...),
        "repetition_penalty": get_repetition_penalty_reward(...),
        "length": len_reward,
        "code": partial(code_reward, ...),
        "binary_code": partial(binary_code_reward, ...),
        "ioi_code": partial(ioi_code_reward, ...),
        "cf_code": partial(cf_code_reward, ...),
        "code_format": get_code_format_reward(...),
        "tag_count": tag_count_reward,
        "soft_overlong_punishment": get_soft_overlong_punishment(...),
    }
    reward_funcs = [REWARD_FUNCS_REGISTRY[func] for func in script_args.reward_funcs]
    return reward_funcs
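
As the signature above is written, the registry dictionary is built in full on every call (so each factory runs even when its reward is not requested), and an unrecognized name surfaces as a bare KeyError from the dictionary lookup. The following is a minimal sketch of the same name-to-callable resolution with a more descriptive error. It is not Open-R1 code: resolve_reward_funcs, always_one, always_zero, and toy_registry are hypothetical names used only for illustration.

```python
from typing import Callable


def resolve_reward_funcs(names: list[str], registry: dict[str, Callable]) -> list[Callable]:
    """Map names to callables, preserving the order of `names`."""
    unknown = [name for name in names if name not in registry]
    if unknown:
        raise ValueError(
            f"Unknown reward function(s) {unknown}; available: {sorted(registry)}"
        )
    return [registry[name] for name in names]


def always_one(completions, **kwargs):
    """Toy reward: every sample scores 1.0."""
    return [1.0 for _ in completions]


def always_zero(completions, **kwargs):
    """Toy reward: every sample scores 0.0."""
    return [0.0 for _ in completions]


toy_registry = {"one": always_one, "zero": always_zero}

# The output order follows the requested names, not the registry order.
resolved = resolve_reward_funcs(["zero", "one"], toy_registry)
```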

Import

from open_r1.rewards import get_reward_funcs

I/O Contract

Inputs

Parameter Type Required Description
script_args GRPOScriptArguments Yes Training script arguments containing reward_funcs (list of strings naming reward functions) plus configuration parameters for factory functions: cosine scaling bounds, repetition penalty n-gram size and max penalty, code execution settings, overlong punishment thresholds

Outputs

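Type Description
list[Callable] Configured reward callables, one per name in script_args.reward_funcs and in the same order. Each accepts completions plus optional keyword arguments and returns list[float|None].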

Usage Examples

from dataclasses import dataclass, field

from open_r1.rewards import get_reward_funcs


# Illustrative stand-in for the real GRPOScriptArguments that ships with Open-R1.
# The field names and defaults below are not verified against the library. If
# get_reward_funcs reads an attribute this stand-in lacks, it raises
# AttributeError, so use the real class in actual training code.
@dataclass
class GRPOScriptArguments:
    reward_funcs: list[str] = field(default_factory=lambda: ["accuracy", "format"])
    cosine_min_len_value_wrong: float = 0.0
    cosine_max_len_value_wrong: float = -0.5
    cosine_min_len_value_correct: float = 1.0
    cosine_max_len_value_correct: float = 0.5
    cosine_min_len: int = 50
    cosine_max_len: int = 4000
    repetition_n_grams: int = 3
    repetition_max_penalty: float = -1.0
    code_language: str = "python"
    soft_overlong_max_length: int = 4096
    soft_overlong_penalty_scale: float = 1.0

# Example 1: Basic accuracy + format reward setup
script_args = GRPOScriptArguments(reward_funcs=["accuracy", "format"])
reward_funcs = get_reward_funcs(script_args)
# reward_funcs is now [accuracy_reward, format_reward]

# Example 2: Full multi-reward configuration
script_args = GRPOScriptArguments(
    reward_funcs=["accuracy", "format", "cosine", "repetition_penalty", "soft_overlong_punishment"]
)
reward_funcs = get_reward_funcs(script_args)
# reward_funcs contains 5 configured callables, in the order requested

# Example 3: Using resolved reward functions
# One completion (a list of chat messages) per sample; solution is a dataset column.
completions = [[{"role": "assistant", "content": "<think>Step 1...</think><answer>42</answer>"}]]
for reward_fn in reward_funcs:
    scores = reward_fn(completions=completions, solution=["42"])
    print(scores)  # one list per reward function, holding one float (or None) per sample
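
Because each reward function returns one score per sample and may return None for a sample it cannot score, a consumer has to decide how to combine the per-function lists. The sketch below shows one possible approach, a weighted sum that ignores None entries. It is an assumption made for illustration, not a description of how the GRPO trainer aggregates rewards; combine_rewards and its weights parameter are hypothetical.

```python
from typing import Optional, Sequence


def combine_rewards(
    per_func_scores: Sequence[Sequence[Optional[float]]],
    weights: Optional[Sequence[float]] = None,
) -> list[Optional[float]]:
    """Weighted per-sample sum over reward functions, skipping None scores.

    A sample whose scores are all None stays None so the caller can drop it.
    """
    if weights is None:
        weights = [1.0] * len(per_func_scores)
    if len(weights) != len(per_func_scores):
        raise ValueError("need exactly one weight per reward function")

    totals: list[Optional[float]] = []
    # zip(*...) regroups the scores from "per function" to "per sample".
    for sample_scores in zip(*per_func_scores):
        used = [(w, s) for w, s in zip(weights, sample_scores) if s is not None]
        totals.append(sum(w * s for w, s in used) if used else None)
    return totals


scores_a = [1.0, 0.0, None]   # e.g. output of a first reward function
scores_b = [0.5, None, None]  # e.g. output of a second reward function
totals = combine_rewards([scores_a, scores_b], weights=[1.0, 2.0])
```

Here the third sample has no usable score from either function, so it remains None rather than being counted as zero.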
