Heuristic:Huggingface Open r1 Reward Function Tuning

Knowledge Sources	Open R1, Kimi 1.5 Tech Report, DAPO Paper, Demystify Long CoT
Domains	Optimization, LLMs, Reinforcement_Learning
Last Updated	2026-02-08 00:00 GMT

Overview

Design heuristics for selecting, combining, and tuning reward functions in GRPO training to balance accuracy, format compliance, and length efficiency.

Description

Open-R1's GRPO training uses a registry of 14 reward functions that can be composed in arbitrary combinations. Each reward function captures a different signal: mathematical accuracy, formatting compliance, reasoning step count, length efficiency, repetition penalty, code correctness, and overlong punishment. The default reward combination is ["accuracy", "format", "tag_count"], but different tasks (math vs. code vs. competitive programming) require different reward combinations and tuning. This heuristic captures the tribal knowledge about which combinations work, what magic numbers are used, and why certain thresholds exist.

Usage

Use this heuristic when configuring GRPO training reward functions for a new task or tuning an existing training run. Apply when deciding which reward functions to combine, what thresholds to set for binary code scoring, and how to balance format compliance vs. accuracy.

The Insight (Rule of Thumb)

Action: Start with the default ["accuracy", "format", "tag_count"] for math tasks. For code tasks, use ["code", "code_format"] or ["binary_code", "code_format"].
Value: Key magic numbers in the codebase:
- Reasoning steps reward: divide the step count by 3 and cap the result at 1.0, so three or more reasoning steps earn full reward and fewer steps earn proportional partial reward.
- Binary code threshold: 0.99 (a test pass rate above 0.99 scores 1.0; anything at or below it scores 0.0).
- Cosine scaling defaults: rewards for correct answers range from 0.5 to 1.0, rewards for wrong answers range from 0.0 down to -0.5, and the max length is 1000 tokens.
- Repetition penalty: 3-gram default with max penalty of -1.0.
- Soft overlong punishment: starts penalizing at max_completion_len - soft_punish_cache (default: 16384 - 4096 = 12288 tokens).
Trade-off: More reward functions provide richer signal but increase compute cost per step. Format rewards can conflict with accuracy rewards on short completions.

Reasoning

The reward function design reflects several research papers:

Length reward (Kimi 1.5): The len_reward function implements the length-based reward from the Kimi 1.5 tech report, which discourages overthinking. It computes lambda = 0.5 - (length - min_len) / (max_len - min_len), then rewards correct answers with lambda and incorrect answers with min(0, lambda). Short correct answers therefore earn a bonus, short incorrect answers earn nothing, and long answers are penalized whether or not they are correct.
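The length formula above can be sketched in a few lines. This is an illustrative sketch, not the code in src/open_r1/rewards.py: the function name len_reward_sketch is hypothetical, min_len and max_len are assumed to be the shortest and longest completions in the group being scored, and returning 0.0 when all lengths are equal is an assumption made to avoid dividing by zero.

```python
def len_reward_sketch(lengths, is_correct):
    """Length-based reward for a group of completions (illustrative only)."""
    # Assumption: min/max are taken over the group being scored together.
    min_len, max_len = min(lengths), max(lengths)
    if max_len == min_len:
        # Assumption: identical lengths carry no length signal.
        return [0.0] * len(lengths)

    rewards = []
    for length, correct in zip(lengths, is_correct):
        lam = 0.5 - (length - min_len) / (max_len - min_len)
        # Correct answers receive lam; incorrect answers never receive a bonus.
        rewards.append(lam if correct else min(0.0, lam))
    return rewards
```

For lengths of 100, 200, and 300 with only the middle answer wrong, the sketch yields 0.5, 0.0, and -0.5: the longest answer is penalized even though it is correct.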

Repetition penalty (Demystify Long CoT): The N-gram repetition penalty from Appendix C.2 of the Demystify Long CoT paper computes (1 - unique_ngrams / total_ngrams) * max_penalty. This prevents the model from falling into repetitive loops during long chain-of-thought reasoning.
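A minimal sketch of the penalty formula above, assuming lowercase whitespace tokenization and a score of 0.0 for texts shorter than one n-gram. Both choices, and the name repetition_penalty_sketch, are assumptions for illustration rather than details taken from the codebase; the defaults mirror the 3-gram size and -1.0 max penalty listed earlier.

```python
def repetition_penalty_sketch(text, ngram_size=3, max_penalty=-1.0):
    """N-gram repetition penalty: 0.0 for no repeats, max_penalty at worst."""
    # Assumption: tokens are lowercase whitespace-separated words.
    words = text.lower().split()
    if len(words) < ngram_size:
        # Assumption: too short to form a single n-gram, so no penalty.
        return 0.0

    ngrams = [
        tuple(words[i:i + ngram_size])
        for i in range(len(words) - ngram_size + 1)
    ]
    # Fraction of n-grams that are repeats of an earlier n-gram.
    scaling = 1 - len(set(ngrams)) / len(ngrams)
    return scaling * max_penalty
```

A text with no repeated 3-grams scores zero, while a text that repeats a single word five times has three 3-grams of which one is unique, giving a penalty of about -0.67.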

Soft overlong punishment (DAPO): From Equation 13 of the DAPO paper, this provides a gradual penalty as completions approach the maximum length, rather than a hard cutoff.
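The gradual penalty can be sketched as a piecewise function. The sketch assumes a linear ramp from 0 at the soft threshold to -1 at the maximum length, with a flat -1 beyond it; the function name is hypothetical, and the defaults are the 16384 and 4096 values quoted earlier on this page.

```python
def soft_overlong_punishment_sketch(
    length, max_completion_len=16384, soft_punish_cache=4096
):
    """Piecewise length penalty (illustrative; assumes a linear ramp)."""
    threshold = max_completion_len - soft_punish_cache  # 12288 by default
    if length <= threshold:
        # Comfortably within budget: no penalty.
        return 0.0
    if length <= max_completion_len:
        # Inside the soft zone: penalty grows linearly from 0 toward -1.
        return (threshold - length) / soft_punish_cache
    # Beyond the maximum length: full penalty.
    return -1.0
```

With the defaults, a 12288-token completion is unpenalized, a 14336-token completion (halfway through the soft zone) scores -0.5, and anything at or past 16384 tokens scores -1.0.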

Magic number 3 for reasoning steps: The threshold of 3 steps for full reward was chosen empirically to encourage the model to show at least a minimal reasoning chain (e.g., "Step 1: ..., Step 2: ..., Step 3: ..."). It is not a hard minimum: completions with fewer steps still earn proportional partial reward.
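To make the normalization concrete, here is a small sketch that counts step markers and applies the min(1.0, count / 3) rule. The STEP_PATTERN regex is a simplified, hypothetical stand-in that matches only "Step N:" markers; the pattern used in the codebase is not shown on this page and may recognize other markers.

```python
import re

# Hypothetical, simplified pattern: matches only explicit "Step N:" markers.
STEP_PATTERN = re.compile(r"Step \d+:")


def reasoning_steps_reward_sketch(completions):
    """Partial reward below 3 steps, full reward at 3 or more."""
    counts = [len(STEP_PATTERN.findall(text)) for text in completions]
    # Magic number 3: count / 3 is capped at 1.0.
    return [min(1.0, count / 3) for count in counts]
```

Under this sketch, a completion with two marked steps earns 2/3 of the reward, one with four steps earns the full 1.0, and one with no markers earns 0.0.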

Binary threshold 0.99: The binary_code_reward uses 0.99 rather than 1.0 to account for floating-point precision in the test pass rate computation. This effectively means "all tests passed" while being robust to numerical edge cases.
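The floating-point concern is easy to reproduce. In the sketch below, binarize_pass_rate is a hypothetical helper, and summing ten per-test contributions of 0.1 is only an illustration of how a "perfect" pass rate can land just under 1.0; it is not a claim about how the pass rate is computed in the codebase.

```python
BINARY_THRESHOLD = 0.99


def binarize_pass_rate(pass_rate):
    """Map a fractional test pass rate to a binary reward (illustrative)."""
    return 1.0 if pass_rate > BINARY_THRESHOLD else 0.0


# Illustration only: ten passing tests, each contributing 0.1 to the rate.
accumulated = sum([0.1] * 10)

# Rounding error leaves the total slightly below 1.0, so a strict equality
# check would reject a run in which every test passed.
strict_equality_passes = accumulated == 1.0

# The 0.99 threshold still treats it as a full pass.
thresholded = binarize_pass_rate(accumulated)
```

A pass rate of 9 out of 10 still maps to 0.0 here, so in this example the threshold tolerates rounding error without rewarding a genuinely failed test.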

Code Evidence

Magic number 3 for reasoning steps from src/open_r1/rewards.py:128-129:

# Magic number 3 to encourage 3 steps and more, otherwise partial reward
return [min(1.0, count / 3) for count in matches]

Binary threshold from src/open_r1/rewards.py:499:

BINARY_THRESHOLD = 0.99

Default reward function combination from src/open_r1/configs.py:234-236:

reward_funcs: list[str] = field(
    default_factory=lambda: ["accuracy", "format", "tag_count"],
    ...
)
