
Principle:Huggingface Trl Reward Function Definition

From Leeroopedia
Revision as of 17:24, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Huggingface_Trl_Reward_Function_Definition.md)


Property | Value
Principle Name | Reward Function Definition
Library | Huggingface TRL
Category | Reward Engineering / Online RL
Paper | DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Related Papers | DAPO, DeepSeek-R1

Overview

Description

In online reinforcement learning for language models, the reward function is the central mechanism for specifying desired behavior. Rather than providing explicit labels or demonstrations, the practitioner defines callable scoring functions that evaluate model-generated completions and return numerical rewards. These rewards drive the policy gradient update: in GRPO, each completion's reward is compared with the rewards of the other completions sampled for the same prompt, so completions that score above their group average receive positive advantages and are reinforced, while those that score below it are suppressed.

The Reward Function Definition principle covers the design and implementation of reward functions in TRL's GRPO training pipeline. TRL provides a modular reward system where multiple reward functions can be composed together, each contributing a score that is weighted and aggregated into a single advantage signal.

Usage

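Reward functions in TRL follow a consistent callable interface. They receive the following as keyword arguments, so a function that does not use all of them should accept **kwargs: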

  • prompts: The original prompt texts
  • completions: The model-generated completion texts (conversational format with role/content dicts, or plain strings)
  • completion_ids: Token ID lists for each completion
  • Any additional columns from the dataset as keyword arguments (e.g., solution for ground-truth answers)
  • trainer_state: The current TrainerState for dynamic reward shaping

They return a list with one entry per completion. Each entry is a float, or None to indicate that the reward does not apply to that sample (useful for multi-task training).
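
The interface above can be sketched in plain Python. This is an illustrative example, not a function shipped with TRL: the name contains_solution_reward and the task dataset column are hypothetical, and solution is assumed to be a dataset column holding the expected answer as a string.

```python
def _text(completion):
    """Return the completion text for plain-string and conversational formats."""
    if isinstance(completion, str):
        return completion
    # Conversational format: a list of {"role": ..., "content": ...} dicts.
    return completion[-1]["content"]


def contains_solution_reward(prompts, completions, solution, task=None, **kwargs):
    """Hypothetical reward: 1.0 if the expected answer appears in the completion.

    `solution` and `task` are assumed dataset columns. Unused arguments such as
    `completion_ids` and `trainer_state` are absorbed by **kwargs.
    """
    if task is None:
        task = ["qa"] * len(completions)
    rewards = []
    for completion, sol, t in zip(completions, solution, task):
        if t != "qa":
            rewards.append(None)  # reward does not apply to this sample
        else:
            rewards.append(1.0 if sol in _text(completion) else 0.0)
    return rewards
```

A function of this shape would typically be passed to GRPOTrainer through its reward_funcs argument, alone or in a list with other reward functions.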

Theoretical Basis

Reward shaping is a well-established technique in reinforcement learning. In the context of LLM training, effective reward functions typically combine multiple signals:

Accuracy Verification: The most fundamental reward checks whether the model's answer matches a known ground truth. For mathematical reasoning, TRL provides accuracy_reward which uses the math_verify library to parse both the gold solution and the model's answer from LaTeX notation, then verifies mathematical equivalence. This is more robust than string matching since it handles equivalent representations (e.g., \frac{1}{2} and 0.5). When the gold solution cannot be parsed, the reward returns None to skip the example rather than assigning an incorrect score.
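
To illustrate why equivalence checking beats string matching, here is a simplified stand-in that does not use math_verify. It assumes plain-string completions that contain only the answer, and it compares values with Python's fractions module instead of parsing LaTeX. The function names are hypothetical.

```python
from fractions import Fraction


def _parse_number(text):
    """Parse '1/2', '0.5', or '3' into an exact Fraction; return None on failure."""
    try:
        return Fraction(text.strip())
    except (ValueError, ZeroDivisionError):
        return None


def numeric_equivalence_reward(completions, solution, **kwargs):
    """Stand-in for an accuracy reward: compares values, not strings."""
    rewards = []
    for completion, sol in zip(completions, solution):
        gold = _parse_number(sol)
        if gold is None:
            rewards.append(None)  # unparseable gold answer: skip the sample
            continue
        answer = _parse_number(completion)
        rewards.append(1.0 if answer is not None and answer == gold else 0.0)
    return rewards
```

With a gold answer of "1/2", the completions "0.5" and "2/4" both score 1.0 even though neither matches the gold string, while an unparseable gold answer yields None rather than a misleading 0.0.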

Reasoning Structure Verification: The reasoning_accuracy_reward variant first strips reasoning content (delimited by tags like <think>...</think>) and only verifies the final answer. Completions that lack the closing delimiter receive a reward of 0.0 rather than None, actively penalizing incomplete reasoning chains.
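
The strip-then-verify behavior can be sketched as follows. This is an assumption-laden illustration, not TRL's implementation: it splits on the last closing tag, uses exact string comparison in place of mathematical verification, and the function names are hypothetical.

```python
def strip_reasoning(text, closing_tag="</think>"):
    """Return the text after the last closing tag, or None if the tag is absent."""
    if closing_tag not in text:
        return None
    return text.rsplit(closing_tag, 1)[1].strip()


def final_answer_reward(completions, solution, **kwargs):
    """Score only the final answer; unfinished reasoning gets 0.0, not None."""
    rewards = []
    for completion, sol in zip(completions, solution):
        answer = strip_reasoning(completion)
        if answer is None:
            rewards.append(0.0)  # reasoning was never closed: penalize
        else:
            # Exact match is a placeholder for a real equivalence check.
            rewards.append(1.0 if answer == sol.strip() else 0.0)
    return rewards
```

Returning 0.0 instead of None matters because a None entry is excluded from the reward signal, whereas 0.0 pulls the advantage of a truncated completion below that of completed ones in the same group.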

Format Compliance: The think_format_reward checks that completions follow the expected <think>...</think> structure using a regex pattern. This encourages the model to learn the chain-of-thought format alongside correct answers.
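
A regex-based format check of this kind might look like the sketch below. The pattern is an assumption chosen for illustration and may differ from the one TRL uses; it requires the completion to open with <think>, contain no second opening tag, and include a closing </think>. Completions are assumed to be plain strings.

```python
import re

# Assumed pattern: starts with <think>, no nested <think>, then a closing tag.
THINK_PATTERN = re.compile(r"^<think>(?!.*<think>).*?</think>.*$", re.DOTALL)


def format_reward(completions, **kwargs):
    """Return 1.0 for completions that follow the <think>...</think> layout."""
    return [1.0 if THINK_PATTERN.match(c) else 0.0 for c in completions]
```

Because the check ignores the answer itself, it is normally combined with an accuracy reward rather than used alone.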

Length Penalties: The get_soft_overlong_punishment factory function creates a reward that penalizes completions as they approach and exceed a maximum length. Following Equation 13 from the DAPO paper, the reward is 0.0 for completions shorter than the maximum length minus a cache length, decreases linearly from 0.0 to -1.0 across the cache region just below the maximum, and stays at -1.0 for completions beyond the maximum.
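
The piecewise penalty can be written out as a small factory. This is a sketch of the formula rather than TRL's code: the names make_soft_overlong_penalty, max_len, and cache_len are hypothetical, and length is measured as the number of token IDs in each completion.

```python
def make_soft_overlong_penalty(max_len, cache_len):
    """Build a length-penalty reward with a linear ramp before the hard maximum."""
    if not 0 < cache_len <= max_len:
        raise ValueError("cache_len must be in (0, max_len]")

    def penalty(completion_ids, **kwargs):
        rewards = []
        for ids in completion_ids:
            n = len(ids)
            if n <= max_len - cache_len:
                rewards.append(0.0)  # comfortably short: no penalty
            elif n <= max_len:
                # Linear ramp from 0.0 down to -1.0 inside the cache region.
                rewards.append((max_len - cache_len - n) / cache_len)
            else:
                rewards.append(-1.0)  # over the hard maximum
        return rewards

    return penalty
```

With max_len=100 and cache_len=20, a 90-token completion sits halfway through the cache region and receives -0.5.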

Multi-Reward Composition: When multiple reward functions are used, their outputs are combined using configurable weights (reward_weights) and aggregation strategies (multi_objective_aggregation). The "sum_then_normalize" strategy first sums weighted rewards and then normalizes advantages within groups, while "normalize_then_sum" (from the GDPO paper) normalizes each reward function independently before summing.
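
The difference between the two strategies can be seen in a simplified sketch for a single group of completions. It assumes population standard deviation, a small epsilon for stability, and no None entries; TRL's actual implementation may differ in these details, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def _normalize(values, eps=1e-4):
    """Shift to zero mean and scale by the standard deviation within the group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def sum_then_normalize(rewards_per_func, weights):
    """Weighted sum across reward functions, then normalize once."""
    totals = [
        sum(w * r for w, r in zip(weights, row)) for row in zip(*rewards_per_func)
    ]
    return _normalize(totals)


def normalize_then_sum(rewards_per_func, weights):
    """Normalize each reward function separately, then take the weighted sum."""
    normalized = [_normalize(rs) for rs in rewards_per_func]
    return [sum(w * r for w, r in zip(weights, row)) for row in zip(*normalized)]


# One group of four completions scored by two reward functions.
accuracy = [1.0, 0.0, 1.0, 0.0]   # large scale
formatting = [0.1, 0.1, 0.0, 0.0]  # small scale
adv_stn = sum_then_normalize([accuracy, formatting], weights=[1.0, 1.0])
adv_nts = normalize_then_sum([accuracy, formatting], weights=[1.0, 1.0])
```

In this example the first strategy lets the larger-scale accuracy reward dominate the ranking, while the second puts both rewards on a common scale, so a completion that wins on one and loses on the other ends up with an advantage close to zero.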
