Implementation:Unslothai Unsloth Reward Function Interface

Revision as of 17:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Unslothai_Unsloth_Reward_Function_Interface.md)


Knowledge Sources
Domains: Reinforcement_Learning, NLP
Last Updated: 2026-02-07 00:00 GMT

Overview

Interface specification for user-defined reward functions in GRPO reinforcement learning training.

Description

Reward functions are user-defined Python callables that score model completions to guide reinforcement learning. This page is a pattern document: there is no library API to document, so it instead defines the interface contract that user-defined functions must satisfy to be compatible with TRL's GRPOTrainer and Unsloth's RL pipeline.

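GRPOTrainer calls each reward function with a batch of prompts and the matching batch of completions (two lists of equal length), and passes any additional columns from the training dataset as keyword arguments, one list per column. Because the trainer passes everything by keyword, including **kwargs in the signature keeps a function compatible with columns it does not use.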
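
The calling convention can be pictured with a small stand-in for the trainer. The sketch below is not TRL code: `call_reward_func`, `exact_match_reward`, and the sample rows are hypothetical, and the only assumption is that every dataset column other than the prompt arrives as a keyword argument holding one value per completion.

```python
def call_reward_func(reward_func, rows, completions):
    """Hypothetical stand-in for the trainer's call; not part of TRL."""
    prompts = [row["prompt"] for row in rows]
    # Every non-prompt column becomes a keyword argument (one value per row).
    extra_columns = {
        key: [row[key] for row in rows]
        for key in rows[0]
        if key != "prompt"
    }
    return reward_func(prompts=prompts, completions=completions, **extra_columns)


def exact_match_reward(prompts, completions, answer, **kwargs):
    """Toy reward: 1.0 when the completion equals the expected answer."""
    return [1.0 if c.strip() == a else 0.0 for c, a in zip(completions, answer)]


rows = [
    {"prompt": "What is 6 * 7?", "answer": "42"},
    {"prompt": "What is 3 + 4?", "answer": "7"},
]
rewards = call_reward_func(exact_match_reward, rows, ["42", "8"])
print(rewards)  # [1.0, 0.0]
```

Naming `answer` as an explicit parameter, as the toy reward does, makes the dependency on that dataset column visible, while `**kwargs` absorbs anything else the caller supplies.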

Usage

Define one or more reward functions matching the interface below. Pass them as a list to GRPOTrainer(reward_funcs=[...]). Each function is called during every training step to score rollout completions.

Interface Specification

def reward_function(
    prompts: list[str],       # Input prompts from the dataset
    completions: list[str],   # Model-generated completions
    **kwargs                  # Additional dataset columns (e.g., answer, nums)
) -> list[float]:             # Reward scores, one per completion
    """
    Scores model completions for reinforcement learning.

    Args:
        prompts: List of input prompts (batch).
        completions: List of model-generated completions (batch).
        **kwargs: Additional keyword arguments from dataset columns.
            For example, if the dataset has an "answer" column,
            it will be passed as answer=["42", "7", ...].

    Returns:
        List of float rewards, one per completion.
        Higher values indicate better completions.
        Typical range: [0.0, 1.0] or [-1.0, 1.0].
    """

I/O Contract

Inputs

Name | Type | Required | Description
prompts | list[str] | Yes | Batch of input prompts
completions | list[str] | Yes | Batch of model-generated completions
**kwargs | varies | No | Additional dataset columns passed by name

Outputs

Name | Type | Description
rewards | list[float] | Scalar reward scores, one per completion in the batch
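
A reward function can be checked against this contract before training starts. The following is a minimal sketch of such a pre-flight check; `check_reward_contract` and the two toy functions are hypothetical names, and the check is deliberately stricter than a trainer needs to be (it rejects non-finite values and anything that is not a plain list).

```python
import math


def check_reward_contract(reward_func, prompts, completions, **columns):
    """Hypothetical pre-flight check; raises if the I/O contract is violated."""
    rewards = reward_func(prompts=prompts, completions=completions, **columns)
    if not isinstance(rewards, list):
        raise TypeError(f"expected a list, got {type(rewards).__name__}")
    if len(rewards) != len(completions):
        raise ValueError(
            f"expected {len(completions)} rewards, got {len(rewards)}"
        )
    for index, reward in enumerate(rewards):
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(reward, bool) or not isinstance(reward, (int, float)):
            raise TypeError(f"reward {index} is not a number: {reward!r}")
        if not math.isfinite(reward):
            raise ValueError(f"reward {index} is not finite: {reward!r}")
    return [float(reward) for reward in rewards]


def constant_reward(prompts, completions, **kwargs):
    return [0.5 for _ in completions]


def broken_reward(prompts, completions, **kwargs):
    return [1.0]  # wrong length whenever the batch has more than one item


checked = check_reward_contract(constant_reward, ["p1", "p2"], ["c1", "c2"])
print(checked)  # [0.5, 0.5]

try:
    check_reward_contract(broken_reward, ["p1", "p2"], ["c1", "c2"])
except ValueError as error:
    print(f"contract violation: {error}")
```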

Example Implementations

Correctness Reward (Math)

import re

def correctness_reward(prompts, completions, answer, **kwargs):
    """Checks if the model's final answer matches the expected answer."""
    rewards = []
    for completion, expected in zip(completions, answer):
        # Extract answer from \boxed{} format
        match = re.search(r"\\boxed\{(.+?)\}", completion)
        if match and match.group(1).strip() == str(expected).strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards
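
The regular expression in `correctness_reward` uses a non-greedy match, so it stops at the first closing brace; an answer such as `\boxed{\frac{1}{2}}` is captured only as `\frac{1`. Where nested braces are expected, a brace-counting extractor is one alternative. The sketch below is an illustration, not part of the page's reference code, and `extract_boxed` is a hypothetical helper name.

```python
import re


def extract_boxed(text):
    """Return the contents of the first \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.find(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 1  # the opening brace of \boxed{ is already consumed
    for index in range(content_start, len(text)):
        char = text[index]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:index]
    return None  # braces never balanced


sample = r"So the result is \boxed{\frac{1}{2}}."
regex_capture = re.search(r"\\boxed\{(.+?)\}", sample).group(1)
print(regex_capture)          # \frac{1
print(extract_boxed(sample))  # \frac{1}{2}
```

Swapping `extract_boxed` into the reward function would replace the `re.search` call and the `match.group(1)` access; the comparison against `expected` stays the same.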

Format Reward (XML Tags)

def format_reward(prompts, completions, **kwargs):
    """Rewards completions that use proper XML-style reasoning tags."""
    rewards = []
    for completion in completions:
        score = 0.0
        if "<reasoning>" in completion and "</reasoning>" in completion:
            score += 0.5
        if "<answer>" in completion and "</answer>" in completion:
            score += 0.5
        rewards.append(score)
    return rewards
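
The substring checks in `format_reward` also give full credit when the tags appear in the wrong order or wrap nothing. If the intended layout is exactly one reasoning block followed by one answer block (an assumption, since the page does not fix the layout), a stricter variant can be sketched with a single regular expression. `strict_format_reward` and `STRICT_FORMAT` are hypothetical names.

```python
import re

# Assumed layout: one <reasoning> block, then one <answer> block, nothing else.
STRICT_FORMAT = re.compile(
    r"\s*<reasoning>.+?</reasoning>\s*<answer>.+?</answer>\s*",
    re.DOTALL,  # let "." span newlines inside the blocks
)


def strict_format_reward(prompts, completions, **kwargs):
    """Rewards only completions that match the full assumed layout."""
    return [1.0 if STRICT_FORMAT.fullmatch(c) else 0.0 for c in completions]


completions = [
    "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>4</answer>\n",
    "<answer>4</answer><reasoning>2 + 2 = 4</reasoning>",  # wrong order
    "<reasoning></reasoning><answer>4</answer>",            # empty reasoning
]
print(strict_format_reward(prompts=["q"] * 3, completions=completions))
# [1.0, 0.0, 0.0]
```

The all-or-nothing score gives a sparser signal than the partial credit in `format_reward`; the two can be passed to the trainer together.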

Length Penalty Reward

def length_reward(prompts, completions, **kwargs):
    """Penalizes very short or very long completions."""
    rewards = []
    for completion in completions:
        length = len(completion.split())
        if length < 10:
            rewards.append(-0.5)
        elif length > 500:
            rewards.append(-0.2)
        else:
            rewards.append(0.0)
    return rewards

Combining Multiple Rewards

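from trl import GRPOConfig, GRPOTrainer

# `model` and `dataset` are assumed to be loaded earlier in the script.
# The output directory name is illustrative.
config = GRPOConfig(output_dir="outputs")

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[
        correctness_reward,  # Primary signal
        format_reward,       # Structural compliance
        length_reward,       # Length regularization
    ],
    args=config,
    train_dataset=dataset,
)
# Each reward function is called independently; the trainer combines the
# per-function scores as a weighted sum (equal weights unless
# `reward_weights` is set in GRPOConfig).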
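
How per-function scores become a single reward per completion can be sketched in plain Python. The helper `combine_rewards` and the two toy reward functions are hypothetical; the sketch assumes the combination is a weighted sum with one weight per function and equal weights when none are given, which is the behavior TRL exposes through the `reward_weights` option of `GRPOConfig`.

```python
def combine_rewards(reward_funcs, prompts, completions, weights=None, **columns):
    """Hypothetical illustration of a weighted sum over reward functions."""
    if weights is None:
        weights = [1.0] * len(reward_funcs)
    if len(weights) != len(reward_funcs):
        raise ValueError("need exactly one weight per reward function")
    # One list of scores per function, each aligned with `completions`.
    per_function = [
        func(prompts=prompts, completions=completions, **columns)
        for func in reward_funcs
    ]
    # Transpose so each tuple holds every function's score for one completion.
    return [
        sum(weight * score for weight, score in zip(weights, scores))
        for scores in zip(*per_function)
    ]


def has_digit_reward(prompts, completions, **kwargs):
    return [1.0 if any(ch.isdigit() for ch in c) else 0.0 for c in completions]


def too_short_penalty(prompts, completions, **kwargs):
    return [-0.5 if len(c.split()) < 2 else 0.0 for c in completions]


funcs = [has_digit_reward, too_short_penalty]
completions = ["42", "no number here"]
equal = combine_rewards(funcs, ["q1", "q2"], completions)
weighted = combine_rewards(funcs, ["q1", "q2"], completions, weights=[2.0, 1.0])
print(equal)     # [0.5, 0.0]
print(weighted)  # [1.5, 0.0]
```

Because the scores are added, the relative scale of each function matters: a penalty in the range of the length reward shifts the total only slightly next to a 0-or-1 correctness signal.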

Code Reference

Source Location

  • Repository: unsloth
  • File: Pattern (user code); reference examples in tests/saving/language_models/test_save_merged_grpo_model.py (L1-825)

Import

# No import needed — user defines these functions directly
# Then passes to GRPOTrainer:
from trl import GRPOTrainer

Related Pages

Implements Principle
