Principle:Unslothai Unsloth Reward Function Design

Knowledge Sources	DeepSeekMath Unsloth TRL Reward Functions
Domains	Reinforcement_Learning, NLP
Last Updated	2026-02-07 00:00 GMT

Overview

A design pattern for creating callable reward functions that score model-generated completions during reinforcement learning training to guide policy optimization.

Description

Reward function design is a central engineering challenge in RL-based language model training. Reward functions evaluate model completions and return scalar scores that determine which generation strategies are reinforced. In GRPO (Group Relative Policy Optimization), multiple reward functions can be combined, each scoring a different quality dimension.

Common reward function categories:

Correctness Rewards: Verify factual accuracy by comparing against ground-truth answers (e.g., math answers, code test cases).
Format Rewards: Check structural compliance (e.g., uses XML tags, follows CoT format, proper JSON output).
Length Rewards: Penalize or reward based on completion length to encourage conciseness or thoroughness.
Model-Based Rewards: Use a separate reward model to score quality (e.g., helpfulness, harmlessness).
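
As an illustration of the format and length categories above, the following is a minimal sketch. The `<reasoning>`/`<answer>` tag layout, the function names, and the `max_chars` threshold are assumptions chosen for the example; the source does not prescribe them.

```python
import re

# Hypothetical tag layout used only for this illustration.
FORMAT_PATTERN = re.compile(
    r"^<reasoning>.*?</reasoning>\s*<answer>.*?</answer>$", re.DOTALL
)


def format_reward(prompts, completions, **kwargs):
    """Return 1.0 when a completion follows the assumed tag layout, else 0.0."""
    return [1.0 if FORMAT_PATTERN.match(c.strip()) else 0.0 for c in completions]


def length_reward(prompts, completions, max_chars=200, **kwargs):
    """Return 0.0 up to max_chars, then a linear penalty that bottoms out at -1.0."""
    scores = []
    for completion in completions:
        excess = max(0, len(completion) - max_chars)
        scores.append(-min(1.0, excess / max_chars) if excess else 0.0)
    return scores
```

Both functions are pure string checks, so they are cheap to run on every completion and always give the same score for the same input.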

The key design constraints are:

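Deterministic and Fast: Reward functions run on every completion in every training batch, so they must be cheap to evaluate, and the same completion should always receive the same score.
Differentiability Not Required: Only the scalar reward value is used; no gradients flow through the reward function, so it can contain arbitrary non-differentiable logic such as string matching or test execution.
Composable: The trainer combines multiple reward functions as a sum, optionally weighted.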
Dataset Column Access: Reward functions receive extra keyword arguments matching dataset column names (e.g., answer, nums).

Usage

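Define reward functions before creating the GRPOTrainer. Each function must accept prompts and completions and return a list of float scores, one per completion. For plain-text datasets both arguments are lists of strings; in TRL, conversational datasets pass lists of message dictionaries instead. Additional dataset columns are passed as keyword arguments.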
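
A minimal sketch of a correctness reward that reads a dataset column through a keyword argument is shown here. It assumes a plain-text dataset with an `answer` column and uses a deliberately naive rule (the last whitespace-separated token is the prediction); both the column name and the extraction rule are illustrative assumptions.

```python
def correctness_reward(prompts, completions, answer, **kwargs):
    """Return 1.0 when the last token of a completion equals the reference answer.

    `answer` is assumed to be a dataset column, delivered as a list aligned
    with `completions`. Unused columns are absorbed by **kwargs.
    """
    scores = []
    for completion, reference in zip(completions, answer):
        tokens = completion.strip().split()
        predicted = tokens[-1] if tokens else ""  # naive extraction (assumption)
        scores.append(1.0 if predicted == str(reference).strip() else 0.0)
    return scores


scores = correctness_reward(
    prompts=["What is 2 + 2?", "What is 3 + 3?"],
    completions=["The answer is 4", "I think it is 7"],
    answer=["4", "6"],
)
```

In TRL, functions like this are passed to the trainer through its `reward_funcs` argument; the direct call above only demonstrates the calling convention.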

Theoretical Basis

The combined reward for a completion $o$ given prompt $q$ is:

$R(o, q) = \sum_{k=1}^{K} w_{k} \cdot r_{k}(o, q)$

where $r_{k}$ are the $K$ individual reward functions and $w_{k}$ are their weights (typically all 1.0).

# Abstract reward function interface
def reward_function(
    prompts: list[str],       # Input prompts
    completions: list[str],   # Model-generated completions
    **kwargs                  # Additional dataset columns
) -> list[float]:             # Scalar rewards, one per completion
    ...
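
The weighted sum $R(o, q)$ defined above, applied to functions that follow this interface, can be sketched as below. The helper `combine_rewards` and the two toy reward functions are hypothetical names for illustration only; the trainer performs this combination internally, and this is not its actual implementation.

```python
from typing import Callable, Sequence

RewardFunc = Callable[..., list[float]]


def combine_rewards(
    reward_funcs: Sequence[RewardFunc],
    weights: Sequence[float],
    prompts: list[str],
    completions: list[str],
    **columns,
) -> list[float]:
    """Compute R = sum_k w_k * r_k for every completion."""
    if len(reward_funcs) != len(weights):
        raise ValueError("need exactly one weight per reward function")
    totals = [0.0] * len(completions)
    for func, weight in zip(reward_funcs, weights):
        scores = func(prompts=prompts, completions=completions, **columns)
        if len(scores) != len(completions):
            raise ValueError("reward function must return one score per completion")
        for i, score in enumerate(scores):
            totals[i] += weight * score
    return totals


# Toy reward functions (hypothetical) used to exercise the helper.
def ends_with_period(prompts, completions, **kwargs):
    return [1.0 if c.endswith(".") else 0.0 for c in completions]


def is_short(prompts, completions, **kwargs):
    return [1.0 if len(c) <= 20 else 0.0 for c in completions]


totals = combine_rewards(
    [ends_with_period, is_short],
    [1.0, 0.5],
    prompts=["q1", "q2"],
    completions=["Short answer.", "A much longer answer without a final period"],
)
```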

Good reward function design balances:

Signal density: Avoid purely binary (0/1) rewards, which give a sparse learning signal; use partial credit where possible.
Reward scale: Keep rewards in a consistent range (e.g., [-1, 1] or [0, 1]).
Reward hacking prevention: Anticipate degenerate solutions the model might exploit.
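
The three considerations above can be combined in one function, sketched here under stated assumptions: answers are numeric, they appear inside a hypothetical `<answer>` tag, and the partial-credit formula (half credit scaled by relative error) is an arbitrary illustrative choice, not a recommendation from the source.

```python
import re

# Hypothetical answer tag; not prescribed by the source.
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def partial_credit_reward(prompts, completions, answer, **kwargs):
    """Score numeric answers in [0, 1]: 1.0 if exact, up to 0.5 for near misses."""
    scores = []
    for completion, reference in zip(completions, answer):
        candidates = ANSWER_TAG.findall(completion)
        # Hacking guard: several <answer> blocks would let the model hedge
        # across guesses, so anything other than exactly one block scores 0.
        if len(candidates) != 1:
            scores.append(0.0)
            continue
        try:
            predicted = float(candidates[0].strip())
            target = float(reference)
        except ValueError:
            scores.append(0.0)
            continue
        if predicted == target:
            scores.append(1.0)
        else:
            # Partial credit shrinks with relative error and is floored at 0,
            # which keeps every score inside the fixed [0, 1] range.
            rel_error = abs(predicted - target) / max(abs(target), 1.0)
            scores.append(max(0.0, 0.5 * (1.0 - rel_error)))
    return scores
```

Capping near misses at 0.5 keeps an exact answer strictly more rewarding than any approximation, so the denser signal does not dilute the incentive to be correct.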

Related Pages

Implemented By

Implementation:Unslothai_Unsloth_Reward_Function_Interface
