Principle:Unslothai Unsloth Reward Function Design
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, NLP |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A design pattern for creating callable reward functions that score model-generated completions during reinforcement learning training to guide policy optimization.
Description
Reward function design is the critical engineering challenge in RL-based language model training. Reward functions evaluate model completions and return scalar scores that determine which generation strategies are reinforced. In GRPO, multiple reward functions can be combined, each scoring a different quality dimension.
Common reward function categories:
- Correctness Rewards: Verify factual accuracy by comparing against ground-truth answers (e.g., math answers, code test cases).
- Format Rewards: Check structural compliance (e.g., uses XML tags, follows CoT format, proper JSON output).
- Length Rewards: Penalize or reward based on completion length to encourage conciseness or thoroughness.
- Model-Based Rewards: Use a separate reward model to score quality (e.g., helpfulness, harmlessness).
The key design constraints are:
- Deterministic and Fast: Reward functions run on every completion in every training batch. They must be efficient.
- Differentiable Not Required: Only the scalar reward is used, not gradients through the reward function.
- Composable: Multiple rewards are summed or weighted by the trainer.
- Dataset Column Access: Reward functions receive extra keyword arguments matching dataset column names (e.g., answer, nums).
Usage
Define reward functions before creating the GRPOTrainer. Each function must accept prompts and completions as lists of strings and return a list of float scores. Additional dataset columns are passed as keyword arguments.
Theoretical Basis
The combined reward for a completion given prompt :
Where are individual reward functions and are weights (typically all 1.0).
# Abstract reward function interface
def reward_function(
prompts: list[str], # Input prompts
completions: list[str], # Model-generated completions
**kwargs # Additional dataset columns
) -> list[float]: # Scalar rewards, one per completion
...
Good reward function design balances:
- Signal density: Avoid sparse rewards (0/1 only); use partial credit where possible.
- Reward scale: Keep rewards in a consistent range (e.g., [-1, 1] or [0, 1]).
- Reward hacking prevention: Anticipate degenerate solutions the model might exploit.