Principle: AllenAI open-instruct Reward Verification
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning Evaluation |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Reward verification is the process of computing reward signals for RL training by comparing model-generated outputs against ground truth labels using domain-specific verifier functions.
Description
In Reinforcement Learning from Verifiable Rewards (RLVR), the reward signal does not come from a learned reward model (as in RLHF) but from deterministic verification against ground truth. This approach is applicable to domains where correctness can be objectively assessed:
- Mathematics: Compare the model's final answer against the ground truth using mathematical equivalence checking. This handles different representations (e.g., "1/2" vs "0.5" vs "\frac{1}{2}").
- Code: Execute the model's generated code against test cases and check whether all tests pass.
- Instruction following: Verify that the response satisfies explicit constraints (e.g., "respond in exactly 3 sentences", "include the word 'hello'").
- LLM judge: Use a separate language model (e.g., GPT-4o-mini) to evaluate the quality of the response against the ground truth.
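As a concrete illustration of the mathematics case, the sketch below checks equivalence by parsing both answers as rational numbers, so "1/2", "0.5", and "\frac{1}{2}" all compare equal. This is a minimal stand-in, not the library's actual checker: real math verifiers use fuller symbolic parsing, and the helper names here are hypothetical.

```python
from fractions import Fraction

def math_verify(prediction: str, ground_truth: str) -> bool:
    """Toy equivalence check across answer representations."""
    def to_fraction(s: str) -> Fraction:
        s = s.strip()
        # Unwrap a simple LaTeX fraction like \frac{1}{2}.
        if s.startswith(r"\frac{") and s.endswith("}"):
            num, den = s[len(r"\frac{"):-1].split("}{")
            return Fraction(int(num), int(den))
        # Fraction's string constructor handles both '1/2' and '0.5'.
        return Fraction(s)
    try:
        return to_fraction(prediction) == to_fraction(ground_truth)
    except (ValueError, ZeroDivisionError):
        return False
```

A full implementation would also normalize units, tolerate surrounding text, and handle symbolic expressions beyond rationals.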
The verifier pattern follows a consistent interface:
- Each verifier is a callable that takes a tokenized prediction, decoded prediction, ground truth label, and optionally a query.
- It returns a reward score (typically 0 for incorrect, and a configurable positive value for correct).
- Verifiers are registered by name and mapped to datasets via the verifier_source field in each example.
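The interface above can be sketched as a small registry that dispatches on verifier_source. The registry, result type, and function names below are illustrative assumptions, not the library's actual API; only the call signature (tokenized prediction, decoded prediction, label, optional query) follows the description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class VerifierResult:
    score: float  # 0.0 for incorrect, a configurable positive value for correct

VerifierFn = Callable[..., VerifierResult]
VERIFIER_REGISTRY: Dict[str, VerifierFn] = {}

def register_verifier(name: str):
    """Register a verifier under a name so datasets can reference it."""
    def wrap(fn: VerifierFn) -> VerifierFn:
        VERIFIER_REGISTRY[name] = fn
        return fn
    return wrap

@register_verifier("string_match")
def string_match_verifier(tokenized_prediction: List[int],
                          prediction: str,
                          label: str,
                          query: Optional[str] = None,
                          reward: float = 10.0) -> VerifierResult:
    # Exact-match placeholder standing in for a real domain verifier.
    return VerifierResult(reward if prediction.strip() == label.strip() else 0.0)

def verify(example: dict, tokenized_prediction: List[int], prediction: str) -> float:
    """Dispatch to the verifier named by the example's verifier_source field."""
    fn = VERIFIER_REGISTRY[example["verifier_source"]]
    return fn(tokenized_prediction, prediction,
              example["label"], example.get("query")).score
```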
Additionally, the system supports:
- Format rewards: Bonus rewards for responses that follow a specific format (e.g., <think>...</think><answer>...</answer>), encouraging structured reasoning.
- Non-stop penalties: Negative rewards for responses that exceed the maximum length without generating a stop token.
- Reward remapping: Redirecting one dataset's verifier to use another's implementation.
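The first two auxiliary rewards above can be sketched as follows. The regex, default reward values, and function names are assumptions for illustration; the actual format template and magnitudes are configurable in practice.

```python
import re

# Accepts a <think>...</think><answer>...</answer> layout; DOTALL lets
# the reasoning span multiple lines.
THINK_ANSWER = re.compile(r"^<think>.*</think>\s*<answer>.*</answer>$", re.DOTALL)

def format_reward(response: str, bonus: float = 1.0) -> float:
    """Bonus reward when the response follows the expected structure."""
    return bonus if THINK_ANSWER.match(response.strip()) else 0.0

def non_stop_penalty(finished_with_stop_token: bool, penalty: float = -10.0) -> float:
    """Negative reward when generation hit max length without a stop token."""
    return 0.0 if finished_with_stop_token else penalty
```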
Usage
Reward verification is invoked after each generation rollout, before advantages are computed. It is a core component of the GRPO pipeline that directly shapes the learning signal. The choice of verifiers and reward scales has a significant impact on training dynamics.
Theoretical Basis
The reward function in GRPO must satisfy several properties for stable training:
Binary reward signal: Most verifiers produce binary rewards (correct/incorrect). The magnitude is controlled by verification_reward (default: 10.0). Binary rewards simplify advantage computation:
For a group of K completions for the same prompt:
scores = [verifier(completion_k) for k in range(K)]
mean_score = mean(scores)
std_score = std(scores)
advantages = (scores - mean_score) / (std_score + epsilon)
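The pseudocode above can be made runnable with the standard library. This sketch uses population standard deviation and a small epsilon, matching the formula; the epsilon value is an assumption.

```python
import statistics

def group_advantages(scores, epsilon: float = 1e-6):
    """Group-normalized advantages for K completions of one prompt.
    epsilon guards the division when every completion scores the same."""
    mean_score = statistics.fmean(scores)
    std_score = statistics.pstdev(scores)
    return [(s - mean_score) / (std_score + epsilon) for s in scores]
```

With binary rewards, the advantages within a group always sum to zero, and an all-correct or all-incorrect group yields all-zero advantages.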
Verifier composability: Multiple reward signals can be combined additively:
total_reward = 0
if apply_verifiable_reward:
    total_reward += verification_reward * is_correct(response, ground_truth)
if apply_format_reward and additive_format_reward:
    total_reward += format_reward * has_correct_format(response)
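The composition above can be wrapped into a single function. Flag names follow the pseudocode; the default magnitudes (10.0 for verification, 1.0 for format) are illustrative.

```python
def total_reward(response_correct: bool,
                 format_ok: bool,
                 apply_verifiable_reward: bool = True,
                 apply_format_reward: bool = True,
                 additive_format_reward: bool = True,
                 verification_reward: float = 10.0,
                 format_bonus: float = 1.0) -> float:
    """Additively combine the verification and format reward signals."""
    reward = 0.0
    if apply_verifiable_reward:
        reward += verification_reward * float(response_correct)
    if apply_format_reward and additive_format_reward:
        reward += format_bonus * float(format_ok)
    return reward
```

Keeping the format bonus small relative to the verification reward prevents the policy from optimizing structure at the expense of correctness.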
Filtering zero-variance groups: When all completions in a group receive the same reward (standard deviation = 0), the advantages are all zero. These groups contribute no gradient signal and can be filtered to save training compute. However, this filtering must be disabled when num_samples_per_prompt_rollout=1 (REINFORCE mode).
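The filtering logic can be sketched as below; the function name is hypothetical, and each group is represented simply as the list of its completion scores.

```python
def filter_zero_variance_groups(groups, num_samples_per_prompt_rollout: int):
    """Drop groups whose completions all received the same reward,
    since their advantages are identically zero and contribute no
    gradient. Skipped in REINFORCE mode (one sample per prompt),
    where every single-element 'group' would otherwise be discarded."""
    if num_samples_per_prompt_rollout == 1:
        return groups
    return [g for g in groups if max(g) != min(g)]
```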