
Principle:Allenai Open instruct Reward Verification

From Leeroopedia


Knowledge Sources
Domains: Reinforcement Learning, Evaluation
Last Updated: 2026-02-07 00:00 GMT

Overview

Reward verification is the process of computing reward signals for RL training by comparing model-generated outputs against ground truth labels using domain-specific verifier functions.

Description

In Reinforcement Learning from Verifiable Rewards (RLVR), the reward signal does not come from a learned reward model (as in RLHF) but from deterministic verification against ground truth. This approach is applicable to domains where correctness can be objectively assessed:

  • Mathematics: Compare the model's final answer against the ground truth using mathematical equivalence checking. This handles different representations (e.g., "1/2" vs "0.5" vs "\frac{1}{2}").
  • Code: Execute the model's generated code against test cases and check whether all tests pass.
  • Instruction following: Verify that the response satisfies explicit constraints (e.g., "respond in exactly 3 sentences", "include the word 'hello'").
  • LLM judge: Use a separate language model (e.g., GPT-4o-mini) to evaluate the quality of the response against the ground truth.
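To make the mathematics case concrete, here is a minimal, illustrative equivalence check. It is not open-instruct's actual verifier (real checkers handle full LaTeX and symbolic equivalence, e.g. via sympy); the function name and the toy normalization are assumptions for illustration.

```python
import re

def naive_math_verifier(prediction: str, ground_truth: str) -> float:
    """Toy math verifier: compare final answers after light normalization,
    so that "1/2", "0.5", and "\\frac{1}{2}" all compare equal."""
    def normalize(ans: str) -> str:
        ans = ans.strip().lower()
        ans = ans.replace(r"\frac{1}{2}", "1/2")  # toy LaTeX handling only
        # Canonicalize simple fractions and decimals to one numeric form.
        m = re.fullmatch(r"(-?\d+)\s*/\s*(-?\d+)", ans)
        if m:
            return str(float(m.group(1)) / float(m.group(2)))
        try:
            return str(float(ans))
        except ValueError:
            return ans
    return 1.0 if normalize(prediction) == normalize(ground_truth) else 0.0
```

A production verifier would also extract the final answer from the full response (e.g. from a `\boxed{}` span) before comparing.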

The verifier pattern follows a consistent interface:

  1. Each verifier is a callable that takes a tokenized prediction, decoded prediction, ground truth label, and optionally a query.
  2. It returns a reward score (typically 0 for incorrect, and a configurable positive value for correct).
  3. Verifiers are registered by name and mapped to datasets via the verifier_source field in each example.
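The three-step pattern above can be sketched as a registry plus a dispatch function. The registry name, decorator, and `score` helper below are hypothetical, chosen to mirror the described interface rather than open-instruct's exact API:

```python
from typing import Callable, Optional

# Hypothetical registry mapping verifier names to callables.
VERIFIER_REGISTRY: dict[str, Callable[..., float]] = {}

def register_verifier(name: str):
    def decorator(fn):
        VERIFIER_REGISTRY[name] = fn
        return fn
    return decorator

@register_verifier("string_match")
def string_match_verifier(
    tokenized_prediction: list[int],
    prediction: str,
    label: str,
    query: Optional[str] = None,
) -> float:
    # Returns 0.0 / 1.0; the trainer scales this by verification_reward.
    return 1.0 if prediction.strip() == label.strip() else 0.0

def score(example: dict, tokenized: list[int], decoded: str) -> float:
    # Dispatch on the example's verifier_source field.
    verifier = VERIFIER_REGISTRY[example["verifier_source"]]
    return verifier(tokenized, decoded, example["label"], example.get("query"))
```

Keeping the signature uniform across verifiers is what makes per-dataset dispatch via `verifier_source` possible without special-casing each domain.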

Additionally, the system supports:

  • Format rewards: Bonus rewards for responses that follow a specific format (e.g., <think>...</think><answer>...</answer>), encouraging structured reasoning.
  • Non-stop penalties: Negative rewards for responses that exceed the maximum length without generating a stop token.
  • Reward remapping: Redirecting one dataset's verifier to use another's implementation.
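The first two auxiliary signals can be sketched as follows. The constants and function names are illustrative stand-ins for the configurable arguments described above, not the exact flags:

```python
import re

FORMAT_REWARD = 1.0            # assumed bonus magnitude
NON_STOP_PENALTY_VALUE = -10.0 # assumed penalty magnitude

THINK_ANSWER_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)

def format_bonus(response: str) -> float:
    """Bonus if the response matches <think>...</think><answer>...</answer>."""
    return FORMAT_REWARD if THINK_ANSWER_PATTERN.match(response.strip()) else 0.0

def non_stop_penalty(finish_reason: str) -> float:
    """Penalize rollouts truncated at max length without a stop token."""
    return NON_STOP_PENALTY_VALUE if finish_reason == "length" else 0.0
```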

Usage

Reward verification is invoked after each generation rollout, before advantages are computed. It is a core component of the GRPO pipeline that directly shapes the learning signal. The choice of verifiers and reward scales has a significant impact on training dynamics.

Theoretical Basis

The reward function in GRPO must satisfy several properties for stable training:

Binary reward signal: Most verifiers produce binary rewards (correct/incorrect). The magnitude is controlled by verification_reward (default: 10.0). Binary rewards simplify advantage computation:

For a group of K completions for the same prompt:
    scores = [verifier(completions[k]) for k in range(K)]
    mean_score = mean(scores)
    std_score = std(scores)
    advantages = [(s - mean_score) / (std_score + epsilon) for s in scores]

Verifier composability: Multiple reward signals can be combined additively:

total_reward = 0
if apply_verifiable_reward:
    total_reward += verification_reward * is_correct(response, ground_truth)
if apply_format_reward and additive_format_reward:
    total_reward += format_reward * has_correct_format(response)

Filtering zero-variance groups: When all completions in a group receive the same reward (standard deviation = 0), the advantages are all zero. These groups contribute no gradient signal and can be filtered to save training compute. However, this filtering must be disabled when num_samples_per_prompt_rollout=1 (REINFORCE mode).
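A minimal sketch of this filtering rule, with an assumed function name and the group rewards represented as plain lists:

```python
import numpy as np

def trainable_group_indices(
    groups: list[list[float]],
    num_samples_per_prompt_rollout: int,
) -> list[int]:
    """Indices of groups worth training on. Groups where every completion
    got the same reward (std == 0) have all-zero advantages and are
    skipped -- except in REINFORCE mode (1 sample per prompt), where the
    per-group std is always 0 and filtering would discard every group."""
    if num_samples_per_prompt_rollout == 1:
        return list(range(len(groups)))
    return [i for i, g in enumerate(groups) if np.std(g) > 0.0]
```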

Related Pages

Implemented By
