Principle: AllenAI open-instruct Reward Extraction
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning from Human Feedback, Reward Modeling, Sequence Modeling |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Reward extraction is the process of obtaining a single scalar reward score from a sequence classification model by selecting the score at the last non-padding token position in a variable-length sequence, ensuring that the reward reflects the model's assessment of the complete prompt-completion pair.
Description
A reward model based on a transformer sequence classification architecture produces a scalar output for every token position in the input sequence. However, only one of these per-position scores is meaningful as the reward for the entire sequence: the score at the position of the last actual (non-padding) token. This is analogous to how GPT-style models use the last token's hidden state for classification tasks.
The challenge arises because:
- Variable-length sequences: Within a batch, different sequences may have different lengths. After padding to the maximum length, the position of the last real token varies per sequence.
- Padding tokens are uninformative: The score head's output at padding positions is meaningless because those positions contain no real content. The model has never been trained to produce meaningful scores at padding token locations.
- Context-dependent scoring: The reward should be based on the transformer's hidden state after processing the entire prompt and completion, which is captured at the last real token position due to the causal attention mask.
The extraction process involves:
- Computing attention masks: Identifying which positions contain real tokens versus padding tokens.
- Finding sequence end positions: For each sequence in the batch, determining the index of the last non-padding token.
- Gathering per-sequence scores: Using advanced indexing to select the score at each sequence's end position from the full score tensor.
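These three steps can be sketched in PyTorch as follows; the function name, tensor shapes, and pad ID here are illustrative, not taken from a specific codebase:

```python
import torch

def extract_reward(scores: torch.Tensor, input_ids: torch.Tensor,
                   pad_token_id: int) -> torch.Tensor:
    """Select the per-sequence reward at each sequence's last real token.

    scores:    (batch, seq_len) per-token scalar scores from the score head
    input_ids: (batch, seq_len) token IDs, right-padded with pad_token_id
    """
    # 1. Attention mask: 1 where the token is real, 0 where it is padding.
    attention_mask = (input_ids != pad_token_id).long()
    # 2. Index of the last non-padding token in each sequence.
    last_indices = attention_mask.sum(dim=1) - 1          # (batch,)
    # 3. Gather the score at each sequence's end position.
    return scores.gather(1, last_indices.unsqueeze(1)).squeeze(1)  # (batch,)

# Example: two sequences of lengths 3 and 2, padded to 4 with pad ID 0.
ids = torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]])
s = torch.tensor([[0.1, 0.2, 0.9, -1.0], [0.3, 0.7, -1.0, -1.0]])
print(extract_reward(s, ids, pad_token_id=0))  # tensor([0.9000, 0.7000])
```

Note that the scores at padding positions (here `-1.0`) are never selected, which is the whole point of the gather step.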
Usage
Use reward extraction whenever:
- Obtaining scalar rewards from a sequence classification model for use in RL training (PPO, GRPO, etc.).
- Evaluating a reward model on preference pairs (computing chosen/rejected scores).
- Debugging reward model behavior by inspecting per-token and per-sequence reward values.
- Any scenario where a single scalar score is needed from a transformer that produces per-token outputs.
Theoretical Basis
Autoregressive Reward Computation
In a causal (decoder-only) transformer, the hidden state $h_t$ at position $t$ aggregates information from all tokens at positions $t' \le t$. Therefore, the hidden state at the last token of a sequence of length $T$ contains the most complete representation of the entire input:

$$h_T = f_\theta(x_1, \dots, x_T)$$

The reward for the full sequence is then the score head applied to that final hidden state:

$$r = w^\top h_T$$
Handling Variable-Length Sequences
Given a batch of $B$ sequences with varying lengths $\ell_i$, padded to maximum length $T_{\max}$, the reward extraction computes:

$$r_i = S_{i,\,\ell_i - 1}$$

where $S \in \mathbb{R}^{B \times T_{\max}}$ is the full per-token score tensor.
The last-token index $\ell_i - 1$ is computed by finding the first padding token after any context prefix:

$$\ell_i - 1 = c + \min\{\, t \ge 0 : x_{i,\,c+t} = p \,\} - 1$$

where $c$ is the context length (prompt length) and $p$ is the padding token ID. If no padding token exists, $\ell_i - 1 = T_{\max} - 1$.
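This length computation can be sketched as follows, assuming right padding; the `first_true_indices` helper is written here for illustration:

```python
import torch

def first_true_indices(bools: torch.Tensor) -> torch.Tensor:
    """Index of the first True along dim 1; returns the row length if none is True."""
    seq_len = bools.size(1)
    row = torch.arange(seq_len).expand_as(bools)
    # Replace False positions with seq_len so min() picks the first True.
    idx = torch.where(bools, row, torch.full_like(row, seq_len))
    return idx.min(dim=1).values

# query_responses: (batch, T_max) prompt + completion, right-padded.
query_responses = torch.tensor([[11, 12, 21, 22, 0, 0],
                                [11, 12, 21, 0, 0, 0]])
context_length = 2   # the prompt occupies the first 2 positions
pad_token_id = 0

# Index of the last real token: context + (first pad after context) - 1.
sequence_lengths = (
    first_true_indices(query_responses[:, context_length:] == pad_token_id)
    - 1 + context_length
)
print(sequence_lengths)  # tensor([3, 2])
```

If a row contains no padding after the context, `first_true_indices` returns the remaining length, so the computed index falls on the last position of the padded tensor, matching the no-padding case above.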
Attention Mask and Position IDs
Proper reward extraction requires that the transformer processes the sequence with correct attention masks and position IDs:
- Attention mask: $m_{i,t} = \mathbb{1}[x_{i,t} \ne p]$ ensures padding tokens are excluded from attention computation.
- Position IDs: Computed as the exclusive cumulative sum of the attention mask, $\mathrm{pos}_{i,t} = \sum_{t' < t} m_{i,t'}$, ensuring that real tokens receive contiguous position indices regardless of padding.
- Masked input IDs: Padding tokens are replaced with zeros in the input to prevent the embedding layer from producing meaningful representations for padding positions.
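A minimal sketch of constructing these three inputs; the pad ID and tensors are illustrative:

```python
import torch

pad_token_id = 2  # illustrative pad ID
input_ids = torch.tensor([[5, 6, 7, 2, 2],
                          [8, 9, 2, 2, 2]])

# Attention mask: 1 for real tokens, 0 for padding.
attention_mask = (input_ids != pad_token_id).long()

# Position IDs: exclusive cumulative sum of the mask, so real tokens get
# contiguous positions 0, 1, 2, ... regardless of where padding falls.
position_ids = attention_mask.cumsum(dim=1) - attention_mask

# Masked input IDs: replace padding tokens with 0 before the embedding layer.
masked_input_ids = input_ids.masked_fill(attention_mask == 0, 0)

print(position_ids.tolist())      # [[0, 1, 2, 3, 3], [0, 1, 2, 2, 2]]
print(masked_input_ids.tolist())  # [[5, 6, 7, 0, 0], [8, 9, 0, 0, 0]]
```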
Full Per-Token Score Tensor
While only the final score is used as the reward, the extraction function also returns the full per-token score tensor $S$. This is useful for:
- Debugging: Examining how per-token scores evolve across a sequence can reveal whether the model is attending to specific parts of the completion.
- Token-level reward shaping: Some advanced RL methods use per-token rewards rather than a single end-of-sequence reward.
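Returning both values can be sketched as follows; the function name and tensors are illustrative:

```python
import torch

def get_reward_with_trace(scores: torch.Tensor, attention_mask: torch.Tensor):
    """Return the final scalar reward per sequence plus the full per-token trace.

    scores:         (batch, seq_len) per-token scores from the score head
    attention_mask: (batch, seq_len) 1 for real tokens, 0 for padding
    """
    last = attention_mask.sum(dim=1) - 1                    # last real token index
    final = scores.gather(1, last.unsqueeze(1)).squeeze(1)  # (batch,)
    return final, scores

mask = torch.tensor([[1, 1, 1, 0]])
per_token = torch.tensor([[0.1, 0.4, 0.8, 0.0]])
reward, trace = get_reward_with_trace(per_token, mask)
print(reward)        # tensor([0.8000])
print(trace[0, :3])  # per-token scores over the real tokens, for debugging
```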