Principle: AllenAI open-instruct Reward Extraction
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning from Human Feedback, Reward Modeling, Sequence Modeling |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Reward extraction is the process of obtaining a single scalar reward score from a sequence classification model by selecting the score at the last non-padding token position in a variable-length sequence, ensuring that the reward reflects the model's assessment of the complete prompt-completion pair.
Description
A reward model based on a transformer sequence classification architecture produces a scalar output for every token position in the input sequence. However, only one of these per-position scores is meaningful as the reward for the entire sequence: the score at the position of the last actual (non-padding) token. This is analogous to how GPT-style models use the last token's hidden state for classification tasks.
The challenge arises because:
- Variable-length sequences: Within a batch, different sequences may have different lengths. After padding to the maximum length, the position of the last real token varies per sequence.
- Padding tokens are uninformative: The score head's output at padding positions is meaningless because those positions contain no real content. The model has never been trained to produce meaningful scores at padding token locations.
- Context-dependent scoring: The reward should be based on the transformer's hidden state after processing the entire prompt and completion, which is captured at the last real token position due to the causal attention mask.
The extraction process involves:
- Computing attention masks: Identifying which positions contain real tokens versus padding tokens.
- Finding sequence end positions: For each sequence in the batch, determining the index of the last non-padding token.
- Gathering per-sequence scores: Using advanced indexing to select the score at each sequence's end position from the full score tensor.
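These three steps can be sketched in PyTorch as follows; the function name, tensor shapes, and pad ID here are illustrative, not taken from a specific codebase:

```python
import torch

def extract_reward(scores: torch.Tensor, input_ids: torch.Tensor,
                   pad_token_id: int) -> torch.Tensor:
    """Select the per-sequence reward at each sequence's last real token.

    scores:    (batch, seq_len) per-token scalar scores from the score head
    input_ids: (batch, seq_len) token IDs, right-padded with pad_token_id
    """
    # 1. Attention mask: 1 where the token is real, 0 where it is padding.
    attention_mask = (input_ids != pad_token_id).long()
    # 2. Index of the last non-padding token in each sequence.
    last_indices = attention_mask.sum(dim=1) - 1          # (batch,)
    # 3. Gather the score at each sequence's end position.
    return scores.gather(1, last_indices.unsqueeze(1)).squeeze(1)  # (batch,)

# Example: two sequences of lengths 3 and 2, padded to 4 with pad ID 0.
ids = torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]])
s = torch.tensor([[0.1, 0.2, 0.9, -1.0], [0.3, 0.7, -1.0, -1.0]])
print(extract_reward(s, ids, pad_token_id=0))  # tensor([0.9000, 0.7000])
```

Note that the scores at padding positions (here `-1.0`) are never selected, which is the whole point of the gather step.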
Usage
Use reward extraction whenever:
- Obtaining scalar rewards from a sequence classification model for use in RL training (PPO, GRPO, etc.).
- Evaluating a reward model on preference pairs (computing chosen/rejected scores).
- Debugging reward model behavior by inspecting per-token and per-sequence reward values.
- Any scenario where a single scalar score is needed from a transformer that produces per-token outputs.
Theoretical Basis
Autoregressive Reward Computation
In a causal (decoder-only) transformer, the hidden state $h_t$ at position $t$ aggregates information from all tokens at positions $t' \le t$. Therefore, the hidden state at the last token of a sequence of length $T$ contains the most complete representation of the entire input:

$$h_T = f_\theta(x_1, \dots, x_T)$$

The reward for the full sequence is then the score head applied to that final hidden state:

$$r = w^\top h_T$$
Handling Variable-Length Sequences
Given a batch of $B$ sequences with varying lengths $\ell_i$, padded to maximum length $T_{\max}$, the reward extraction computes:

$$r_i = S_{i,\,\ell_i - 1}$$

where $S \in \mathbb{R}^{B \times T_{\max}}$ is the full per-token score tensor.
The last-token index $\ell_i - 1$ is computed by finding the first padding token after any context prefix:

$$\ell_i - 1 = c + \min\{\, t \ge 0 : x_{i,\,c+t} = p \,\} - 1$$

where $c$ is the context length (prompt length) and $p$ is the padding token ID. If no padding token exists, $\ell_i - 1 = T_{\max} - 1$.
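This length computation can be sketched as follows, assuming right padding; the `first_true_indices` helper is written here for illustration:

```python
import torch

def first_true_indices(bools: torch.Tensor) -> torch.Tensor:
    """Index of the first True along dim 1; returns the row length if none is True."""
    seq_len = bools.size(1)
    row = torch.arange(seq_len).expand_as(bools)
    # Replace False positions with seq_len so min() picks the first True.
    idx = torch.where(bools, row, torch.full_like(row, seq_len))
    return idx.min(dim=1).values

# query_responses: (batch, T_max) prompt + completion, right-padded.
query_responses = torch.tensor([[11, 12, 21, 22, 0, 0],
                                [11, 12, 21, 0, 0, 0]])
context_length = 2   # the prompt occupies the first 2 positions
pad_token_id = 0

# Index of the last real token: context + (first pad after context) - 1.
sequence_lengths = (
    first_true_indices(query_responses[:, context_length:] == pad_token_id)
    - 1 + context_length
)
print(sequence_lengths)  # tensor([3, 2])
```

If a row contains no padding after the context, `first_true_indices` returns the remaining length, so the computed index falls on the last position of the padded tensor, matching the no-padding case above.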
Attention Mask and Position IDs
Proper reward extraction requires that the transformer processes the sequence with correct attention masks and position IDs:
- Attention mask: $m_{i,t} = \mathbb{1}[x_{i,t} \ne p]$ ensures padding tokens are excluded from attention computation.
- Position IDs: Computed as the exclusive cumulative sum of the attention mask, $\mathrm{pos}_{i,t} = \sum_{t' < t} m_{i,t'}$, ensuring that real tokens receive contiguous position indices regardless of padding.
- Masked input IDs: Padding tokens are replaced with zeros in the input to prevent the embedding layer from producing meaningful representations for padding positions.
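A minimal sketch of constructing these three inputs; the pad ID and tensors are illustrative:

```python
import torch

pad_token_id = 2  # illustrative pad ID
input_ids = torch.tensor([[5, 6, 7, 2, 2],
                          [8, 9, 2, 2, 2]])

# Attention mask: 1 for real tokens, 0 for padding.
attention_mask = (input_ids != pad_token_id).long()

# Position IDs: exclusive cumulative sum of the mask, so real tokens get
# contiguous positions 0, 1, 2, ... regardless of where padding falls.
position_ids = attention_mask.cumsum(dim=1) - attention_mask

# Masked input IDs: replace padding tokens with 0 before the embedding layer.
masked_input_ids = input_ids.masked_fill(attention_mask == 0, 0)

print(position_ids.tolist())      # [[0, 1, 2, 3, 3], [0, 1, 2, 2, 2]]
print(masked_input_ids.tolist())  # [[5, 6, 7, 0, 0], [8, 9, 0, 0, 0]]
```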
Full Per-Token Score Tensor
While only the final score is used as the reward, the extraction function also returns the full per-token score tensor $S$. This is useful for:
- Debugging: Examining how per-token scores evolve across a sequence can reveal whether the model is attending to specific parts of the completion.
- Token-level reward shaping: Some advanced RL methods use per-token rewards rather than a single end-of-sequence reward.
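Returning both values can be sketched as follows; the function name and tensors are illustrative:

```python
import torch

def get_reward_with_trace(scores: torch.Tensor, attention_mask: torch.Tensor):
    """Return the final scalar reward per sequence plus the full per-token trace.

    scores:         (batch, seq_len) per-token scores from the score head
    attention_mask: (batch, seq_len) 1 for real tokens, 0 for padding
    """
    last = attention_mask.sum(dim=1) - 1                    # last real token index
    final = scores.gather(1, last.unsqueeze(1)).squeeze(1)  # (batch,)
    return final, scores

mask = torch.tensor([[1, 1, 1, 0]])
per_token = torch.tensor([[0.1, 0.4, 0.8, 0.0]])
reward, trace = get_reward_with_trace(per_token, mask)
print(reward)        # tensor([0.8000])
print(trace[0, :3])  # per-token scores over the real tokens, for debugging
```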