Implementation: AllenAI Open Instruct get_reward
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning from Human Feedback, Reward Modeling, Sequence Modeling |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete utility from Open Instruct for extracting scalar reward scores from a sequence-classification reward model, with built-in handling of variable-length sequences in a padded batch.
Description
The get_reward() function performs a forward pass through a reward model (an AutoModelForSequenceClassification with num_labels=1) and extracts the final scalar reward for each sequence in a batch. It handles the complexity of variable-length sequences within a padded batch by:
- Constructing proper attention masks and position IDs from the input token IDs and padding token ID.
- Running the transformer backbone (not the full model) to obtain the last hidden states, then applying the score head separately.
- Finding the last non-padding token position for each sequence using the first_true_indices helper.
- Gathering the per-sequence final scores by indexing into the per-token score tensor at the computed positions.
The function directly accesses the transformer backbone via model.base_model_prefix and calls the score head (model.score) separately, rather than using the model's standard forward() method. This provides more control over the intermediate outputs and avoids potential issues with different model architectures' forward methods.
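As an illustration of this access pattern (a minimal sketch, not the function itself; in transformers, base_model_prefix is "model" for Llama-style classes and "transformer" for GPT-2-style ones):
# Sketch of the backbone/score split used by get_reward().
backbone = getattr(model, model.base_model_prefix)  # transformer without the head
hidden = backbone(
    input_ids=input_ids, attention_mask=attention_mask
).last_hidden_state                                 # (batch, seq_len, hidden_size)
per_token_scores = model.score(hidden)              # (batch, seq_len, num_labels=1)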
Usage
Use this function whenever you need to compute reward scores from a reward model. It is called in:
- The reward model training loop (to compute chosen/rejected rewards for the Bradley-Terry loss).
- The reward model evaluation pipeline (to compute metrics on held-out preference data).
- RLHF policy training (to score generated completions for the RL objective).
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/model_utils.py, lines 322-386
Signature
def get_reward(
model: torch.nn.Module,
query_responses: torch.Tensor,
pad_token_id: int,
context_length: int,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
Import
from open_instruct.model_utils import get_reward
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | torch.nn.Module | Yes | A pre-trained reward model (typically AutoModelForSequenceClassification with num_labels=1). Must have a base_model_prefix attribute pointing to the transformer backbone and a score attribute for the classification head. |
| query_responses | torch.Tensor | Yes | Tokenized input sequences of shape (batch_size, sequence_length): the concatenated prompt + completion token IDs. Shorter sequences should be right-padded with pad_token_id. |
| pad_token_id | int | Yes | The token ID used for padding. Used to construct the attention mask and find sequence end positions. |
| context_length | int | Yes | The length of the prompt/context preceding the completion. Used when computing sequence lengths to find the first padding token after the context (see the worked sketch after this table). Set to 0 for reward model training, where the entire sequence is considered. |
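To make the context_length semantics concrete, here is a hand-worked sketch with made-up token IDs (argmax stands in for the first_true_indices helper and assumes at least one padding token is present):
import torch

# 3 prompt tokens + 2 completion tokens + right padding, pad_token_id = 0
query_responses = torch.tensor([[11, 12, 13, 21, 22, 0, 0]])
context_length = 3

# Find the first padding token *after* the context:
is_pad = query_responses[:, context_length:] == 0  # [[False, False, True, True]]
first_pad = is_pad.long().argmax(dim=1)            # tensor([2])
last_token = first_pad - 1 + context_length        # tensor([4]) -> last completion token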
Outputs
| Name | Type | Description |
|---|---|---|
| reward_logits | torch.Tensor | Per-token reward scores of shape (batch_size, sequence_length). Each position contains the score head's output for that token's hidden state. Note: this is squeezed from shape (batch_size, sequence_length, 1). |
| final_scores | torch.Tensor | The scalar reward for each sequence, of shape (batch_size,). Obtained by indexing reward_logits at each sequence's last non-padding token position. |
| sequence_lengths | torch.Tensor | The index of the last non-padding token for each sequence, of shape (batch_size,). Useful for downstream processing and debugging. |
Usage Examples
Basic Usage
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from open_instruct.model_utils import get_reward
# Load reward model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
"my-reward-model", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("my-reward-model")
# Tokenize some text
texts = ["This is a good response.", "This is a bad response."]
inputs = tokenizer(texts, padding=True, return_tensors="pt")
# Get rewards
reward_logits, final_scores, seq_lengths = get_reward(
model, inputs["input_ids"], tokenizer.pad_token_id, context_length=0
)
# final_scores: tensor([0.42, -0.15]) (example values)
In Reward Model Training
import torch
import torch.nn.functional as F
from open_instruct.model_utils import get_reward
# Concatenate chosen and rejected sequences
query_responses = torch.cat(
(data["input_ids_chosen"], data["input_ids_rejected"]), dim=0
)
# Forward pass to get rewards
_, predicted_reward, _ = get_reward(
model, query_responses, tokenizer.pad_token_id, context_length=0
)
# Split into chosen and rejected rewards
chosen_reward = predicted_reward[:data["input_ids_chosen"].shape[0]]
rejected_reward = predicted_reward[data["input_ids_chosen"].shape[0]:]
# Compute Bradley-Terry loss
loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
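Concatenating the chosen and rejected sequences along the batch dimension scores both halves in a single forward pass and keeps both rewards on the same computation graph, which is exactly what the pairwise Bradley-Terry loss needs.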
Implementation Details
The function follows a specific sequence of operations:
Step 1: Attention mask construction
attention_mask = query_responses != pad_token_id
Step 2: Position ID computation
position_ids = attention_mask.cumsum(1) - attention_mask.long()
This computes an exclusive cumulative sum over the mask: each real token receives a contiguous position index (0, 1, 2, ...) counted over real tokens only, while padding positions receive a repeated index (0 under left padding; the number of real tokens under right padding). This is harmless because those positions are masked out.
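For example (outputs computed directly from the expression above):
import torch

attention_mask = torch.tensor([[False, False, True, True, True],   # left-padded
                               [True, True, True, False, False]])  # right-padded
position_ids = attention_mask.cumsum(1) - attention_mask.long()
# tensor([[0, 0, 0, 1, 2],
#         [0, 1, 2, 3, 3]])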
Step 3: Input masking
input_ids = torch.masked_fill(query_responses, ~attention_mask, 0)
Padding positions are overwritten with token ID 0 so the embedding lookup stays well-defined regardless of the pad token ID; the outputs at these positions are ignored downstream because they are masked out.
Step 4: Backbone forward pass
lm_backbone = getattr(model, model.base_model_prefix)
output = lm_backbone(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
return_dict=True,
output_hidden_states=True,
use_cache=False,
)
Note: use_cache=False is explicitly set because some architectures (e.g., Mistral-based models) error when cache is enabled with the sequence classification head.
Step 5: Score head application
reward_logits = model.score(output.hidden_states[-1])
The score head is applied to the last hidden states from the transformer backbone.
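In shape terms (the nn.Linear head is what transformers uses for its sequence-classification classes; treat the exact signature as an assumption for other architectures):
# output.hidden_states[-1]: (batch_size, sequence_length, hidden_size)
# model.score:              typically nn.Linear(hidden_size, num_labels=1)
# model.score(...):         (batch_size, sequence_length, 1), squeezed to
#                           (batch_size, sequence_length) per the I/O contract above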
Step 6: Sequence length computation and final score extraction
sequence_lengths = first_true_indices(
query_responses[:, context_length:] == pad_token_id
) - 1 + context_length
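The final per-sequence scores are then gathered by indexing each row of reward_logits at its computed position. A sketch of that gather (the exact indexing and squeeze placement may differ in the source):
batch_indices = torch.arange(reward_logits.size(0), device=reward_logits.device)
final_scores = reward_logits[batch_indices, sequence_lengths].squeeze(-1)  # (batch_size,)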
Dependencies
| Package | Module | Purpose |
|---|---|---|
| torch | torch.nn | Model and tensor operations |
| torch | torch.Tensor | Input/output tensor type |
| open_instruct | model_utils.first_true_indices | Helper to find the index of the first True value in a boolean tensor (used for sequence length computation) |
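For reference, first_true_indices can be implemented with a vectorized trick like the following (a sketch matching the behavior described above, not necessarily the exact source; rows containing no True value return the row length):
import torch

def first_true_indices(bools: torch.Tensor, dtype=torch.long) -> torch.Tensor:
    """Index of the first True in each row; row_len if a row has no True."""
    row_len = bools.size(-1)
    # False positions are pushed past row_len, so the row-wise minimum
    # is the index of the first True value.
    penalized = (~bools).type(dtype) * row_len + torch.arange(
        row_len, dtype=dtype, device=bools.device
    )
    return torch.min(penalized, dim=-1).values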