
Implementation:Allenai Open instruct Get Reward

From Leeroopedia


Knowledge Sources
Domains: Reinforcement Learning from Human Feedback, Reward Modeling, Sequence Modeling
Last Updated: 2026-02-07 00:00 GMT

Overview

A concrete utility provided by Open Instruct for extracting scalar reward scores from a sequence-classification reward model, with support for variable-length sequences within a padded batch.

Description

The get_reward() function performs a forward pass through a reward model (an AutoModelForSequenceClassification with num_labels=1) and extracts the final scalar reward for each sequence in a batch. It handles the complexity of variable-length sequences within a padded batch by:

  1. Constructing proper attention masks and position IDs from the input token IDs and padding token ID.
  2. Running the transformer backbone (not the full model) to obtain the last hidden states, then applying the score head separately.
  3. Finding the last non-padding token position for each sequence using the first_true_indices helper.
  4. Gathering the per-sequence final scores by indexing into the per-token score tensor at the computed positions.

The function directly accesses the transformer backbone via model.base_model_prefix and calls the score head (model.score) separately, rather than using the model's standard forward() method. This provides more control over the intermediate outputs and avoids potential issues with different model architectures' forward methods.
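This access pattern can be sketched on a tiny, randomly initialized model (the config values below are arbitrary and chosen only to avoid a checkpoint download; any AutoModelForSequenceClassification exposes the same attributes):

```python
from transformers import AutoModelForSequenceClassification, GPT2Config

# Tiny arbitrary config so no checkpoint download is needed.
config = GPT2Config(n_layer=1, n_head=1, n_embd=8, vocab_size=32, num_labels=1)
model = AutoModelForSequenceClassification.from_config(config)

# The transformer backbone is reachable via base_model_prefix ...
backbone = getattr(model, model.base_model_prefix)  # a GPT2Model here
# ... and the scalar score head is a separate linear layer on top of it.
print(model.base_model_prefix)   # "transformer" for GPT-2-style models
print(model.score.out_features)  # 1, since num_labels=1
```

Because get_reward() only relies on these two attributes, it works across architectures whose forward() methods pool or index hidden states differently.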

Usage

Use this function whenever you need to compute reward scores from a reward model. It is called in:

  • The reward model training loop (to compute chosen/rejected rewards for the Bradley-Terry loss).
  • The reward model evaluation pipeline (to compute metrics on held-out preference data).
  • RLHF policy training (to score generated completions for the RL objective).

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/model_utils.py, lines 322-386

Signature

def get_reward(
    model: torch.nn.Module,
    query_responses: torch.Tensor,
    pad_token_id: int,
    context_length: int,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:

Import

from open_instruct.model_utils import get_reward

I/O Contract

Inputs

  • model (torch.nn.Module, required) — A pre-trained reward model (typically AutoModelForSequenceClassification with num_labels=1). Must have a base_model_prefix attribute pointing to the transformer backbone and a score attribute for the classification head.
  • query_responses (torch.Tensor, required) — Tokenized input sequences of shape (batch_size, sequence_length): the concatenated prompt + completion token IDs. Shorter sequences should be right-padded with pad_token_id.
  • pad_token_id (int, required) — The token ID used for padding. Used to construct the attention mask and to find sequence end positions.
  • context_length (int, required) — The length of the prompt/context preceding the completion. Used when computing sequence lengths to find the first padding token after the context. Set to 0 for reward model training, where the entire sequence is considered.

Outputs

  • reward_logits (torch.Tensor) — Per-token reward scores of shape (batch_size, sequence_length). Each position contains the score head's output for that token's hidden state. Note: this is squeezed from shape (batch_size, sequence_length, 1).
  • final_scores (torch.Tensor) — The scalar reward for each sequence, of shape (batch_size,), obtained by indexing reward_logits at each sequence's last non-padding token position.
  • sequence_lengths (torch.Tensor) — The index of the last non-padding token for each sequence, of shape (batch_size,). Useful for downstream processing and debugging.

Usage Examples

Basic Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from open_instruct.model_utils import get_reward

# Load reward model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "my-reward-model", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("my-reward-model")

# Tokenize some text
texts = ["This is a good response.", "This is a bad response."]
inputs = tokenizer(texts, padding=True, return_tensors="pt")

# Get rewards
reward_logits, final_scores, seq_lengths = get_reward(
    model, inputs["input_ids"], tokenizer.pad_token_id, context_length=0
)
# final_scores: tensor([0.42, -0.15])  (example values)

In Reward Model Training

import torch
import torch.nn.functional as F
from open_instruct.model_utils import get_reward

# Concatenate chosen and rejected sequences
query_responses = torch.cat(
    (data["input_ids_chosen"], data["input_ids_rejected"]), dim=0
)

# Forward pass to get rewards
_, predicted_reward, _ = get_reward(
    model, query_responses, tokenizer.pad_token_id, context_length=0
)

# Split into chosen and rejected rewards
chosen_reward = predicted_reward[:data["input_ids_chosen"].shape[0]]
rejected_reward = predicted_reward[data["input_ids_chosen"].shape[0]:]

# Compute Bradley-Terry loss
loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
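A common companion metric to the Bradley-Terry loss (not shown in the snippet above, so treat it as an illustrative assumption) is pairwise accuracy: the fraction of pairs in which the chosen response outscores the rejected one.

```python
import torch
import torch.nn.functional as F

# Toy reward values standing in for get_reward() outputs
chosen_reward = torch.tensor([0.8, -0.1, 0.3])
rejected_reward = torch.tensor([0.2, 0.4, -0.5])

# Bradley-Terry loss on the toy values
loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Pairwise accuracy: how often chosen > rejected
accuracy = (chosen_reward > rejected_reward).float().mean()
print(accuracy)  # tensor(0.6667): 2 of 3 pairs ranked correctly
```

Accuracy is the standard held-out metric for the evaluation pipeline mentioned under Usage, since it is invariant to the reward model's overall scale.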

Implementation Details

The function follows a specific sequence of operations:

Step 1: Attention mask construction

attention_mask = query_responses != pad_token_id

Step 2: Position ID computation

position_ids = attention_mask.cumsum(1) - attention_mask.long()

This computes an exclusive cumulative sum so that real tokens get contiguous position indices (0, 1, 2, ...), while right-padding positions all receive the count of real tokens in the row (one past the last real index), which is harmless since they are masked out.

Step 3: Input masking

input_ids = torch.masked_fill(query_responses, ~attention_mask, 0)

Padding tokens are replaced with 0 to avoid producing meaningful embeddings at padding positions.
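On a toy right-padded batch (pad_token_id = 2 here, chosen arbitrarily), steps 1-3 produce:

```python
import torch

pad_token_id = 2
query_responses = torch.tensor([[5, 6, 7, 2, 2],
                                [8, 9, 2, 2, 2]])

# Step 1: mask is True at real tokens, False at padding
attention_mask = query_responses != pad_token_id

# Step 2: exclusive cumsum gives 0, 1, 2, ... at real tokens;
# padding positions all get the count of real tokens in their row
position_ids = attention_mask.cumsum(1) - attention_mask.long()
print(position_ids)
# tensor([[0, 1, 2, 3, 3],
#         [0, 1, 2, 2, 2]])

# Step 3: padding token IDs are replaced with 0
input_ids = torch.masked_fill(query_responses, ~attention_mask, 0)
print(input_ids)
# tensor([[5, 6, 7, 0, 0],
#         [8, 9, 0, 0, 0]])
```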

Step 4: Backbone forward pass

lm_backbone = getattr(model, model.base_model_prefix)
output = lm_backbone(
    input_ids=input_ids,
    attention_mask=attention_mask,
    position_ids=position_ids,
    return_dict=True,
    output_hidden_states=True,
    use_cache=False,
)

Note: use_cache=False is explicitly set because some architectures (e.g., Mistral-based models) error when cache is enabled with the sequence classification head.

Step 5: Score head application

reward_logits = model.score(output.hidden_states[-1])

The score head is applied to the last hidden states from the transformer backbone.

Step 6: Sequence length computation and final score extraction

sequence_lengths = first_true_indices(
    query_responses[:, context_length:] == pad_token_id
) - 1 + context_length
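Here first_true_indices locates the first padding token after the context; subtracting 1 gives the index of the last real token, and adding context_length maps it back into the full sequence. The remaining gather of final scores can be sketched as follows (a sketch on toy tensors, not the verbatim source):

```python
import torch

# Toy per-token scores as produced by the score head: (batch, seq_len, 1)
reward_logits = torch.tensor([[[0.1], [0.2], [0.9], [0.0]],
                              [[0.5], [0.7], [0.0], [0.0]]])
sequence_lengths = torch.tensor([2, 1])  # last non-padding index per row

# Index each row at its last non-padding position, then drop the size-1 dim
batch_idx = torch.arange(reward_logits.size(0))
final_scores = reward_logits[batch_idx, sequence_lengths].squeeze(-1)
print(final_scores)  # tensor([0.9000, 0.7000])
```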

Dependencies

  • torch (torch.nn) — Model and tensor operations.
  • torch (torch.Tensor) — Input/output tensor type.
  • open_instruct (model_utils.first_true_indices) — Helper to find the index of the first True value in a boolean tensor, used for sequence-length computation.
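The first_true_indices helper is not reproduced on this page; a behavioral sketch (the real implementation may differ) that returns the row length when a row contains no True values:

```python
import torch

def first_true_indices_sketch(bools: torch.Tensor) -> torch.Tensor:
    # Index of the first True per row; rows with no True return row_len.
    row_len = bools.size(-1)
    idx = torch.arange(row_len, device=bools.device).expand_as(bools)
    masked = torch.where(bools, idx, torch.full_like(idx, row_len))
    return masked.min(dim=-1).values

# As used in get_reward(): last real token index for a right-padded row
seq = torch.tensor([[5, 6, 7, 2, 2]])  # pad_token_id = 2
print(first_true_indices_sketch(seq == 2) - 1)  # tensor([2])
```

The "row length when no True" convention matters: a sequence with no padding at all still yields its last valid index after the - 1 in Step 6.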

