
Implementation:Allenai Open instruct Get Reward

From Leeroopedia


Knowledge Sources
Domains: Reinforcement Learning from Human Feedback, Reward Modeling, Sequence Modeling
Last Updated: 2026-02-07 00:00 GMT

Overview

A concrete utility provided by Open Instruct for extracting scalar reward scores from a sequence-classification reward model, with support for variable-length sequences within a padded batch.

Description

The get_reward() function performs a forward pass through a reward model (an AutoModelForSequenceClassification with num_labels=1) and extracts the final scalar reward for each sequence in a batch. It handles the complexity of variable-length sequences within a padded batch by:

  1. Constructing proper attention masks and position IDs from the input token IDs and padding token ID.
  2. Running the transformer backbone (not the full model) to obtain the last hidden states, then applying the score head separately.
  3. Finding the last non-padding token position for each sequence using the first_true_indices helper.
  4. Gathering the per-sequence final scores by indexing into the per-token score tensor at the computed positions.

The function directly accesses the transformer backbone via model.base_model_prefix and calls the score head (model.score) separately, rather than using the model's standard forward() method. This provides more control over the intermediate outputs and avoids potential issues with different model architectures' forward methods.
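This access pattern can be sketched on a tiny, randomly initialized model (the config values below are arbitrary and chosen only to avoid a checkpoint download; any AutoModelForSequenceClassification exposes the same attributes):

```python
from transformers import AutoModelForSequenceClassification, GPT2Config

# Tiny arbitrary config so no checkpoint download is needed.
config = GPT2Config(n_layer=1, n_head=1, n_embd=8, vocab_size=32, num_labels=1)
model = AutoModelForSequenceClassification.from_config(config)

# The transformer backbone is reachable via base_model_prefix ...
backbone = getattr(model, model.base_model_prefix)  # a GPT2Model here
# ... and the scalar score head is a separate linear layer on top of it.
print(model.base_model_prefix)   # "transformer" for GPT-2-style models
print(model.score.out_features)  # 1, since num_labels=1
```

Because get_reward() only relies on these two attributes, it works across architectures whose forward() methods pool or index hidden states differently.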

Usage

Use this function whenever you need to compute reward scores from a reward model. It is called in:

  • The reward model training loop (to compute chosen/rejected rewards for the Bradley-Terry loss).
  • The reward model evaluation pipeline (to compute metrics on held-out preference data).
  • RLHF policy training (to score generated completions for the RL objective).

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/model_utils.py, lines 322-386

Signature

def get_reward(
    model: torch.nn.Module,
    query_responses: torch.Tensor,
    pad_token_id: int,
    context_length: int,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:

Import

from open_instruct.model_utils import get_reward

I/O Contract

Inputs

  • model (torch.nn.Module, required) — A pre-trained reward model (typically AutoModelForSequenceClassification with num_labels=1). Must have a base_model_prefix attribute pointing to the transformer backbone and a score attribute for the classification head.
  • query_responses (torch.Tensor, required) — Tokenized input sequences of shape (batch_size, sequence_length): the concatenated prompt + completion token IDs. Shorter sequences should be right-padded with pad_token_id.
  • pad_token_id (int, required) — The token ID used for padding. Used to construct the attention mask and to find sequence end positions.
  • context_length (int, required) — The length of the prompt/context preceding the completion. Used when computing sequence lengths to find the first padding token after the context. Set to 0 for reward model training, where the entire sequence is considered.

Outputs

  • reward_logits (torch.Tensor) — Per-token reward scores of shape (batch_size, sequence_length). Each position contains the score head's output for that token's hidden state. Note: this is squeezed from shape (batch_size, sequence_length, 1).
  • final_scores (torch.Tensor) — The scalar reward for each sequence, of shape (batch_size,), obtained by indexing reward_logits at each sequence's last non-padding token position.
  • sequence_lengths (torch.Tensor) — The index of the last non-padding token for each sequence, of shape (batch_size,). Useful for downstream processing and debugging.

Usage Examples

Basic Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from open_instruct.model_utils import get_reward

# Load reward model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "my-reward-model", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("my-reward-model")

# Tokenize some text
texts = ["This is a good response.", "This is a bad response."]
inputs = tokenizer(texts, padding=True, return_tensors="pt")

# Get rewards
reward_logits, final_scores, seq_lengths = get_reward(
    model, inputs["input_ids"], tokenizer.pad_token_id, context_length=0
)
# final_scores: tensor([0.42, -0.15])  (example values)

In Reward Model Training

import torch
import torch.nn.functional as F
from open_instruct.model_utils import get_reward

# Concatenate chosen and rejected sequences
query_responses = torch.cat(
    (data["input_ids_chosen"], data["input_ids_rejected"]), dim=0
)

# Forward pass to get rewards
_, predicted_reward, _ = get_reward(
    model, query_responses, tokenizer.pad_token_id, context_length=0
)

# Split into chosen and rejected rewards
chosen_reward = predicted_reward[:data["input_ids_chosen"].shape[0]]
rejected_reward = predicted_reward[data["input_ids_chosen"].shape[0]:]

# Compute Bradley-Terry loss
loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
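A common companion metric to the Bradley-Terry loss (not shown in the snippet above, so treat it as an illustrative assumption) is pairwise accuracy: the fraction of pairs in which the chosen response outscores the rejected one.

```python
import torch
import torch.nn.functional as F

# Toy reward values standing in for get_reward() outputs
chosen_reward = torch.tensor([0.8, -0.1, 0.3])
rejected_reward = torch.tensor([0.2, 0.4, -0.5])

# Bradley-Terry loss on the toy values
loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Pairwise accuracy: how often chosen > rejected
accuracy = (chosen_reward > rejected_reward).float().mean()
print(accuracy)  # tensor(0.6667): 2 of 3 pairs ranked correctly
```

Accuracy is the standard held-out metric for the evaluation pipeline mentioned under Usage, since it is invariant to the reward model's overall scale.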

Implementation Details

The function follows a specific sequence of operations:

Step 1: Attention mask construction

attention_mask = query_responses != pad_token_id

Step 2: Position ID computation

position_ids = attention_mask.cumsum(1) - attention_mask.long()

This computes an exclusive cumulative sum so that real tokens get contiguous position indices (0, 1, 2, ...), while right-padding positions all receive the count of real tokens in the row (one past the last real index), which is harmless since they are masked out.

Step 3: Input masking

input_ids = torch.masked_fill(query_responses, ~attention_mask, 0)

Padding tokens are replaced with 0 to avoid producing meaningful embeddings at padding positions.
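On a toy right-padded batch (pad_token_id = 2 here, chosen arbitrarily), steps 1-3 produce:

```python
import torch

pad_token_id = 2
query_responses = torch.tensor([[5, 6, 7, 2, 2],
                                [8, 9, 2, 2, 2]])

# Step 1: mask is True at real tokens, False at padding
attention_mask = query_responses != pad_token_id

# Step 2: exclusive cumsum gives 0, 1, 2, ... at real tokens;
# padding positions all get the count of real tokens in their row
position_ids = attention_mask.cumsum(1) - attention_mask.long()
print(position_ids)
# tensor([[0, 1, 2, 3, 3],
#         [0, 1, 2, 2, 2]])

# Step 3: padding token IDs are replaced with 0
input_ids = torch.masked_fill(query_responses, ~attention_mask, 0)
print(input_ids)
# tensor([[5, 6, 7, 0, 0],
#         [8, 9, 0, 0, 0]])
```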

Step 4: Backbone forward pass

lm_backbone = getattr(model, model.base_model_prefix)
output = lm_backbone(
    input_ids=input_ids,
    attention_mask=attention_mask,
    position_ids=position_ids,
    return_dict=True,
    output_hidden_states=True,
    use_cache=False,
)

Note: use_cache=False is explicitly set because some architectures (e.g., Mistral-based models) error when cache is enabled with the sequence classification head.

Step 5: Score head application

reward_logits = model.score(output.hidden_states[-1])

The score head is applied to the last hidden states from the transformer backbone.

Step 6: Sequence length computation and final score extraction

sequence_lengths = first_true_indices(
    query_responses[:, context_length:] == pad_token_id
) - 1 + context_length
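Here first_true_indices locates the first padding token after the context; subtracting 1 gives the index of the last real token, and adding context_length maps it back into the full sequence. The remaining gather of final scores can be sketched as follows (a sketch on toy tensors, not the verbatim source):

```python
import torch

# Toy per-token scores as produced by the score head: (batch, seq_len, 1)
reward_logits = torch.tensor([[[0.1], [0.2], [0.9], [0.0]],
                              [[0.5], [0.7], [0.0], [0.0]]])
sequence_lengths = torch.tensor([2, 1])  # last non-padding index per row

# Index each row at its last non-padding position, then drop the size-1 dim
batch_idx = torch.arange(reward_logits.size(0))
final_scores = reward_logits[batch_idx, sequence_lengths].squeeze(-1)
print(final_scores)  # tensor([0.9000, 0.7000])
```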

Dependencies

  • torch (torch.nn) — Model and tensor operations.
  • torch (torch.Tensor) — Input/output tensor type.
  • open_instruct (model_utils.first_true_indices) — Helper to find the index of the first True value in a boolean tensor, used for sequence-length computation.
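The first_true_indices helper is not reproduced on this page; a behavioral sketch (the real implementation may differ) that returns the row length when a row contains no True values:

```python
import torch

def first_true_indices_sketch(bools: torch.Tensor) -> torch.Tensor:
    # Index of the first True per row; rows with no True return row_len.
    row_len = bools.size(-1)
    idx = torch.arange(row_len, device=bools.device).expand_as(bools)
    masked = torch.where(bools, idx, torch.full_like(idx, row_len))
    return masked.min(dim=-1).values

# As used in get_reward(): last real token index for a right-padded row
seq = torch.tensor([[5, 6, 7, 2, 2]])  # pad_token_id = 2
print(first_true_indices_sketch(seq == 2) - 1)  # tensor([2])
```

The "row length when no True" convention matters: a sequence with no padding at all still yields its last valid index after the - 1 in Step 6.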

