Implementation: AllenAI Open Instruct RM Evaluate
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning from Human Feedback, Reward Modeling, Model Evaluation |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for evaluating a trained reward model on held-out preference data, computing accuracy, loss, and score distribution metrics, provided by Open Instruct.
Description
The evaluate() function performs a complete evaluation pass over a preference dataset using a trained reward model. For each batch, it:
- Concatenates chosen and rejected sequences into a single tensor for efficient forward pass.
- Calls get_reward() to obtain scalar rewards for all sequences.
- Splits the rewards back into chosen and rejected groups.
- Computes pairwise accuracy (fraction where chosen reward exceeds rejected reward).
- Computes Bradley-Terry loss using -F.logsigmoid(chosen - rejected).mean().
- Accumulates chosen reward means, rejected reward means, and reward margins.
- Optionally collects sample-level text data (shared prompts, chosen/rejected responses, scores, and correctness) for qualitative analysis.
The function returns two objects: a dictionary of aggregated metrics and an optional dictionary of sample-level data suitable for display in Weights & Biases tables or Rich console tables.
The evaluation is performed entirely under torch.no_grad() to save memory and computation. The model is set to evaluation mode at the start and restored to training mode at the end.
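The concatenate-score-split pattern at the heart of the loop can be sketched in plain Python. Here score_fn is a stand-in for get_reward(); the real code batches token-ID tensors into a single forward pass rather than scoring sequences one at a time:

```python
def score_preference_batch(score_fn, chosen_seqs, rejected_seqs):
    """Sketch of evaluate()'s concatenate-score-split pattern.

    score_fn is a hypothetical stand-in for the reward model's scoring call;
    the actual implementation concatenates padded tensors and runs one forward pass.
    """
    n = len(chosen_seqs)
    combined = chosen_seqs + rejected_seqs        # one "forward pass" over all sequences
    rewards = [score_fn(seq) for seq in combined]
    return rewards[:n], rewards[n:]               # chosen rewards, rejected rewards
```

Splitting with rewards[:n] and rewards[n:] works because the chosen sequences always occupy the first half of the concatenated batch.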
Usage
Use this function for periodic evaluation during reward model training, for final evaluation of a trained checkpoint, or for comparing multiple reward model variants on a common evaluation set.
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/reward_modeling_eval.py, lines 32-97
Signature
```python
def evaluate(
    model: PreTrainedModel,
    dataloader: DataLoader,
    tokenizer: PreTrainedTokenizer,
    max_sampled_texts: int = 0,
) -> tuple[dict, dict]:
```
Import
```python
from open_instruct.reward_modeling_eval import evaluate
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | PreTrainedModel | Yes | A trained reward model (AutoModelForSequenceClassification with num_labels=1). Can be wrapped by DeepSpeed or Accelerate. |
| dataloader | DataLoader | Yes | A PyTorch DataLoader yielding batches of preference pairs. Each batch must contain input_ids_chosen and input_ids_rejected keys (as produced by SimplePreferenceCollator). |
| tokenizer | PreTrainedTokenizer | Yes | The tokenizer associated with the model. Used to decode token IDs into text for the sample table and to access pad_token_id for reward extraction. |
| max_sampled_texts | int | No | Maximum number of sample-level text examples to collect for qualitative analysis. Set to 0 (default) to skip text collection. Set to a positive integer to collect up to that many examples. |
Outputs
| Name | Type | Description |
|---|---|---|
| metrics_dict | dict[str, float] | Dictionary containing aggregated evaluation metrics, averaged over all batches. |
| sample_table | dict[str, list] or None | Dictionary of sample-level data for qualitative analysis. None if max_sampled_texts=0. |
Metrics Dictionary Keys
| Key | Description |
|---|---|
| eval/rm/accuracy | Fraction of preference pairs where the model correctly assigns a higher reward to the chosen completion. |
| eval/rm/loss | Average Bradley-Terry loss across the evaluation dataset. |
| eval/rm/chosen_rewards | Average reward score assigned to chosen completions. |
| eval/rm/rejected_rewards | Average reward score assigned to rejected completions. |
| eval/rm/reward_margin | Average difference between chosen and rejected rewards: (chosen_rewards - rejected_rewards).mean(). |
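As a numerical reference, the aggregated metrics can be reproduced from paired reward scores with stdlib Python. This is a sketch, not the torch implementation; it uses the identity -logsigmoid(x) == log(1 + exp(-x)):

```python
import math

def bradley_terry_metrics(chosen_rewards, rejected_rewards):
    """Compute accuracy, Bradley-Terry loss, and reward margin for paired scores.

    A plain-Python sketch of the metric math; the actual evaluate() operates on
    torch tensors and averages these quantities over batches.
    """
    pairs = list(zip(chosen_rewards, rejected_rewards))
    accuracy = sum(c > r for c, r in pairs) / len(pairs)
    # -F.logsigmoid(c - r) is equivalent to log(1 + exp(-(c - r)))
    loss = sum(math.log1p(math.exp(-(c - r))) for c, r in pairs) / len(pairs)
    margin = sum(c - r for c, r in pairs) / len(pairs)
    return {"accuracy": accuracy, "loss": loss, "margin": margin}
```

Note that a large positive margin drives the loss toward zero, while a misranked pair (negative margin) contributes a loss greater than log 2.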
Sample Table Keys
| Key | Description |
|---|---|
| shared prompt text | The prompt text shared between chosen and rejected completions. |
| chosen response text | The portion of the chosen completion that differs from the rejected completion. |
| rejected response text | The portion of the rejected completion that differs from the chosen completion. |
| chosen reward, rejected reward | A list of [chosen_score, rejected_score] rounded to 4 decimal places. |
| correct prediction | Boolean indicating whether the model correctly ranked the chosen completion higher. |
Usage Examples
Basic Usage
```python
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from open_instruct.reward_modeling_eval import evaluate
from open_instruct.dataset_transformation import SimplePreferenceCollator

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "my-reward-model", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("my-reward-model")

# Create evaluation dataloader (eval_dataset: a tokenized preference dataset)
collator = SimplePreferenceCollator(pad_token_id=tokenizer.pad_token_id)
eval_dataloader = DataLoader(eval_dataset, batch_size=8, collate_fn=collator)

# Run evaluation
metrics, sample_table = evaluate(
    model, eval_dataloader, tokenizer, max_sampled_texts=10
)

print(f"Accuracy: {metrics['eval/rm/accuracy']:.4f}")
print(f"Loss: {metrics['eval/rm/loss']:.4f}")
print(f"Reward margin: {metrics['eval/rm/reward_margin']:.4f}")
```
During Training (from reward_modeling.py)
```python
from open_instruct.reward_modeling_eval import evaluate

# Periodic evaluation within the training loop
if training_step % eval_freq == 0 and eval_dataloader is not None:
    eval_metrics, table = evaluate(
        model, eval_dataloader, tokenizer, max_sampled_texts=10
    )

    # Gather sample texts across processes for distributed training
    for key in table:
        table[key] = gather_object(table[key])

    # Log to W&B or print to console
    if accelerator.is_main_process:
        print_rich_single_line_metrics(eval_metrics)
        if args.with_tracking:
            wandb.log({
                "preference_sample_texts": wandb.Table(
                    dataframe=pd.DataFrame(table)
                )
            })
```
Standalone Evaluation Script
"""
The reward_modeling_eval.py file includes a __main__ block
that demonstrates standalone evaluation:
"""
from transformers import AutoModelForSequenceClassification
from open_instruct.dataset_transformation import (
TokenizerConfig, SimplePreferenceCollator,
get_cached_dataset_tulu,
CHOSEN_INPUT_IDS_KEY, REJECTED_INPUT_IDS_KEY,
)
from open_instruct.reward_modeling_eval import evaluate
model = AutoModelForSequenceClassification.from_pretrained(
"EleutherAI/pythia-14m", num_labels=1
)
tc = TokenizerConfig(tokenizer_name_or_path="EleutherAI/pythia-14m")
tokenizer = tc.tokenizer
eval_dataset = get_cached_dataset_tulu(
["trl-internal-testing/sentiment-trl-style", "1.0"],
["test"], tc,
["preference_tokenize_v1", "preference_filter_v1"],
[{}, {"max_token_length": 1024, "max_prompt_token_length": 512}],
target_columns=[CHOSEN_INPUT_IDS_KEY, REJECTED_INPUT_IDS_KEY],
)
dataloader = DataLoader(
eval_dataset, batch_size=8,
collate_fn=SimplePreferenceCollator(tokenizer.pad_token_id)
)
metrics, table = evaluate(model, dataloader, tokenizer, max_sampled_texts=5)
Dependencies
| Package | Module | Purpose |
|---|---|---|
| torch | torch.nn.functional | Bradley-Terry loss computation via F.logsigmoid |
| torch | torch.no_grad | Disabling gradient computation during evaluation |
| tqdm | tqdm | Progress bar for evaluation loop |
| transformers | PreTrainedModel, PreTrainedTokenizer | Type annotations and model/tokenizer interfaces |
| open_instruct | model_utils.get_reward | Extracting scalar rewards from the reward model |
| open_instruct | dataset_transformation | Dataset keys (CHOSEN_INPUT_IDS_KEY, REJECTED_INPUT_IDS_KEY) |
Implementation Details
The function uses a helper find_shared_text(chosen_text, rejected_text) to extract the common prompt prefix from decoded chosen and rejected texts. This works by iterating character-by-character until the texts diverge, which is useful for displaying sample-level results where the prompt and response portions are shown separately.
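The described character-by-character behavior can be reproduced in a few lines. This is a sketch of the longest-common-prefix logic, not necessarily byte-for-byte identical to the helper in the repository:

```python
def find_shared_text(chosen_text: str, rejected_text: str) -> str:
    """Return the longest common prefix of two decoded completions.

    Sketch of the helper described above: walk both strings in lockstep
    until the first character where they diverge.
    """
    i = 0
    limit = min(len(chosen_text), len(rejected_text))
    while i < limit and chosen_text[i] == rejected_text[i]:
        i += 1
    return chosen_text[:i]
```

Everything after the shared prefix is treated as the response portion, which is why the sample table can display prompt and responses separately even though the dataset stores full chosen/rejected sequences.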
The model is set to model.eval() at the start and model.train() at the end, ensuring that evaluation-specific behaviors (e.g., deterministic batch normalization, though unlikely in transformers) are active during the evaluation pass without affecting subsequent training.
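The same eval/train toggling can be packaged as a reusable context manager. This is a generic pattern, not part of Open Instruct, which toggles the modes explicitly instead:

```python
from contextlib import contextmanager

@contextmanager
def eval_mode(model):
    """Temporarily switch a model to eval mode, restoring train mode afterward.

    Generic sketch: works with any object exposing eval()/train() methods,
    such as a torch.nn.Module. The finally clause guarantees restoration
    even if the evaluation body raises.
    """
    model.eval()
    try:
        yield model
    finally:
        model.train()
```

Using try/finally ensures a crash mid-evaluation cannot leave the model stuck in eval mode for subsequent training steps.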