
Implementation:Allenai Open instruct RM Evaluate

From Leeroopedia


Knowledge Sources
Domains Reinforcement Learning from Human Feedback, Reward Modeling, Model Evaluation
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete tool from Open Instruct for evaluating a trained reward model on held-out preference data, computing accuracy, loss, and score-distribution metrics.

Description

The evaluate() function performs a complete evaluation pass over a preference dataset using a trained reward model. For each batch, it:

  1. Concatenates chosen and rejected sequences into a single tensor for efficient forward pass.
  2. Calls get_reward() to obtain scalar rewards for all sequences.
  3. Splits the rewards back into chosen and rejected groups.
  4. Computes pairwise accuracy (fraction where chosen reward exceeds rejected reward).
  5. Computes Bradley-Terry loss using -F.logsigmoid(chosen - rejected).mean().
  6. Accumulates chosen reward means, rejected reward means, and reward margins.
  7. Optionally collects sample-level text data (shared prompts, chosen/rejected responses, scores, and correctness) for qualitative analysis.

The function returns two objects: a dictionary of aggregated metrics and an optional dictionary of sample-level data suitable for display in Weights & Biases tables or Rich console tables.

The evaluation is performed entirely under torch.no_grad() to save memory and computation. The model is set to evaluation mode at the start and restored to training mode at the end.
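The following is a minimal sketch of the per-batch metric computation (steps 4-6 above), assuming the scalar rewards for the chosen and rejected halves have already been extracted; the tensors here are illustrative placeholders, not model outputs.

import torch
import torch.nn.functional as F

# Illustrative reward scores for three preference pairs (placeholders).
chosen_rewards = torch.tensor([1.2, 0.4, 2.0])
rejected_rewards = torch.tensor([0.3, 0.9, -0.5])

# Step 4: pairwise accuracy -- fraction of pairs ranked correctly.
accuracy = (chosen_rewards > rejected_rewards).float().mean()

# Step 5: Bradley-Terry loss, -E[log sigmoid(r(y_w) - r(y_l))].
loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Step 6: reward margin, averaged over the batch.
reward_margin = (chosen_rewards - rejected_rewards).mean()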

Usage

Use this function for periodic evaluation during reward model training, for final evaluation of a trained checkpoint, or for comparing multiple reward model variants on a common evaluation set.
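For example, comparing reward model variants reduces to a loop over checkpoints; the sketch below uses hypothetical checkpoint names and assumes eval_dataloader and tokenizer are constructed as in the usage examples further down.

from transformers import AutoModelForSequenceClassification
from open_instruct.reward_modeling_eval import evaluate

# Hypothetical checkpoint names; replace with your own trained variants.
for name in ["my-rm-base", "my-rm-large"]:
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)
    metrics, _ = evaluate(model, eval_dataloader, tokenizer)
    print(name, metrics["eval/rm/accuracy"], metrics["eval/rm/reward_margin"])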

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/reward_modeling_eval.py, lines 32-97

Signature

def evaluate(
    model: PreTrainedModel,
    dataloader: DataLoader,
    tokenizer: PreTrainedTokenizer,
    max_sampled_texts: int = 0,
) -> tuple[dict, dict]:

Import

from open_instruct.reward_modeling_eval import evaluate

I/O Contract

Inputs

Name Type Required Description
model PreTrainedModel Yes A trained reward model (AutoModelForSequenceClassification with num_labels=1). Can be wrapped by DeepSpeed or Accelerate.
dataloader DataLoader Yes A PyTorch DataLoader yielding batches of preference pairs. Each batch must contain input_ids_chosen and input_ids_rejected keys (as produced by SimplePreferenceCollator).
tokenizer PreTrainedTokenizer Yes The tokenizer associated with the model. Used to decode token IDs into text for the sample table and to access pad_token_id for reward extraction.
max_sampled_texts int No Maximum number of sample-level text examples to collect for qualitative analysis. Set to 0 (default) to skip text collection. Set to a positive integer to collect up to that many examples.
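For reference, a collated batch is expected to look roughly like the sketch below; the tensor values are illustrative only, and the key strings correspond to the CHOSEN_INPUT_IDS_KEY and REJECTED_INPUT_IDS_KEY constants from open_instruct.dataset_transformation.

import torch

# Sketch of a collated batch of two preference pairs, right-padded
# with the tokenizer's pad token (0 here, for illustration only).
batch = {
    "input_ids_chosen": torch.tensor([[5, 7, 9, 11], [5, 7, 2, 0]]),
    "input_ids_rejected": torch.tensor([[5, 7, 1, 0], [5, 7, 3, 0]]),
}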

Outputs

Name Type Description
metrics_dict dict[str, float] Dictionary containing aggregated evaluation metrics, averaged over all batches.
sample_table dict[str, list] or None Dictionary of sample-level data for qualitative analysis. None if max_sampled_texts=0.

Metrics Dictionary Keys

Key Description
eval/rm/accuracy Fraction of preference pairs where the model correctly assigns a higher reward to the chosen completion.
eval/rm/loss Average Bradley-Terry loss across the evaluation dataset.
eval/rm/chosen_rewards Average reward score assigned to chosen completions.
eval/rm/rejected_rewards Average reward score assigned to rejected completions.
eval/rm/reward_margin Average difference between chosen and rejected rewards: 𝔼[r(y_w) − r(y_l)].

Sample Table Keys

Key Description
shared prompt text The prompt text shared between chosen and rejected completions.
chosen response text The portion of the chosen completion that differs from the rejected completion.
rejected response text The portion of the rejected completion that differs from the chosen completion.
chosen reward, rejected reward A list of [chosen_score, rejected_score] rounded to 4 decimal places.
correct prediction Boolean indicating whether the model correctly ranked the chosen completion higher.
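As a sketch, the returned sample_table dictionary can be printed as a Rich console table (mirroring the display mentioned in the Description), with one column per key above:

from rich.console import Console
from rich.table import Table

def print_sample_table(sample_table: dict) -> None:
    # One column per dictionary key, one row per collected example.
    table = Table(show_lines=True)
    for key in sample_table:
        table.add_column(key)
    for row in zip(*sample_table.values()):
        table.add_row(*(str(cell) for cell in row))
    Console().print(table)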

Usage Examples

Basic Usage

from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from open_instruct.reward_modeling_eval import evaluate
from open_instruct.dataset_transformation import SimplePreferenceCollator

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "my-reward-model", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("my-reward-model")

# Create evaluation dataloader (eval_dataset is assumed to be an already
# tokenized preference dataset with chosen/rejected input ID columns)
collator = SimplePreferenceCollator(pad_token_id=tokenizer.pad_token_id)
eval_dataloader = DataLoader(eval_dataset, batch_size=8, collate_fn=collator)

# Run evaluation
metrics, sample_table = evaluate(
    model, eval_dataloader, tokenizer, max_sampled_texts=10
)

print(f"Accuracy: {metrics['eval/rm/accuracy']:.4f}")
print(f"Loss: {metrics['eval/rm/loss']:.4f}")
print(f"Reward margin: {metrics['eval/rm/reward_margin']:.4f}")

During Training (from reward_modeling.py)

import pandas as pd
import wandb
from accelerate.utils import gather_object
from open_instruct.reward_modeling_eval import evaluate
# accelerator, args, model, eval_dataloader, tokenizer, and
# print_rich_single_line_metrics are assumed to come from the
# surrounding training script (see reward_modeling.py)

# Periodic evaluation within the training loop
if training_step % eval_freq == 0 and eval_dataloader is not None:
    eval_metrics, table = evaluate(
        model, eval_dataloader, tokenizer, max_sampled_texts=10
    )
    # Gather sample texts across processes for distributed training
    for key in table:
        table[key] = gather_object(table[key])

    # Log to W&B or print to console
    if accelerator.is_main_process:
        print_rich_single_line_metrics(eval_metrics)
        if args.with_tracking:
            wandb.log({
                "preference_sample_texts": wandb.Table(
                    dataframe=pd.DataFrame(table)
                )
            })

Standalone Evaluation Script

"""
The reward_modeling_eval.py file includes a __main__ block
that demonstrates standalone evaluation:
"""
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification
from open_instruct.dataset_transformation import (
    TokenizerConfig, SimplePreferenceCollator,
    get_cached_dataset_tulu,
    CHOSEN_INPUT_IDS_KEY, REJECTED_INPUT_IDS_KEY,
)
from open_instruct.reward_modeling_eval import evaluate

model = AutoModelForSequenceClassification.from_pretrained(
    "EleutherAI/pythia-14m", num_labels=1
)
tc = TokenizerConfig(tokenizer_name_or_path="EleutherAI/pythia-14m")
tokenizer = tc.tokenizer
eval_dataset = get_cached_dataset_tulu(
    ["trl-internal-testing/sentiment-trl-style", "1.0"],
    ["test"], tc,
    ["preference_tokenize_v1", "preference_filter_v1"],
    [{}, {"max_token_length": 1024, "max_prompt_token_length": 512}],
    target_columns=[CHOSEN_INPUT_IDS_KEY, REJECTED_INPUT_IDS_KEY],
)
dataloader = DataLoader(
    eval_dataset, batch_size=8,
    collate_fn=SimplePreferenceCollator(tokenizer.pad_token_id)
)
metrics, table = evaluate(model, dataloader, tokenizer, max_sampled_texts=5)

Dependencies

Package Module Purpose
torch torch.nn.functional Bradley-Terry loss computation via F.logsigmoid
torch torch.no_grad Disabling gradient computation during evaluation
tqdm tqdm Progress bar for evaluation loop
transformers PreTrainedModel, PreTrainedTokenizer Type annotations and model/tokenizer interfaces
open_instruct model_utils.get_reward Extracting scalar rewards from the reward model
open_instruct dataset_transformation Dataset keys (CHOSEN_INPUT_IDS_KEY, REJECTED_INPUT_IDS_KEY)

Implementation Details

The function uses a helper find_shared_text(chosen_text, rejected_text) to extract the common prompt prefix from decoded chosen and rejected texts. This works by iterating character-by-character until the texts diverge, which is useful for displaying sample-level results where the prompt and response portions are shown separately.
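A minimal sketch of that common-prefix logic (equivalent behavior, not necessarily the exact source):

def find_shared_text(chosen_text: str, rejected_text: str) -> str:
    # Walk both decoded strings until the first differing character;
    # everything before that point is the shared prompt prefix.
    for i, (a, b) in enumerate(zip(chosen_text, rejected_text)):
        if a != b:
            return chosen_text[:i]
    # One text is a prefix of the other: the shorter one is fully shared.
    return chosen_text[: min(len(chosen_text), len(rejected_text))]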

The model is set to model.eval() at the start and restored to model.train() at the end, ensuring that inference-time behaviors (e.g., disabled dropout, or running-statistics batch normalization, though the latter is rare in transformers) are active during the evaluation pass without affecting subsequent training.

Related Pages

Implements Principle

Related Implementations
