Implementation: AllenAI Open Instruct RM Evaluate
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning from Human Feedback, Reward Modeling, Model Evaluation |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for evaluating a trained reward model on held-out preference data, computing accuracy, loss, and score distribution metrics, provided by Open Instruct.
Description
The evaluate() function performs a complete evaluation pass over a preference dataset using a trained reward model. For each batch, it:
- Concatenates chosen and rejected sequences into a single tensor for efficient forward pass.
- Calls get_reward() to obtain scalar rewards for all sequences.
- Splits the rewards back into chosen and rejected groups.
- Computes pairwise accuracy (fraction where chosen reward exceeds rejected reward).
- Computes Bradley-Terry loss using -F.logsigmoid(chosen - rejected).mean().
- Accumulates chosen reward means, rejected reward means, and reward margins.
- Optionally collects sample-level text data (shared prompts, chosen/rejected responses, scores, and correctness) for qualitative analysis.
The function returns two objects: a dictionary of aggregated metrics and an optional dictionary of sample-level data suitable for display in Weights & Biases tables or Rich console tables.
The evaluation is performed entirely under torch.no_grad() to save memory and computation. The model is set to evaluation mode at the start and restored to training mode at the end.
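The concatenate-score-split pattern at the heart of the loop can be sketched in plain Python. Here score_fn is a stand-in for get_reward(); the real code batches token-ID tensors into a single forward pass rather than scoring sequences one at a time:

```python
def score_preference_batch(score_fn, chosen_seqs, rejected_seqs):
    """Sketch of evaluate()'s concatenate-score-split pattern.

    score_fn is a hypothetical stand-in for the reward model's scoring call;
    the actual implementation concatenates padded tensors and runs one forward pass.
    """
    n = len(chosen_seqs)
    combined = chosen_seqs + rejected_seqs        # one "forward pass" over all sequences
    rewards = [score_fn(seq) for seq in combined]
    return rewards[:n], rewards[n:]               # chosen rewards, rejected rewards
```

Splitting with rewards[:n] and rewards[n:] works because the chosen sequences always occupy the first half of the concatenated batch.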
Usage
Use this function for periodic evaluation during reward model training, for final evaluation of a trained checkpoint, or for comparing multiple reward model variants on a common evaluation set.
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/reward_modeling_eval.py, lines 32-97
Signature
```python
def evaluate(
    model: PreTrainedModel,
    dataloader: DataLoader,
    tokenizer: PreTrainedTokenizer,
    max_sampled_texts: int = 0,
) -> tuple[dict, dict]:
```
Import
```python
from open_instruct.reward_modeling_eval import evaluate
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | PreTrainedModel | Yes | A trained reward model (AutoModelForSequenceClassification with num_labels=1). Can be wrapped by DeepSpeed or Accelerate. |
| dataloader | DataLoader | Yes | A PyTorch DataLoader yielding batches of preference pairs. Each batch must contain input_ids_chosen and input_ids_rejected keys (as produced by SimplePreferenceCollator). |
| tokenizer | PreTrainedTokenizer | Yes | The tokenizer associated with the model. Used to decode token IDs into text for the sample table and to access pad_token_id for reward extraction. |
| max_sampled_texts | int | No | Maximum number of sample-level text examples to collect for qualitative analysis. Set to 0 (default) to skip text collection. Set to a positive integer to collect up to that many examples. |
Outputs
| Name | Type | Description |
|---|---|---|
| metrics_dict | dict[str, float] | Dictionary containing aggregated evaluation metrics, averaged over all batches. |
| sample_table | dict[str, list] or None | Dictionary of sample-level data for qualitative analysis. None if max_sampled_texts=0. |
Metrics Dictionary Keys
| Key | Description |
|---|---|
| eval/rm/accuracy | Fraction of preference pairs where the model correctly assigns a higher reward to the chosen completion. |
| eval/rm/loss | Average Bradley-Terry loss across the evaluation dataset. |
| eval/rm/chosen_rewards | Average reward score assigned to chosen completions. |
| eval/rm/rejected_rewards | Average reward score assigned to rejected completions. |
| eval/rm/reward_margin | Average difference between chosen and rejected rewards: (chosen_rewards - rejected_rewards).mean(). |
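As a numerical reference, the aggregated metrics can be reproduced from paired reward scores with stdlib Python. This is a sketch, not the torch implementation; it uses the identity -logsigmoid(x) == log(1 + exp(-x)):

```python
import math

def bradley_terry_metrics(chosen_rewards, rejected_rewards):
    """Compute accuracy, Bradley-Terry loss, and reward margin for paired scores.

    A plain-Python sketch of the metric math; the actual evaluate() operates on
    torch tensors and averages these quantities over batches.
    """
    pairs = list(zip(chosen_rewards, rejected_rewards))
    accuracy = sum(c > r for c, r in pairs) / len(pairs)
    # -F.logsigmoid(c - r) is equivalent to log(1 + exp(-(c - r)))
    loss = sum(math.log1p(math.exp(-(c - r))) for c, r in pairs) / len(pairs)
    margin = sum(c - r for c, r in pairs) / len(pairs)
    return {"accuracy": accuracy, "loss": loss, "margin": margin}
```

Note that a large positive margin drives the loss toward zero, while a misranked pair (negative margin) contributes a loss greater than log 2.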
Sample Table Keys
| Key | Description |
|---|---|
| shared prompt text | The prompt text shared between chosen and rejected completions. |
| chosen response text | The portion of the chosen completion that differs from the rejected completion. |
| rejected response text | The portion of the rejected completion that differs from the chosen completion. |
| chosen reward, rejected reward | A list of [chosen_score, rejected_score] rounded to 4 decimal places. |
| correct prediction | Boolean indicating whether the model correctly ranked the chosen completion higher. |
Usage Examples
Basic Usage
```python
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from open_instruct.reward_modeling_eval import evaluate
from open_instruct.dataset_transformation import SimplePreferenceCollator

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "my-reward-model", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("my-reward-model")

# Create evaluation dataloader (eval_dataset: a tokenized preference dataset)
collator = SimplePreferenceCollator(pad_token_id=tokenizer.pad_token_id)
eval_dataloader = DataLoader(eval_dataset, batch_size=8, collate_fn=collator)

# Run evaluation
metrics, sample_table = evaluate(
    model, eval_dataloader, tokenizer, max_sampled_texts=10
)

print(f"Accuracy: {metrics['eval/rm/accuracy']:.4f}")
print(f"Loss: {metrics['eval/rm/loss']:.4f}")
print(f"Reward margin: {metrics['eval/rm/reward_margin']:.4f}")
```
During Training (from reward_modeling.py)
```python
from open_instruct.reward_modeling_eval import evaluate

# Periodic evaluation within the training loop
if training_step % eval_freq == 0 and eval_dataloader is not None:
    eval_metrics, table = evaluate(
        model, eval_dataloader, tokenizer, max_sampled_texts=10
    )

    # Gather sample texts across processes for distributed training
    for key in table:
        table[key] = gather_object(table[key])

    # Log to W&B or print to console
    if accelerator.is_main_process:
        print_rich_single_line_metrics(eval_metrics)
        if args.with_tracking:
            wandb.log({
                "preference_sample_texts": wandb.Table(
                    dataframe=pd.DataFrame(table)
                )
            })
```
Standalone Evaluation Script
"""
The reward_modeling_eval.py file includes a __main__ block
that demonstrates standalone evaluation:
"""
from transformers import AutoModelForSequenceClassification
from open_instruct.dataset_transformation import (
TokenizerConfig, SimplePreferenceCollator,
get_cached_dataset_tulu,
CHOSEN_INPUT_IDS_KEY, REJECTED_INPUT_IDS_KEY,
)
from open_instruct.reward_modeling_eval import evaluate
model = AutoModelForSequenceClassification.from_pretrained(
"EleutherAI/pythia-14m", num_labels=1
)
tc = TokenizerConfig(tokenizer_name_or_path="EleutherAI/pythia-14m")
tokenizer = tc.tokenizer
eval_dataset = get_cached_dataset_tulu(
["trl-internal-testing/sentiment-trl-style", "1.0"],
["test"], tc,
["preference_tokenize_v1", "preference_filter_v1"],
[{}, {"max_token_length": 1024, "max_prompt_token_length": 512}],
target_columns=[CHOSEN_INPUT_IDS_KEY, REJECTED_INPUT_IDS_KEY],
)
dataloader = DataLoader(
eval_dataset, batch_size=8,
collate_fn=SimplePreferenceCollator(tokenizer.pad_token_id)
)
metrics, table = evaluate(model, dataloader, tokenizer, max_sampled_texts=5)
Dependencies
| Package | Module | Purpose |
|---|---|---|
| torch | torch.nn.functional | Bradley-Terry loss computation via F.logsigmoid |
| torch | torch.no_grad | Disabling gradient computation during evaluation |
| tqdm | tqdm | Progress bar for evaluation loop |
| transformers | PreTrainedModel, PreTrainedTokenizer | Type annotations and model/tokenizer interfaces |
| open_instruct | model_utils.get_reward | Extracting scalar rewards from the reward model |
| open_instruct | dataset_transformation | Dataset keys (CHOSEN_INPUT_IDS_KEY, REJECTED_INPUT_IDS_KEY) |
Implementation Details
The function uses a helper find_shared_text(chosen_text, rejected_text) to extract the common prompt prefix from decoded chosen and rejected texts. This works by iterating character-by-character until the texts diverge, which is useful for displaying sample-level results where the prompt and response portions are shown separately.
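The described character-by-character behavior can be reproduced in a few lines. This is a sketch of the longest-common-prefix logic, not necessarily byte-for-byte identical to the helper in the repository:

```python
def find_shared_text(chosen_text: str, rejected_text: str) -> str:
    """Return the longest common prefix of two decoded completions.

    Sketch of the helper described above: walk both strings in lockstep
    until the first character where they diverge.
    """
    i = 0
    limit = min(len(chosen_text), len(rejected_text))
    while i < limit and chosen_text[i] == rejected_text[i]:
        i += 1
    return chosen_text[:i]
```

Everything after the shared prefix is treated as the response portion, which is why the sample table can display prompt and responses separately even though the dataset stores full chosen/rejected sequences.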
The model is set to model.eval() at the start and model.train() at the end, ensuring that evaluation-specific behaviors (e.g., deterministic batch normalization, though unlikely in transformers) are active during the evaluation pass without affecting subsequent training.
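The same eval/train toggling can be packaged as a reusable context manager. This is a generic pattern, not part of Open Instruct, which toggles the modes explicitly instead:

```python
from contextlib import contextmanager

@contextmanager
def eval_mode(model):
    """Temporarily switch a model to eval mode, restoring train mode afterward.

    Generic sketch: works with any object exposing eval()/train() methods,
    such as a torch.nn.Module. The finally clause guarantees restoration
    even if the evaluation body raises.
    """
    model.eval()
    try:
        yield model
    finally:
        model.train()
```

Using try/finally ensures a crash mid-evaluation cannot leave the model stuck in eval mode for subsequent training steps.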