Implementation: CarperAI trlx ROUGE Metric Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, NLP, Summarization |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
Concrete tool for computing ROUGE evaluation metrics on generated summaries using the HuggingFace evaluate library.
Description
The RLHF summarization evaluation script in trlx uses the HuggingFace evaluate library to compute ROUGE scores. The script loads a PPO-trained model, generates summaries for test prompts, and computes ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum against reference summaries. It also scores the generated summaries with the reward model for comparison. This is an External Tool Doc documenting how the evaluate library is used within trlx.
Usage
Use ROUGE evaluation after PPO training on summarization tasks to measure lexical (n-gram and longest-common-subsequence) overlap between generated and reference summaries; note that ROUGE measures surface overlap, not factual accuracy. Run the evaluation script with the trained model path to get ROUGE scores and reward-model comparisons.
Code Reference
Source Location
- Repository: trlx
- File: examples/summarize_rlhf/trlx_inference_gptj.py
- Lines: L63-90 (ROUGE computation)
Signature
```python
import evaluate

# Load the ROUGE metric
rouge = evaluate.load("rouge")

# Compute aggregated scores
results = rouge.compute(
    predictions=generated_summaries,  # List[str]
    references=reference_summaries,   # List[str]
)
# Returns: {"rouge1": float, "rouge2": float, "rougeL": float, "rougeLsum": float}
```
Import
```python
import evaluate

rouge = evaluate.load("rouge")
```
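As a quick sanity check, the metric can be exercised on toy strings (the exact values depend on the library version, so none are shown here):

```python
import evaluate

rouge = evaluate.load("rouge")
results = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat was sitting on the mat"],
)
print(results)
# {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...} -- floats in [0, 1]
```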
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| predictions | List[str] | Yes | Generated summary texts |
| references | List[str] | Yes | Reference (ground-truth) summary texts |
| use_stemmer | bool | No | Apply a Porter stemmer before matching (default False) |
| use_aggregator | bool | No | Aggregate scores across examples (default True) |
Outputs
| Name | Type | Description |
|---|---|---|
| rouge1 | float | Unigram overlap F1 score |
| rouge2 | float | Bigram overlap F1 score |
| rougeL | float | Longest common subsequence F1 score |
| rougeLsum | float | Summary-level LCS F1 score |
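All four outputs are F1 scores in [0, 1], aggregated over the whole test set by default. If per-example scores are needed, for instance to inspect the weakest summaries, the metric accepts use_aggregator=False, in which case each key maps to a list of floats instead of a single float. A minimal sketch:

```python
import evaluate

rouge = evaluate.load("rouge")
per_example = rouge.compute(
    predictions=["summary one text", "summary two text"],
    references=["reference one text", "reference two text"],
    use_aggregator=False,  # one score per prediction/reference pair
)
print(per_example["rouge1"])  # a list of floats, one per example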
Usage Examples
Evaluate PPO-Trained Summarization Model
```python
import evaluate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load the PPO-trained model
model = AutoModelForCausalLM.from_pretrained("CarperAI/openai_summarize_tldr_ppo")
tokenizer = AutoTokenizer.from_pretrained("CarperAI/openai_summarize_tldr_ppo")
model.eval()

# Load the first 100 test examples
dataset = load_dataset("CarperAI/openai_summarize_tldr", split="test[:100]")

# Generate summaries
predictions = []
for sample in dataset:
    prompt = sample["prompt"] + "\nTL;DR:"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=50)
    # Decode only the newly generated tokens, not the prompt
    generated = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
    predictions.append(generated)

references = [sample["label"] for sample in dataset]

# Compute ROUGE
rouge = evaluate.load("rouge")
results = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-1:    {results['rouge1']:.4f}")
print(f"ROUGE-2:    {results['rouge2']:.4f}")
print(f"ROUGE-L:    {results['rougeL']:.4f}")
print(f"ROUGE-Lsum: {results['rougeLsum']:.4f}")
```
Combined ROUGE + Reward Model Evaluation
```python
import pandas as pd

# Continues from the previous example; reward_fn is assumed to be a callable
# that scores a list of texts with the trained reward model (List[str] -> List[float]).
rouge_results = rouge.compute(predictions=predictions, references=references)
pred_rewards = reward_fn(predictions)
ref_rewards = reward_fn(references)

# Log per-example results alongside reward scores
results_df = pd.DataFrame({
    "prediction": predictions,
    "reference": references,
    "pred_reward": pred_rewards,
    "ref_reward": ref_rewards,
})
results_df.to_csv("ppo_with_reward_scores.csv", index=False)

print(f"ROUGE-1: {rouge_results['rouge1']:.4f}")
print(f"Mean pred reward: {sum(pred_rewards)/len(pred_rewards):.4f}")
print(f"Mean ref reward: {sum(ref_rewards)/len(ref_rewards):.4f}")
```