Implementation:CarperAI Trlx ROUGE Metric Evaluation


Knowledge Sources

  • Domains: Evaluation, NLP, Summarization
  • Last Updated: 2026-02-07 16:00 GMT

Overview

A concrete tool for computing ROUGE evaluation metrics on generated summaries, using the HuggingFace evaluate library.

Description

The RLHF summarization evaluation script in trlx uses the HuggingFace evaluate library to compute ROUGE scores. The script loads a PPO-trained model, generates summaries for test prompts, and computes ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum against reference summaries. It also scores the generated summaries with the reward model for comparison. This is an External Tool Doc documenting how the evaluate library is used within trlx.

Usage

Use ROUGE evaluation after PPO training on summarization tasks to measure n-gram overlap between generated and reference summaries. Run the evaluation script with the trained model path to get ROUGE scores and reward-model comparisons.

Code Reference

Source Location

  • Repository: trlx
  • File: examples/summarize_rlhf/trlx_inference_gptj.py
  • Lines: L63-90 (ROUGE computation)

Signature

import evaluate

# Load ROUGE metric
rouge = evaluate.load("rouge")

# Compute scores
results = rouge.compute(
    predictions=generated_summaries,  # List[str]
    references=reference_summaries,   # List[str]
)
# Returns: {"rouge1": float, "rouge2": float, "rougeL": float, "rougeLsum": float}

Import

import evaluate
rouge = evaluate.load("rouge")
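
For per-example scores instead of a single aggregate, the evaluate ROUGE metric also accepts use_aggregator and use_stemmer flags. A minimal sketch with toy inputs:

per_example = rouge.compute(
    predictions=["summary one", "summary two"],
    references=["reference one", "reference two"],
    use_aggregator=False,  # return one score per example instead of an aggregate
)
# per_example["rouge1"] is now List[float], aligned with the input order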

I/O Contract

Inputs

Name         Type       Required  Description
predictions  List[str]  Yes       Generated summary texts
references   List[str]  Yes       Reference (ground-truth) summary texts

Outputs

Name       Type   Description
rouge1     float  Unigram overlap F1 score
rouge2     float  Bigram overlap F1 score
rougeL     float  Longest common subsequence (LCS) F1 score
rougeLsum  float  Summary-level LCS F1 score
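
As a quick sanity check of this contract, the metric can be run on toy inputs (the exact values depend on the installed evaluate version):

import evaluate

rouge = evaluate.load("rouge")
results = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["a cat sat on the mat"],
)
for key in ("rouge1", "rouge2", "rougeL", "rougeLsum"):
    print(f"{key}: {results[key]:.4f}")  # each value is a float in [0, 1]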

Usage Examples

Evaluate PPO-Trained Summarization Model

import evaluate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load trained model
model = AutoModelForCausalLM.from_pretrained("CarperAI/openai_summarize_tldr_ppo")
tokenizer = AutoTokenizer.from_pretrained("CarperAI/openai_summarize_tldr_ppo")
model.eval()

# Load test dataset
dataset = load_dataset("CarperAI/openai_summarize_tldr", split="test[:100]")

# Generate summaries (prompts in this dataset already end with "TL;DR:")
predictions = []
for sample in dataset:
    input_ids = tokenizer(sample["prompt"], return_tensors="pt").input_ids
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=50)
    # Decode only the newly generated tokens, not the prompt
    generated = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
    predictions.append(generated)

references = [sample["label"] for sample in dataset]

# Compute ROUGE
rouge = evaluate.load("rouge")
results = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-1: {results['rouge1']:.4f}")
print(f"ROUGE-2: {results['rouge2']:.4f}")
print(f"ROUGE-L: {results['rougeL']:.4f}")

Combined ROUGE + Reward Model Evaluation

import pandas as pd

# Compute both ROUGE and reward scores.
# reward_fn is assumed to be defined elsewhere (e.g. the reward-model scoring
# function from the trlx inference script); it maps a list of texts to a list
# of scalar reward scores.
rouge_results = rouge.compute(predictions=predictions, references=references)
pred_rewards = reward_fn(predictions)
ref_rewards = reward_fn(references)

# Log results
results_df = pd.DataFrame({
    "prediction": predictions,
    "reference": references,
    "pred_reward": pred_rewards,
    "ref_reward": ref_rewards,
})
results_df.to_csv("ppo_with_reward_scores.csv")

print(f"ROUGE-1: {rouge_results['rouge1']:.4f}")
print(f"Mean pred reward: {sum(pred_rewards)/len(pred_rewards):.4f}")
print(f"Mean ref reward: {sum(ref_rewards)/len(ref_rewards):.4f}")

Related Pages

Implements Principle

Requires Environment
