Principle: CarperAI Trlx ROUGE Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, NLP, Summarization |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
An evaluation principle for measuring the quality of generated text summaries using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics.
Description
ROUGE is the standard automatic metric for evaluating text summarization quality. It measures the overlap between generated summaries and reference summaries using n-gram matching, longest common subsequence, and other text overlap measures. In the RLHF summarization pipeline, ROUGE scores are used alongside reward model scores to evaluate the PPO-trained model against ground truth summaries.
ROUGE provides multiple variants: ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), ROUGE-L (longest common subsequence), and ROUGE-Lsum (summary-level LCS). These capture different aspects of summary quality from lexical overlap to structural similarity.
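The n-gram variants can be sketched with nothing but standard-library counting. This is a minimal illustration of ROUGE-N recall, not the official `rouge-score` implementation (which adds stemming, tokenization rules, and precision/F-measure reporting); the function names here are ours.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter (multiset) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """ROUGE-N recall: clipped matched reference n-grams / total reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # Counter & Counter clips to min counts
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, 1))  # ROUGE-1: 5 of 6 ref unigrams match
print(rouge_n_recall(candidate, reference, 2))  # ROUGE-2: 3 of 5 ref bigrams match
```

Note the clipping via `Counter` intersection: a candidate that repeats a reference word cannot earn more credit than the reference contains.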
Usage
Use ROUGE evaluation when assessing text summarization models after training. ROUGE provides a complementary signal to reward model scores: ROUGE measures factual overlap with reference summaries while reward models measure overall quality. The combination helps detect reward hacking (high reward but low ROUGE would indicate summaries that game the reward model without being genuinely good).
Theoretical Basis
ROUGE-N measures n-gram recall against the reference summaries:

$$\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \{\mathrm{RefSummaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}{\sum_{S \in \{\mathrm{RefSummaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)}$$
ROUGE-L uses the longest common subsequence (LCS) between the reference $X$ (length $m$) and the candidate $Y$ (length $n$), combined into an F-measure:

$$F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

where $R_{lcs} = \mathrm{LCS}(X, Y)/m$ (recall) and $P_{lcs} = \mathrm{LCS}(X, Y)/n$ (precision).
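The recall/precision definitions above can be realized with a standard dynamic-programming LCS. This is a minimal sketch with assumed function names, using whitespace tokenization rather than the official tokenizer:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of token lists x and y (DP table)."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(candidate, reference, beta=1.0):
    """Return (recall, precision, F) for ROUGE-L; beta=1 weighs R and P equally."""
    x = reference.lower().split()   # reference X of length m
    y = candidate.lower().split()   # candidate Y of length n
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    r = lcs / len(x)                # R_lcs = LCS(X, Y) / m
    p = lcs / len(y)                # P_lcs = LCS(X, Y) / n
    f = (1 + beta**2) * r * p / (r + beta**2 * p)
    return r, p, f

# LCS is "the cat on the mat" (5 tokens), so R = 5/6 and P = 5/5
print(rouge_l("the cat on the mat", "the cat sat on the mat"))
```

Because the LCS need not be contiguous, ROUGE-L rewards in-order word choices without requiring exact phrase matches, which is why it is read as a structural rather than purely lexical signal.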
Typical ROUGE scores for RLHF summarization:
- SFT baseline: ROUGE-1 ~30, ROUGE-2 ~8
- PPO-optimized: ROUGE-1 ~32, ROUGE-2 ~9 (modest improvement)