Principle: CarperAI Trlx ROUGE Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, NLP, Summarization |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
An evaluation principle for measuring the quality of generated text summaries using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics.
Description
ROUGE is the standard automatic metric for evaluating text summarization quality. It measures the overlap between generated summaries and reference summaries using n-gram matching, longest common subsequence, and other text overlap measures. In the RLHF summarization pipeline, ROUGE scores are used alongside reward model scores to evaluate the PPO-trained model against ground truth summaries.
ROUGE provides multiple variants: ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), ROUGE-L (longest common subsequence), and ROUGE-Lsum (summary-level LCS). These capture different aspects of summary quality from lexical overlap to structural similarity.
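The n-gram variants can be sketched with nothing but standard-library counting. This is a minimal illustration of ROUGE-N recall, not the official `rouge-score` implementation (which adds stemming, tokenization rules, and precision/F-measure reporting); the function names here are ours.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter (multiset) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """ROUGE-N recall: clipped matched reference n-grams / total reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # Counter & Counter clips to min counts
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, 1))  # ROUGE-1: 5 of 6 ref unigrams match
print(rouge_n_recall(candidate, reference, 2))  # ROUGE-2: 3 of 5 ref bigrams match
```

Note the clipping via `Counter` intersection: a candidate that repeats a reference word cannot earn more credit than the reference contains.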
Usage
Use ROUGE evaluation when assessing text summarization models after training. ROUGE provides a complementary signal to reward model scores: ROUGE measures factual overlap with reference summaries while reward models measure overall quality. The combination helps detect reward hacking (high reward but low ROUGE would indicate summaries that game the reward model without being genuinely good).
Theoretical Basis
ROUGE-N measures n-gram recall against the reference summaries:

$$\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \{\mathrm{RefSummaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}{\sum_{S \in \{\mathrm{RefSummaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)}$$
ROUGE-L uses the longest common subsequence (LCS) between the reference $X$ (length $m$) and the candidate $Y$ (length $n$), combined into an F-measure:

$$F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

where $R_{lcs} = \mathrm{LCS}(X, Y)/m$ (recall) and $P_{lcs} = \mathrm{LCS}(X, Y)/n$ (precision).
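The recall/precision definitions above can be realized with a standard dynamic-programming LCS. This is a minimal sketch with assumed function names, using whitespace tokenization rather than the official tokenizer:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of token lists x and y (DP table)."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(candidate, reference, beta=1.0):
    """Return (recall, precision, F) for ROUGE-L; beta=1 weighs R and P equally."""
    x = reference.lower().split()   # reference X of length m
    y = candidate.lower().split()   # candidate Y of length n
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    r = lcs / len(x)                # R_lcs = LCS(X, Y) / m
    p = lcs / len(y)                # P_lcs = LCS(X, Y) / n
    f = (1 + beta**2) * r * p / (r + beta**2 * p)
    return r, p, f

# LCS is "the cat on the mat" (5 tokens), so R = 5/6 and P = 5/5
print(rouge_l("the cat on the mat", "the cat sat on the mat"))
```

Because the LCS need not be contiguous, ROUGE-L rewards in-order word choices without requiring exact phrase matches, which is why it is read as a structural rather than purely lexical signal.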
Typical ROUGE scores for RLHF summarization:
- SFT baseline: ROUGE-1 ~30, ROUGE-2 ~8
- PPO-optimized: ROUGE-1 ~32, ROUGE-2 ~9 (modest improvement)