Principle:CarperAI Trlx ROUGE Evaluation

From Leeroopedia


Knowledge Sources
Domains Evaluation, NLP, Summarization
Last Updated 2026-02-07 16:00 GMT

Overview

An evaluation principle for measuring the quality of generated text summaries using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics.

Description

ROUGE is the standard automatic metric for evaluating text summarization quality. It measures the overlap between generated summaries and reference summaries using n-gram matching, longest common subsequence, and other text overlap measures. In the RLHF summarization pipeline, ROUGE scores are used alongside reward model scores to evaluate the PPO-trained model against ground truth summaries.

ROUGE provides multiple variants: ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), ROUGE-L (longest common subsequence), and ROUGE-Lsum (summary-level LCS). These capture different aspects of summary quality from lexical overlap to structural similarity.

Usage

Use ROUGE evaluation when assessing text summarization models after training. ROUGE provides a complementary signal to reward model scores: ROUGE measures lexical overlap with reference summaries, while reward models estimate overall quality. Reading the two together helps detect reward hacking: high reward combined with low ROUGE suggests summaries that game the reward model without being genuinely good.
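The reward-hacking check described above can be sketched as a simple filter over per-sample scores. This is a minimal illustration, not part of the trlx pipeline: the function name `flag_reward_hacking`, the dictionary keys, and both thresholds are hypothetical and would need to be chosen from the score distributions of an actual evaluation run.

```python
def flag_reward_hacking(samples, reward_min, rouge_max):
    """Return indices of samples with high reward but low ROUGE.

    samples    : list of dicts with hypothetical keys "reward" and "rouge"
    reward_min : reward at or above this value counts as "high" (assumed threshold)
    rouge_max  : ROUGE at or below this value counts as "low" (assumed threshold)
    """
    return [
        i
        for i, s in enumerate(samples)
        if s["reward"] >= reward_min and s["rouge"] <= rouge_max
    ]


# Illustrative scores only; these are not measured results.
scores = [
    {"reward": 4.1, "rouge": 0.31},  # high reward, reasonable overlap
    {"reward": 4.5, "rouge": 0.05},  # high reward, almost no overlap: suspicious
    {"reward": 0.7, "rouge": 0.28},  # low reward: not flagged
]
suspicious = flag_reward_hacking(scores, reward_min=4.0, rouge_max=0.10)
```

Flagged samples are candidates for manual inspection rather than proof of reward hacking, since a good abstractive summary can also have low lexical overlap with its reference.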

Theoretical Basis

ROUGE-N measures n-gram recall:

$$\text{ROUGE-N} = \frac{\sum_{S \in \text{ref}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \text{ref}} \sum_{gram_n \in S} Count(gram_n)}$$
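As a rough illustration of the recall formula above, the following sketch computes ROUGE-N with plain whitespace tokenization and lowercasing. Both preprocessing choices are assumptions made for brevity; production ROUGE implementations typically add their own tokenization and optional stemming, so their scores can differ. The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=1):
    """Matched reference n-grams divided by total reference n-grams."""
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = 0
    total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # Count_match: each reference n-gram matches at most as often
        # as it occurs in the candidate (clipped count).
        matched += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
r1 = rouge_n_recall(candidate, [reference], n=1)  # 5 of 6 unigrams matched
r2 = rouge_n_recall(candidate, [reference], n=2)  # 3 of 5 bigrams matched
```

The values returned here lie in the range 0 to 1.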

ROUGE-L uses the Longest Common Subsequence:

$$\text{ROUGE-L} = \frac{(1 + \beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

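Where $R_{lcs} = \frac{LCS(X, Y)}{|Y|}$ (recall) and $P_{lcs} = \frac{LCS(X, Y)}{|X|}$ (precision). Here $X$ is the generated summary, $Y$ is the reference summary, and $\beta$ weights recall relative to precision ($\beta = 1$ gives the balanced F-measure).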
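The LCS-based score can be sketched in a few lines using the standard dynamic-programming recurrence for longest common subsequence length. The sketch treats X as the generated summary and Y as the reference, consistent with recall being normalized by |Y|. Whitespace tokenization, lowercasing, and the function names are simplifying assumptions, so results may not match a full ROUGE implementation exactly.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure between a candidate (X) and a reference (Y)."""
    x = candidate.lower().split()
    y = reference.lower().split()
    if not x or not y:
        return 0.0
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0
    recall = lcs / len(y)     # R_lcs = LCS(X, Y) / |Y|
    precision = lcs / len(x)  # P_lcs = LCS(X, Y) / |X|
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)


# LCS here is "the cat on the mat" (5 tokens); both texts have 6 tokens.
score = rouge_l("the cat sat on the mat", "the cat is on the mat")
```

Unlike ROUGE-2, the LCS does not require matched words to be adjacent, only to appear in the same order, which is why it is often described as capturing sentence-level structure.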

Typical ROUGE scores for RLHF summarization (approximate, on a 0-100 scale):

  • SFT baseline: ROUGE-1 ~30, ROUGE-2 ~8
  • PPO-optimized: ROUGE-1 ~32, ROUGE-2 ~9 (modest improvement)

Related Pages

Implemented By
