Implementation:Microsoft BIPIA RougeRecall

From Leeroopedia
| Field | Value |
|---|---|
| Sources | Repo: Microsoft BIPIA, Doc: rouge-score |
| Domains | NLP, Evaluation, Metrics |
| Last Updated | 2026-02-14 |

Overview

RougeRecall is a concrete tool from the BIPIA benchmark library for computing ROUGE recall metrics during capability evaluation. It wraps the rouge_score library.

Description

RougeRecall extends Hugging Face's evaluate.Metric to compute ROUGE recall scores. Unlike the standard Hugging Face rouge metric, which reports the F-measure (F1), this implementation extracts the .recall attribute from each score object. It supports rouge1, rouge2, rougeL, and rougeLsum, with optional stemming and an optional custom tokenizer.
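To make the recall versus F1 distinction concrete, the following is a minimal pure-Python sketch of ROUGE-N scoring. It is not the rouge_score implementation: the helper names `tokenize`, `ngrams`, and `rouge_n` are hypothetical, and the tokenizer only approximates rouge_score's default behaviour (lowercasing, alphanumeric tokens, no stemming).

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and keep alphanumeric runs only, which
    # approximates the default rouge_score tokenizer without stemming.
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    # Multiset of the n-grams found in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    pred = ngrams(tokenize(prediction), n)
    ref = ngrams(tokenize(reference), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": f1}


scores = rouge_n("The cat sat on the mat.", "The cat was sitting on the mat.")
print(round(scores["recall"], 3))    # 0.714 (5 of 7 reference unigrams matched)
print(round(scores["fmeasure"], 3))  # 0.769
```

Recall divides the overlap by the number of reference n-grams, so it rewards covering the reference and does not penalise extra words in the prediction the way precision and F1 do.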

This is a Wrapper Doc around the rouge_score library. The class overrides the _compute method to iterate over prediction-reference pairs, invoke the underlying rouge_scorer.RougeScorer, and keep only the recall component of each returned Score namedtuple. When aggregation is enabled, a BootstrapAggregator estimates a confidence interval and the metric reports its mid-point recall for each ROUGE type; otherwise, per-sample recall scores are returned as lists.
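The difference between the aggregated and per-sample outputs can be sketched with a small bootstrap routine. This is an illustration only, using the standard library rather than rouge_score's BootstrapAggregator; the function `bootstrap_interval`, its parameters, and the fixed seed are assumptions made for the example.

```python
import random


def bootstrap_interval(per_sample, n_resamples=1000, seed=0):
    # Hypothetical helper: resample the scores with replacement, average
    # each resample, then read percentiles off the sorted resample means.
    rng = random.Random(seed)
    n = len(per_sample)
    means = sorted(
        sum(rng.choice(per_sample) for _ in range(n)) / n
        for _ in range(n_resamples)
    )

    def pick(q):
        # Nearest-rank percentile, clamped to the last index.
        return means[min(int(q * n_resamples), n_resamples - 1)]

    return {"low": pick(0.025), "mid": pick(0.5), "high": pick(0.975)}


recalls = [5 / 7, 7 / 9]       # per-sample recall values for two pairs
interval = bootstrap_interval(recalls)
aggregated = interval["mid"]   # analogous to the single float per ROUGE type
per_sample = recalls           # analogous to the list returned without aggregation
print(aggregated, per_sample)
```

With only a handful of samples the mid-point can land on any of a few resample means, so aggregated values are most meaningful on larger evaluation sets.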

Usage

Use RougeRecall during capability evaluation to measure response quality on clean data. Because the metric is not published to the Hugging Face evaluate hub, it must be loaded from a local file or registered manually as an evaluate.Metric.

Code Reference

| Property | Value |
|---|---|
| Source | BIPIA repository |
| File | examples/rouge.py |
| Lines | L77-153 |
| Signature | RougeRecall._compute(self, predictions, references, rouge_types=None, use_aggregator=True, use_stemmer=False, tokenizer=None) -> dict |
| Import | Custom -- must load from local file or register as evaluate metric |

I/O Contract

Inputs

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| predictions | List[str] | Yes | -- | Model-generated responses to evaluate |
| references | List[str] or List[List[str]] | Yes | -- | Reference (ideal) responses; supports multiple references per sample |
| rouge_types | List[str] | No | None (resolves to ["rouge1", "rouge2", "rougeL", "rougeLsum"]) | Which ROUGE variants to compute |
| use_aggregator | bool | No | True | Whether to aggregate scores via bootstrap sampling |
| use_stemmer | bool | No | False | Whether to apply the Porter stemmer before computing overlap |
| tokenizer | Callable or None | No | None | Optional custom tokenizer; when None, the rouge_score default tokenizer is used |

Outputs

| Key | Type | Description |
|---|---|---|
| rouge1 | float (aggregated) or List[float] (per-sample) | ROUGE-1 recall score in [0, 1] |
| rouge2 | float (aggregated) or List[float] (per-sample) | ROUGE-2 recall score in [0, 1] |
| rougeL | float (aggregated) or List[float] (per-sample) | ROUGE-L recall score in [0, 1] |
| rougeLsum | float (aggregated) or List[float] (per-sample) | ROUGE-Lsum recall score in [0, 1] |
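The rougeL key is based on the longest common subsequence (LCS) between prediction and reference rather than on fixed-size n-grams. A minimal sketch of LCS-based recall follows; the helper names are hypothetical, the tokenizer is a simplified assumption (lowercased alphanumeric tokens, no stemming), and the sentence-splitting behaviour of rougeLsum is not modelled.

```python
import re


def tokenize(text):
    # Assumption: simplified tokenizer, lowercased alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    # Dynamic-programming LCS length, keeping only the previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(prediction, reference):
    # Recall = LCS length divided by the number of reference tokens.
    ref = tokenize(reference)
    if not ref:
        return 0.0
    return lcs_length(tokenize(prediction), ref) / len(ref)


print(round(rouge_l_recall(
    "The cat sat on the mat.",
    "The cat was sitting on the mat.",
), 3))  # 0.714 (LCS "the cat on the mat" covers 5 of 7 reference tokens)
```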

Usage Examples

import evaluate

# Load the RougeRecall metric from the local file
rouge_recall = evaluate.load("examples/rouge.py")

# Example predictions and references
predictions = [
    "The cat sat on the mat.",
    "A quick brown fox jumped over the lazy dog."
]
references = [
    "The cat was sitting on the mat.",
    "The quick brown fox jumps over the lazy dog."
]

# Compute aggregated ROUGE recall scores
results = rouge_recall.compute(
    predictions=predictions,
    references=references,
    use_aggregator=True,
    use_stemmer=False
)

# results is a dict with keys: rouge1, rouge2, rougeL, rougeLsum
# Each value is a float in [0, 1] representing the mid-point recall estimate
print(results)
# Exact values can vary between runs because the aggregator uses bootstrap resampling.

# Per-sample scores (no aggregation)
per_sample = rouge_recall.compute(
    predictions=predictions,
    references=references,
    use_aggregator=False
)

# Each value is now a List[float], one score per prediction-reference pair
print(per_sample)
# With the default tokenizer and no stemming, approximately:
# {'rouge1': [0.714, 0.778], 'rouge2': [0.5, 0.625], ...}
