Implementation:Microsoft BIPIA RougeRecall
| Field | Value |
|---|---|
| Sources | Repo: Microsoft BIPIA, Doc: rouge-score |
| Domains | NLP, Evaluation, Metrics |
| Last Updated | 2026-02-14 |
Overview
RougeRecall is a concrete tool from the BIPIA benchmark library that computes ROUGE recall for capability evaluation. It wraps the rouge_score library.
Description
RougeRecall extends HuggingFace's evaluate.Metric to compute ROUGE recall scores. Unlike the standard HuggingFace ROUGE metric, which reports the F-measure, this implementation extracts the .recall attribute from each score object. It supports rouge1, rouge2, rougeL, and rougeLsum, with optional stemming and an optional custom tokenizer.
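To make the recall component concrete, here is a minimal pure-Python sketch of ROUGE-N recall: clipped n-gram matches divided by the number of n-grams in the reference. It is not the BIPIA or rouge_score code. The helpers `tokenize`, `ngrams`, and `rouge_n_recall` are hypothetical names, and the regex tokenizer is only an approximation of rouge_score's default tokenization without stemming.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and keep alphanumeric runs, which approximates
    # the default rouge_score tokenizer when stemming is disabled.
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    # Multiset of n-grams (as tuples) found in the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(prediction, reference, n=1):
    # Recall = clipped n-gram matches / number of n-grams in the reference.
    pred_counts = ngrams(tokenize(prediction), n)
    ref_counts = ngrams(tokenize(reference), n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, pred_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


prediction = "The cat sat on the mat."
reference = "The cat was sitting on the mat."
r1 = rouge_n_recall(prediction, reference, n=1)  # 5 of 7 reference unigrams
r2 = rouge_n_recall(prediction, reference, n=2)  # 3 of 6 reference bigrams
print(round(r1, 3), round(r2, 3))  # 0.714 0.5
```

Recall rewards covering the reference and does not penalize extra words in the prediction, which is why it can differ noticeably from the F-measure for long responses.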
This page documents a wrapper around the rouge_score library. The class overrides the _compute method to iterate over prediction-reference pairs, invoke the underlying rouge_scorer.RougeScorer, and keep only the recall component of each returned Score namedtuple. When aggregation is enabled, a BootstrapAggregator estimates a mid-point with a confidence interval for each ROUGE type, and the mid-point recall is returned as a single float. Otherwise, per-sample recall scores are returned as lists.
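The aggregation branch can be illustrated with a small pure-Python bootstrap. This is a sketch of the general technique, not the rouge_score BootstrapAggregator itself: `bootstrap_interval` is a hypothetical helper, the 1000 resamples and 95% interval are assumed to mirror that library's defaults, and the nearest-rank percentile is a simplification. The input scores are made up for illustration.

```python
import random


def bootstrap_interval(values, n_samples=1000, confidence=0.95, seed=0):
    # Resample the per-sample scores with replacement and record each mean.
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_samples)
    )

    def percentile(q):
        # Nearest-rank percentile over the sorted bootstrap means.
        index = min(len(means) - 1, max(0, round(q * (len(means) - 1))))
        return means[index]

    tail = (1.0 - confidence) / 2.0
    return percentile(tail), percentile(0.5), percentile(1.0 - tail)


per_sample_recall = [0.71, 0.78, 0.64, 0.90, 0.55]  # made-up recall scores
low, mid, high = bootstrap_interval(per_sample_recall)
print(f"low={low:.3f} mid={mid:.3f} high={high:.3f}")
```

Only the `mid` value corresponds to what the aggregated output exposes. Without a fixed seed, resampling makes that value vary slightly from run to run.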
Usage
Import RougeRecall for capability evaluation, where it measures response quality on clean data. Because the metric is not published to the HuggingFace evaluate hub, it must be loaded from a local file or registered manually as an evaluate.Metric.
Code Reference
| Property | Value |
|---|---|
| Source | BIPIA repository |
| File | examples/rouge.py |
| Lines | L77-153 |
| Signature | RougeRecall._compute(self, predictions, references, rouge_types=None, use_aggregator=True, use_stemmer=False, tokenizer=None) -> dict |
| Import | Custom -- must load from local file or register as evaluate metric |
I/O Contract
Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| predictions | List[str] | Yes | -- | Model-generated responses to evaluate |
| references | List[str] or List[List[str]] | Yes | -- | Reference (ideal) responses; supports multiple references per sample |
| rouge_types | List[str] | No | ["rouge1", "rouge2", "rougeL", "rougeLsum"] | Which ROUGE variants to compute |
| use_aggregator | bool | No | True | Whether to aggregate scores via bootstrap sampling |
| use_stemmer | bool | No | False | Whether to apply Porter stemmer before computing overlap |
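When references is a List[List[str]], each prediction is compared against several references. The sketch below shows one plausible selection rule under a stated assumption: that the wrapper delegates to rouge_score's multi-reference scoring, which is understood to keep the reference with the highest F-measure and then report that reference's scores. `unigram_prf` and `multi_ref_recall` are hypothetical helpers covering unigrams only; they are not part of BIPIA.

```python
import re
from collections import Counter


def unigram_prf(prediction, reference):
    # Precision, recall, and F1 over clipped unigram matches.
    pred = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def multi_ref_recall(prediction, references):
    # Assumption: keep the reference with the highest F1, then report its recall.
    best = max((unigram_prf(prediction, ref) for ref in references), key=lambda s: s[2])
    return best[1]


refs = ["The cat was sitting on the mat.", "The cat sat on a mat."]
best_recall = multi_ref_recall("The cat sat on the mat.", refs)
print(round(best_recall, 3))  # 0.833 (second reference wins with F1 = 5/6)
```

Under this rule the reported recall is not necessarily the maximum recall over all references, since the winner is chosen by F-measure.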
Outputs
| Key | Type | Description |
|---|---|---|
| rouge1 | float (aggregated) or List[float] (per-sample) | ROUGE-1 recall score (0-1) |
| rouge2 | float (aggregated) or List[float] (per-sample) | ROUGE-2 recall score (0-1) |
| rougeL | float (aggregated) or List[float] (per-sample) | ROUGE-L recall score (0-1) |
| rougeLsum | float (aggregated) or List[float] (per-sample) | ROUGE-Lsum recall score (0-1) |
Usage Examples
import evaluate
# Load the RougeRecall metric from the local file
# (the path is relative to the BIPIA repository root)
rouge_recall = evaluate.load("examples/rouge.py")
# Example predictions and references
predictions = [
    "The cat sat on the mat.",
    "A quick brown fox jumped over the lazy dog.",
]
references = [
    "The cat was sitting on the mat.",
    "The quick brown fox jumps over the lazy dog.",
]
# Compute aggregated ROUGE recall scores
results = rouge_recall.compute(
    predictions=predictions,
    references=references,
    use_aggregator=True,
    use_stemmer=False,
)
# results is a dict with keys: rouge1, rouge2, rougeL, rougeLsum
# Each value is a float in [0, 1] representing the mid-point recall estimate
# Exact values can vary slightly between runs because of bootstrap resampling
print(results)
# Per-sample scores (no aggregation)
per_sample = rouge_recall.compute(
    predictions=predictions,
    references=references,
    use_aggregator=False,
)
# Each value is now a List[float], one score per prediction-reference pair
print(per_sample)
# Hand-computed for these inputs without stemming:
# rouge1 is 5/7 and 7/9 (roughly [0.714, 0.778]); rouge2 is 3/6 and 5/8 ([0.5, 0.625])