Implementation:Microsoft BIPIA RougeRecall

Field	Value
Sources	Repo: Microsoft BIPIA, Doc: rouge-score
Domains	NLP, Evaluation, Metrics
Last Updated	2026-02-14

Overview

A concrete tool from the BIPIA benchmark library for computing ROUGE recall metrics during capability evaluation. It wraps the rouge_score library.

Description

RougeRecall extends HuggingFace's evaluate.Metric to compute ROUGE recall scores. Unlike the standard HuggingFace ROUGE metric, which reports the F-measure (F1), this implementation extracts the .recall attribute from each score object. It supports rouge1, rouge2, rougeL, and rougeLsum, with optional stemming and an optional custom tokenizer.
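To make the recall-versus-F1 distinction concrete, here is a minimal pure-Python sketch of ROUGE-N scoring. It is not the BIPIA or rouge_score code: the helper names `tokenize`, `ngrams`, and `rouge_n` are hypothetical, and the regex tokenizer only approximates the library's default (lowercase, alphanumeric runs, no stemming).

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: approximates the default tokenizer (lowercase, alphanumeric runs only).
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    # Multiset of n-grams so repeated n-grams are counted correctly.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    pred = ngrams(tokenize(prediction), n)
    ref = ngrams(tokenize(reference), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)  # share of the reference that was recovered
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": f1}


scores = rouge_n("The cat sat on the mat.", "The cat was sitting on the mat.")
print(scores["recall"], scores["fmeasure"])
```

Recall divides the overlap by the reference length only, so a long response is not penalized for extra tokens the way it would be under F1.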

This is a Wrapper Doc around the rouge_score library. The class overrides the _compute method to iterate over prediction-reference pairs, invoke the underlying rouge_scorer.RougeScorer, and collect only the recall component of each returned Score namedtuple. When aggregation is enabled, a BootstrapAggregator estimates confidence intervals and the mid-point recall estimate is reported for each ROUGE type; otherwise, per-sample recall scores are returned as lists.
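The two return modes can be sketched without the library. The code below is a simplified stand-in for bootstrap aggregation, not rouge_score's BootstrapAggregator itself: the function names `bootstrap_recall` and `summarize` are hypothetical, and the resample count, confidence level, and fixed seed are assumptions chosen for illustration.

```python
import random
from statistics import mean


def bootstrap_recall(per_sample, n_samples=1000, confidence=0.95, seed=0):
    # Resample the per-sample recalls with replacement and record each resample's mean.
    rng = random.Random(seed)
    size = len(per_sample)
    means = sorted(mean(rng.choices(per_sample, k=size)) for _ in range(n_samples))
    tail = (1.0 - confidence) / 2.0

    def pick(q):
        # Simple percentile lookup on the sorted bootstrap means.
        return means[min(int(q * n_samples), n_samples - 1)]

    return {"low": pick(tail), "mid": pick(0.5), "high": pick(1.0 - tail)}


def summarize(per_sample_by_type, use_aggregator=True):
    # Aggregated mode: one mid-point float per ROUGE type.
    # Per-sample mode: the raw list of recall values per ROUGE type.
    if not use_aggregator:
        return {key: list(values) for key, values in per_sample_by_type.items()}
    return {key: bootstrap_recall(values)["mid"] for key, values in per_sample_by_type.items()}


recalls = {"rouge1": [5 / 7, 7 / 9]}
print(summarize(recalls))
print(summarize(recalls, use_aggregator=False))
```

Only the mid-point is surfaced in aggregated mode here; the low and high bounds are computed but discarded, which matches the single-float output described for this metric.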

Usage

Use this metric in capability evaluation to measure response quality on clean data. Because the metric is not published to the HuggingFace evaluate hub, it must be loaded from a local file or registered manually as an evaluate.Metric.

Code Reference

Property	Value
Source	BIPIA repository
File	`examples/rouge.py`
Lines	L77-153
Signature	`RougeRecall._compute(self, predictions, references, rouge_types=None, use_aggregator=True, use_stemmer=False, tokenizer=None) -> dict`
Import	Custom -- must load from local file or register as evaluate metric

I/O Contract

Inputs

Parameter	Type	Required	Default	Description
`predictions`	`List[str]`	Yes	--	Model-generated responses to evaluate
`references`	`List[str]` or `List[List[str]]`	Yes	--	Reference (ideal) responses; supports multiple references per sample
`rouge_types`	`List[str]`	No	`None` (falls back to `["rouge1", "rouge2", "rougeL", "rougeLsum"]`)	Which ROUGE variants to compute
`use_aggregator`	`bool`	No	`True`	Whether to aggregate scores via bootstrap sampling
`use_stemmer`	`bool`	No	`False`	Whether to apply Porter stemmer before computing overlap
`tokenizer`	`Callable`	No	`None`	Custom tokenizer; when `None`, the default rouge_score tokenizer is used
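The signature also accepts a `tokenizer` argument. rouge_score's RougeScorer expects a tokenizer object exposing a `tokenize(text)` method, so a plain function usually needs a thin adapter. The sketch below shows one way to do that; `TokenizerAdapter` and `whitespace_tokenizer` are hypothetical names, and whether RougeRecall performs this wrapping internally is an assumption not confirmed by this page.

```python
class TokenizerAdapter:
    """Wraps a plain callable so it exposes a tokenize(text) method."""

    def __init__(self, tokenizer_func):
        self.tokenizer_func = tokenizer_func

    def tokenize(self, text):
        return self.tokenizer_func(text)


def whitespace_tokenizer(text):
    # Hypothetical custom tokenizer: lowercase and split on whitespace, keeping punctuation.
    return text.lower().split()


adapter = TokenizerAdapter(whitespace_tokenizer)
tokens = adapter.tokenize("The cat sat")
print(tokens)
```

A tokenizer that keeps punctuation attached to words, as this one does, will generally produce different overlap counts from the default, so scores computed with different tokenizers are not directly comparable.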

Outputs

Key	Type	Description
`rouge1`	`float` (aggregated) or `List[float]` (per-sample)	ROUGE-1 recall score (0-1)
`rouge2`	`float` (aggregated) or `List[float]` (per-sample)	ROUGE-2 recall score (0-1)
`rougeL`	`float` (aggregated) or `List[float]` (per-sample)	ROUGE-L recall score (0-1)
`rougeLsum`	`float` (aggregated) or `List[float]` (per-sample)	ROUGE-Lsum recall score (0-1)

Usage Examples

import evaluate

# Load the RougeRecall metric from the local file
rouge_recall = evaluate.load("examples/rouge.py")

# Example predictions and references
predictions = [
    "The cat sat on the mat.",
    "A quick brown fox jumped over the lazy dog."
]
references = [
    "The cat was sitting on the mat.",
    "The quick brown fox jumps over the lazy dog."
]

# Compute aggregated ROUGE recall scores
results = rouge_recall.compute(
    predictions=predictions,
    references=references,
    use_aggregator=True,
    use_stemmer=False
)

# results is a dict with keys: rouge1, rouge2, rougeL, rougeLsum
# Each value is a float in [0, 1]: the mid-point recall estimate from bootstrap
# resampling, so exact values can vary slightly between runs
print(results)

# Per-sample scores (no aggregation)
per_sample = rouge_recall.compute(
    predictions=predictions,
    references=references,
    use_aggregator=False
)

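# Each value is now a List[float], one score per prediction-reference pair
print(per_sample)
# With the default tokenizer and no stemming, rouge1 recall works out to about
# [0.714, 0.778] (5 of 7 and 7 of 9 reference unigrams matched) and rouge2
# recall to about [0.5, 0.625]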

Related Pages

Principle:Microsoft_BIPIA_Results_Analysis
