Implementation:Vibrantlabsai Ragas RougeScoreV2
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
RougeScore is a class-based v2 metric that calculates ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores between reference and response texts using the rouge_score library, with configurable ROUGE type and scoring mode.
Description
The RougeScore metric provides a modern, class-based implementation of ROUGE scoring for evaluating text summarization and generation quality. It inherits from BaseMetric and does not require any LLM or embedding model.
The metric supports two ROUGE types:
- rouge1 -- Measures unigram (single word) overlap between the reference and response. This captures word-level content similarity.
- rougeL -- Measures the Longest Common Subsequence (LCS) between the reference and response. This captures sentence-level structure similarity and is the default.
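The difference between the two types shows up most clearly when the same words appear in a different order. The following is a minimal sketch of the two match counts, not the rouge_score implementation: the `tokenize`, `unigram_matches`, and `lcs_length` helpers are hypothetical names, tokenization is simplified to lowercase alphanumeric runs, and no stemming is applied.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Simplified tokenizer (assumption): lowercase, keep alphanumeric runs, no stemming.
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_matches(reference: list[str], response: list[str]) -> int:
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    return sum((Counter(reference) & Counter(response)).values())


def lcs_length(reference: list[str], response: list[str]) -> int:
    # Longest common subsequence via dynamic programming, one row at a time.
    previous = [0] * (len(response) + 1)
    for ref_token in reference:
        current = [0]
        for j, resp_token in enumerate(response, start=1):
            if ref_token == resp_token:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


reference_tokens = tokenize("The capital of France is Paris.")
response_tokens = tokenize("Paris is the capital of France.")

rouge1_matches = unigram_matches(reference_tokens, response_tokens)
rougeL_matches = lcs_length(reference_tokens, response_tokens)
print(rouge1_matches, rougeL_matches)  # prints "6 4"
```

All six words match as unigrams, but only "the capital of france" survives as an in-order subsequence, so the LCS-based count is lower for the reordered response.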
The metric also supports three scoring modes:
- fmeasure (default) -- The harmonic mean of precision and recall, providing a balanced measure.
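- precision -- The fraction of the response that is matched in the reference: matched unigrams (rouge1) or LCS length (rougeL), divided by the number of response tokens.
- recall -- The fraction of the reference that is matched in the response: the same match count, divided by the number of reference tokens.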
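The three modes are related by simple arithmetic on a single match count. The sketch below assumes the match count and token lengths are already known. `Score` and `score_from_matches` are hypothetical names chosen for illustration, not part of Ragas or rouge_score.

```python
from typing import NamedTuple


class Score(NamedTuple):
    precision: float
    recall: float
    fmeasure: float


def score_from_matches(matches: int, response_len: int, reference_len: int) -> Score:
    # Precision normalizes by the response length, recall by the reference length.
    precision = matches / response_len if response_len else 0.0
    recall = matches / reference_len if reference_len else 0.0
    if precision + recall == 0:
        return Score(0.0, 0.0, 0.0)
    # F-measure is the harmonic mean of precision and recall.
    fmeasure = 2 * precision * recall / (precision + recall)
    return Score(precision, recall, fmeasure)


# A 3-token response whose tokens all occur in a 9-token reference.
score = score_from_matches(matches=3, response_len=3, reference_len=9)
print(score)
```

Here precision is 1.0 because every response token is matched, recall is 1/3 because only a third of the reference is covered, and the F-measure of 0.5 sits between the two. A short response can therefore score perfectly in precision mode while scoring poorly in recall mode.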
Internally, the metric uses Google's rouge_score library with stemming enabled (use_stemmer=True) to normalize word forms before comparison.
The metric returns a MetricResult object with the score as a float between 0.0 and 1.0.
Usage
Use RougeScore as a standard text overlap metric for evaluating summarization or text generation quality. It is particularly appropriate for measuring recall-oriented content overlap. The rouge_score library must be installed separately (pip install rouge_score). Choose rouge1 for word-level overlap and rougeL for structure-preserving overlap.
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/collections/_rouge_score.py
Signature
class RougeScore(BaseMetric):
    def __init__(
        self,
        name: str = "rouge_score",
        rouge_type: t.Literal["rouge1", "rougeL"] = "rougeL",
        mode: t.Literal["fmeasure", "precision", "recall"] = "fmeasure",
        **kwargs,
    ):
Import
from ragas.metrics.collections import RougeScore
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| reference | str | Yes | The reference/ground truth text |
| response | str | Yes | The response text to evaluate against the reference |
| rouge_type | Literal["rouge1", "rougeL"] | No | ROUGE variant to use, set in the constructor (default: "rougeL") |
| mode | Literal["fmeasure", "precision", "recall"] | No | Scoring mode, set in the constructor (default: "fmeasure") |
Outputs
| Name | Type | Description |
|---|---|---|
| result | MetricResult | A MetricResult object with a value attribute containing the ROUGE score between 0.0 and 1.0 |
Usage Examples
Basic Usage
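import asyncio
from ragas.metrics.collections import RougeScore
async def main():
    # Defaults: rouge_type="rougeL", mode="fmeasure"
    metric = RougeScore()
    result = await metric.ascore(
        reference="The capital of France is Paris.",
        response="Paris is the capital of France.",
    )
    print(f"ROUGE-L F-measure: {result.value}")
# ascore is a coroutine, so a plain script needs an event loop to run it
asyncio.run(main())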
Using ROUGE-1 with Recall Mode
import asyncio
from ragas.metrics.collections import RougeScore
async def main():
    # Unigram overlap, normalized by the reference length
    metric = RougeScore(rouge_type="rouge1", mode="recall")
    result = await metric.ascore(
        reference="The quick brown fox jumps over the lazy dog.",
        response="A quick brown fox leaps over the lazy dog.",
    )
    print(f"ROUGE-1 Recall: {result.value}")
asyncio.run(main())
Batch Evaluation
import asyncio
from ragas.metrics.collections import RougeScore
async def main():
    metric = RougeScore(rouge_type="rougeL", mode="fmeasure")
    # Each dict supplies the keyword arguments for one sample
    results = await metric.abatch_score([
        {"reference": "The cat sat on the mat.", "response": "A cat was on a mat."},
        {"reference": "It is raining outside.", "response": "It is raining outside."},
    ])
    for i, result in enumerate(results):
        print(f"Sample {i}: ROUGE-L = {result.value}")
asyncio.run(main())