Implementation:Vibrantlabsai Ragas RougeScore
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
RougeScore is a non-LLM metric that computes ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores between a reference text and a generated response.
Description
This metric provides a lightweight, deterministic evaluation of text overlap between a reference and a generated response using the ROUGE scoring family. It does not require an LLM and instead relies on the rouge_score Python package with stemming enabled.
The metric supports two ROUGE variants:
- rouge1: Measures unigram (single word) overlap between the reference and response.
- rougeL: Measures the longest common subsequence (LCS) between the reference and response. Matched tokens must appear in the same order but need not be contiguous, so the score reflects sentence-level word order.
For each variant, one of three scoring modes can be selected:
- fmeasure (default): The harmonic mean of precision and recall, providing a balanced measure.
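- precision: The matched overlap (unigram matches for rouge1, LCS length for rougeL) divided by the number of response tokens, i.e. the fraction of the response that is supported by the reference.
- recall: The same matched overlap divided by the number of reference tokens, i.e. the fraction of the reference that the response covers.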
The metric uses stemming to normalize words before comparison, which helps match inflected forms (e.g., "running" and "run").
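The variants and modes described above can be illustrated with a minimal pure-Python sketch. It is not the rouge_score implementation: the helper names (`tokenize`, `lcs_length`, `rouge_sketch`) are hypothetical, tokenization is assumed to be a simple lowercase alphanumeric split, and stemming is deliberately omitted.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs (assumption: no stemming here)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_sketch(reference: str, response: str, rouge_type: str = "rougeL") -> dict[str, float]:
    """Return precision, recall, and fmeasure for one ROUGE variant (illustrative only)."""
    ref, resp = tokenize(reference), tokenize(response)
    if not ref or not resp:
        return {"precision": 0.0, "recall": 0.0, "fmeasure": 0.0}

    if rouge_type == "rouge1":
        # Clipped unigram overlap: a token counts at most as often as it
        # occurs in both texts.
        overlap = sum((Counter(ref) & Counter(resp)).values())
    elif rouge_type == "rougeL":
        overlap = lcs_length(ref, resp)
    else:
        raise ValueError(f"unsupported rouge_type: {rouge_type!r}")

    precision = overlap / len(resp)
    recall = overlap / len(ref)
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}
```

For the reference "The capital of France is Paris." and the response "Paris is the capital city of France.", this sketch gives a rouge1 recall of 1.0, because every reference token appears in the response, and a rougeL recall of 4/6, because the longest in-order match is "the capital of france".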
Usage
Use this metric for fast, deterministic evaluation of text generation quality without requiring LLM inference. It is particularly useful for summarization tasks, as a baseline metric alongside LLM-based evaluations, or when LLM API costs are a concern. Note that the rouge_score package is not bundled with the metric and must be installed separately (for example, with pip install rouge_score).
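Because the dependency is installed separately, it can help to check for it before building an evaluation run. The following is a minimal sketch using only the standard library; the helper name `rouge_backend_available` is hypothetical and not part of Ragas.

```python
import importlib.util


def rouge_backend_available(module_name: str = "rouge_score") -> bool:
    """Return True if the named top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


if not rouge_backend_available():
    # Assumption: the caller prefers a clear message over a later import failure.
    print("rouge_score is not installed; RougeScore will not be usable.")
```

Checking up front keeps a missing optional dependency from surfacing only after a long evaluation has already started.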
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/_rouge_score.py
Signature
@dataclass
class RougeScore(SingleTurnMetric):
    name: str = "rouge_score"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {MetricType.SINGLE_TURN: {"reference", "response"}}
    )
    rouge_type: t.Literal["rouge1", "rougeL"] = "rougeL"
    mode: t.Literal["fmeasure", "precision", "recall"] = "fmeasure"
Import
from ragas.metrics import RougeScore
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| reference | str | Yes | The ground truth reference text to compare against |
| response | str | Yes | The AI-generated response text to evaluate |
Configuration
| Name | Type | Default | Description |
|---|---|---|---|
| rouge_type | Literal["rouge1", "rougeL"] | "rougeL" | The type of ROUGE score to compute (unigram overlap or longest common subsequence) |
| mode | Literal["fmeasure", "precision", "recall"] | "fmeasure" | The scoring mode: F1, precision, or recall |
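One way to picture how the two configuration fields combine: `rouge_type` picks a variant and `mode` picks one field of that variant's score triple. The sketch below assumes a per-variant result shaped like a named tuple with `precision`, `recall`, and `fmeasure` fields; the `Score` type and the `select_score` helper are hypothetical stand-ins, not the metric's actual code.

```python
from collections import namedtuple

# Hypothetical stand-in for a per-variant score triple.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])


def select_score(scores: dict, rouge_type: str = "rougeL", mode: str = "fmeasure") -> float:
    """Pick a single float from precomputed scores using the two configuration values."""
    if rouge_type not in scores:
        raise ValueError(f"unknown rouge_type: {rouge_type!r}")
    if mode not in Score._fields:
        raise ValueError(f"unknown mode: {mode!r}")
    return getattr(scores[rouge_type], mode)


# Made-up numbers, used only to show the selection.
example_scores = {
    "rouge1": Score(precision=0.5, recall=1.0, fmeasure=2 / 3),
    "rougeL": Score(precision=0.25, recall=0.5, fmeasure=1 / 3),
}
default_value = select_score(example_scores)                   # rougeL fmeasure
recall_value = select_score(example_scores, "rouge1", "recall")  # rouge1 recall
```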
Outputs
| Name | Type | Description |
|---|---|---|
| score | float | A value between 0.0 and 1.0 representing the ROUGE score; higher is better |
Usage Examples
Basic Usage (Default rougeL F-measure)
from ragas.metrics import RougeScore
from ragas.dataset_schema import SingleTurnSample
metric = RougeScore()
sample = SingleTurnSample(
    reference="The capital of France is Paris.",
    response="Paris is the capital city of France.",
)
# Run inside an async function (or any context where await is allowed):
# score = await metric.single_turn_ascore(sample)
# score is the rougeL fmeasure as a float between 0.0 and 1.0
Using Rouge1 with Recall Mode
from ragas.metrics import RougeScore
from ragas.dataset_schema import SingleTurnSample
metric = RougeScore(rouge_type="rouge1", mode="recall")
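sample = SingleTurnSample(
    reference="The quick brown fox jumps over the lazy dog.",
    response="A quick brown fox jumped over a lazy dog.",
)
# Run inside an async function (or any context where await is allowed):
# score = await metric.single_turn_ascore(sample)
# score is the rouge1 recall; with stemming, "jumps" and "jumped" count as a match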