Implementation:Vibrantlabsai Ragas RougeScore

Knowledge Sources	Vibrantlabsai_Ragas
Domains	Evaluation, Metrics
Last Updated	2026-02-12 00:00 GMT

Overview

RougeScore is a non-LLM metric that computes ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores between a reference text and a generated response.

Description

This metric provides a lightweight, deterministic evaluation of text overlap between a reference and a generated response using the ROUGE scoring family. It does not require an LLM and instead relies on the rouge_score Python package with stemming enabled.

The metric supports two ROUGE variants:

rouge1: Measures unigram (single word) overlap between the reference and response.
rougeL: Measures the longest common subsequence (LCS) between the reference and response, rewarding words that appear in the same order even when they are not contiguous.
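
The difference between the two variants can be sketched in plain Python. This is a simplified illustration, not the rouge_score implementation: the helper names `tokenize`, `unigram_matches`, and `lcs_length` are hypothetical, tokenization is assumed to be a bare lowercase alphanumeric split, and stemming is omitted.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase and keep alphanumeric runs only; no stemming here.
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_matches(reference: list[str], response: list[str]) -> int:
    # rouge1-style overlap: a token counts at most as often as it occurs in both texts.
    overlap = Counter(reference) & Counter(response)
    return sum(overlap.values())


def lcs_length(reference: list[str], response: list[str]) -> int:
    # rougeL-style overlap: dynamic-programming LCS over tokens, one row at a time.
    previous = [0] * (len(response) + 1)
    for ref_token in reference:
        current = [0]
        for j, resp_token in enumerate(response, start=1):
            if ref_token == resp_token:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


reference = tokenize("The capital of France is Paris.")
response = tokenize("Paris is the capital city of France.")
print(unigram_matches(reference, response))  # 6
print(lcs_length(reference, response))  # 4
```

Word order matters for the LCS count but not for the unigram count: all six reference tokens occur in the response, yet only four of them ("the capital of france") occur in the same relative order.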

For each variant, one of three scoring modes can be selected:

fmeasure (default): The harmonic mean of precision and recall, providing a balanced measure.
precision: The number of matched tokens divided by the number of tokens in the response (matched unigrams for rouge1, LCS length for rougeL).
recall: The number of matched tokens divided by the number of tokens in the reference.
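
As a worked illustration of the three modes, the sketch below derives each value from a match count and the two text lengths. The function name `rouge_modes` is hypothetical, and the counts are assumed values for a pair with 4 LCS-matched tokens, 6 reference tokens, and 7 response tokens under a plain unstemmed tokenization, so this is not a statement of what the rouge_score package reports.

```python
def rouge_modes(matches: int, reference_len: int, response_len: int) -> dict[str, float]:
    # Precision is relative to the response, recall to the reference.
    precision = matches / response_len if response_len else 0.0
    recall = matches / reference_len if reference_len else 0.0
    total = precision + recall
    # F-measure is the harmonic mean; defined as 0.0 when nothing matches.
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


scores = rouge_modes(matches=4, reference_len=6, response_len=7)
for mode, value in scores.items():
    print(f"{mode}: {value:.3f}")
# precision: 0.571
# recall: 0.667
# fmeasure: 0.615
```

Because the response is longer than the reference here, precision is lower than recall, and the F-measure falls between the two.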

The metric uses stemming to normalize words before comparison, which helps match inflected forms (e.g., "running" and "run").
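
To make the effect of stemming concrete, here is a deliberately crude sketch. The `toy_stem` function is hypothetical and only strips a few suffixes; it is not the stemmer used by the metric, and real stemmers such as Porter's apply a much larger rule set.

```python
def toy_stem(word: str) -> str:
    # Assumption: a crude suffix stripper for illustration only, NOT a real stemmer.
    word = word.lower()
    if len(word) <= 3:
        return word  # leave short words untouched
    for suffix in ("ing", "ed", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Collapse a doubled final consonant, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


print(toy_stem("running"), toy_stem("run"))  # run run
print(toy_stem("jumps") == toy_stem("jumped"))  # True
```

Once both texts are reduced to stems, inflected forms count as matches in the overlap computation instead of being treated as different tokens.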

Usage

Use this metric for fast, deterministic evaluation of generated text when LLM inference is unnecessary or too costly. It is particularly useful for summarization tasks and as a baseline alongside LLM-based evaluations. The rouge_score package must be installed separately (for example, with pip install rouge_score).

Code Reference

Source Location

Repository: Vibrantlabsai_Ragas
File: src/ragas/metrics/_rouge_score.py

Signature

@dataclass
class RougeScore(SingleTurnMetric):
    name: str = "rouge_score"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {MetricType.SINGLE_TURN: {"reference", "response"}}
    )
    rouge_type: t.Literal["rouge1", "rougeL"] = "rougeL"
    mode: t.Literal["fmeasure", "precision", "recall"] = "fmeasure"

Import

from ragas.metrics import RougeScore

I/O Contract

Inputs

Name	Type	Required	Description
reference	str	Yes	The ground truth reference text to compare against
response	str	Yes	The AI-generated response text to evaluate

Configuration

Name	Type	Default	Description
rouge_type	Literal["rouge1", "rougeL"]	"rougeL"	The type of ROUGE score to compute (unigram overlap or longest common subsequence)
mode	Literal["fmeasure", "precision", "recall"]	"fmeasure"	The scoring mode: F1, precision, or recall

Outputs

Name	Type	Description
score	float	A value between 0.0 and 1.0 representing the ROUGE score; higher is better

Usage Examples

Basic Usage (Default rougeL F-measure)

import asyncio

from ragas.metrics import RougeScore
from ragas.dataset_schema import SingleTurnSample

# Defaults: rouge_type="rougeL", mode="fmeasure"
metric = RougeScore()

sample = SingleTurnSample(
    reference="The capital of France is Paris.",
    response="Paris is the capital city of France.",
)


async def main() -> None:
    # Returns the rougeL fmeasure score as a float between 0.0 and 1.0
    score = await metric.single_turn_ascore(sample)
    print(score)


asyncio.run(main())

Using Rouge1 with Recall Mode

import asyncio

from ragas.metrics import RougeScore
from ragas.dataset_schema import SingleTurnSample

# Unigram overlap, scored relative to the reference length
metric = RougeScore(rouge_type="rouge1", mode="recall")

sample = SingleTurnSample(
    reference="The quick brown fox jumps over the lazy dog.",
    response="A quick brown fox jumped over a lazy dog.",
)


async def main() -> None:
    # Returns the rouge1 recall score as a float between 0.0 and 1.0
    score = await metric.single_turn_ascore(sample)
    print(score)


asyncio.run(main())

Related Pages

Environment:Vibrantlabsai_Ragas_Python_3_9_Core_Environment
