Implementation:Vibrantlabsai Ragas RougeScore

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Metrics
Last Updated: 2026-02-12 00:00 GMT

Overview

RougeScore is a non-LLM metric that computes ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores between a reference text and a generated response.

Description

This metric provides a lightweight, deterministic evaluation of text overlap between a reference and a generated response using the ROUGE scoring family. It does not require an LLM and instead relies on the rouge_score Python package with stemming enabled.

The metric supports two ROUGE variants:

  • rouge1: Measures unigram (single word) overlap between the reference and response.
  • rougeL: Measures the longest common subsequence (LCS) between the reference and response. It rewards words that appear in the same order in both texts, without requiring them to be contiguous.
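
To make the two variants concrete, the sketch below counts the matches each one is based on, in plain Python. It is a simplified illustration, not the rouge_score implementation: the tokenize helper is a hypothetical stand-in that lowercases and splits on non-alphanumeric characters and applies no stemming.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical simplified tokenizer: lowercase, keep alphanumeric runs, no stemming.
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_overlap(reference_tokens, response_tokens):
    # rouge1-style match count: each token counts at most as often as it
    # occurs in both texts (multiset intersection).
    shared = Counter(reference_tokens) & Counter(response_tokens)
    return sum(shared.values())


def lcs_length(reference_tokens, response_tokens):
    # rougeL-style match count: length of the longest common subsequence,
    # computed with a rolling dynamic-programming row.
    previous = [0] * (len(response_tokens) + 1)
    for ref_token in reference_tokens:
        current = [0]
        for j, res_token in enumerate(response_tokens, start=1):
            if ref_token == res_token:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


reference = tokenize("The capital of France is Paris.")
response = tokenize("Paris is the capital city of France.")

print(unigram_overlap(reference, response))  # 6
print(lcs_length(reference, response))       # 4
```

All six reference tokens occur somewhere in the response, but only four of them (the, capital, of, france) occur in the same order, so the LCS-based count is the stricter of the two.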

For each variant, one of three scoring modes can be selected:

  • fmeasure (default): The harmonic mean of precision and recall, providing a balanced measure.
  • precision: The number of matched tokens (overlapping unigrams for rouge1, LCS length for rougeL) divided by the number of tokens in the response.
  • recall: The number of matched tokens divided by the number of tokens in the reference.
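
The relationship between the three modes can be sketched as follows. The function name rouge_modes and its parameters are hypothetical, chosen for illustration; matches stands for the unigram overlap count (rouge1) or the LCS length (rougeL), and the lengths are token counts.

```python
def rouge_modes(matches, reference_len, response_len):
    # Precision: share of response tokens that were matched.
    precision = matches / response_len if response_len else 0.0
    # Recall: share of reference tokens that were matched.
    recall = matches / reference_len if reference_len else 0.0
    # F-measure: harmonic mean of the two, defined as 0.0 when both are zero.
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Example: 4 matched tokens, a 6-token reference, and a 7-token response.
scores = rouge_modes(matches=4, reference_len=6, response_len=7)
print(scores)
```

Because the harmonic mean is pulled toward the smaller value, fmeasure penalizes a response that is much longer or much shorter than the reference more than a simple average would.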

The metric uses stemming to normalize words before comparison, which helps match inflected forms (e.g., "running" and "run").
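
The effect of stemming can be observed by calling the rouge_score package directly with and without its use_stemmer option. This is a minimal sketch that bypasses the Ragas wrapper; it assumes rouge_score is installed and skips the comparison otherwise. The helper rouge1_recall is a hypothetical name used only for this example.

```python
try:
    from rouge_score import rouge_scorer
except ImportError:
    rouge_scorer = None  # package not installed; the comparison below is skipped


def rouge1_recall(reference, response, use_stemmer):
    # RougeScorer.score takes (target, prediction) and returns a dict of
    # Score tuples with precision, recall, and fmeasure fields.
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=use_stemmer)
    return scorer.score(reference, response)["rouge1"].recall


reference = "The quick brown fox jumps over the lazy dog."
response = "A quick brown fox jumped over a lazy dog."

if rouge_scorer is not None:
    plain = rouge1_recall(reference, response, use_stemmer=False)
    stemmed = rouge1_recall(reference, response, use_stemmer=True)
    # "jumps" and "jumped" only match once both are reduced to the same stem.
    print(f"without stemming: {plain:.3f}, with stemming: {stemmed:.3f}")
```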

Usage

Use this metric for fast, deterministic evaluation of text generation quality without LLM inference. It is particularly useful for summarization tasks, as a baseline alongside LLM-based evaluations, or when LLM API costs are a concern. The rouge_score package is not bundled with the metric and must be installed separately (for example, with pip install rouge_score).

Code Reference

Source Location

Signature

@dataclass
class RougeScore(SingleTurnMetric):
    name: str = "rouge_score"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {MetricType.SINGLE_TURN: {"reference", "response"}}
    )
    rouge_type: t.Literal["rouge1", "rougeL"] = "rougeL"
    mode: t.Literal["fmeasure", "precision", "recall"] = "fmeasure"

Import

from ragas.metrics import RougeScore

I/O Contract

Inputs

Name | Type | Required | Description
reference | str | Yes | The ground truth reference text to compare against
response | str | Yes | The AI-generated response text to evaluate

Configuration

Name | Type | Default | Description
rouge_type | Literal["rouge1", "rougeL"] | "rougeL" | The type of ROUGE score to compute (unigram overlap or longest common subsequence)
mode | Literal["fmeasure", "precision", "recall"] | "fmeasure" | The scoring mode: F1, precision, or recall

Outputs

Name | Type | Description
score | float | A value between 0.0 and 1.0 representing the ROUGE score; higher is better

Usage Examples

Basic Usage (Default rougeL F-measure)

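import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RougeScore

# Defaults: rouge_type="rougeL", mode="fmeasure"
metric = RougeScore()

sample = SingleTurnSample(
    reference="The capital of France is Paris.",
    response="Paris is the capital city of France.",
)

# single_turn_ascore is a coroutine, so a script needs an event loop.
# In a notebook or inside an async function, use `await` directly instead.
score = asyncio.run(metric.single_turn_ascore(sample))
print(score)  # the rougeL fmeasure score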

Using Rouge1 with Recall Mode

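import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RougeScore

# Unigram overlap, reported as recall against the reference
metric = RougeScore(rouge_type="rouge1", mode="recall")

sample = SingleTurnSample(
    reference="The quick brown fox jumps over the lazy dog.",
    response="A quick brown fox jumped over a lazy dog.",
)

# single_turn_ascore is a coroutine, so a script needs an event loop.
# In a notebook or inside an async function, use `await` directly instead.
score = asyncio.run(metric.single_turn_ascore(sample))
print(score)  # the rouge1 recall score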
