Implementation:Vibrantlabsai Ragas BleuScoreV2

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Metrics
Last Updated: 2026-02-12 00:00 GMT

Overview

BleuScore is a class-based v2 metric that calculates the BLEU (Bilingual Evaluation Understudy) score between reference and response texts using the sacrebleu library, with automatic validation and async design.

Description

The BleuScore metric provides a modern, class-based implementation of the BLEU score for evaluating text generation quality. It inherits from BaseMetric (which combines SimpleBaseMetric and NumericValidator) and does not require any LLM or embedding model.

The scoring algorithm works as follows:

  1. The reference and response texts are split into sentences using ". " (period followed by space) as the delimiter.
  2. The reference sentences are formatted as a list of single-reference lists ([[ref1], [ref2], ...]) to conform to the sacrebleu API.
  3. The sacrebleu.corpus_bleu function computes the BLEU score across the sentence pairs.
  4. The raw BLEU score (0-100 scale) is divided by 100 to normalize it to the 0.0-1.0 range.
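
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the ragas source: `prepare_inputs`, `bleu_pipeline`, and the `corpus_scorer` argument are hypothetical names, and `corpus_scorer` stands in for any callable that returns a BLEU score on the 0-100 scale.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def prepare_inputs(reference: str, response: str) -> Tuple[List[str], List[List[str]]]:
    """Steps 1-2: split both texts on '. ' and wrap each reference sentence in a list."""
    reference_sentences = reference.split(". ")
    response_sentences = response.split(". ")
    formatted_references = [[sentence] for sentence in reference_sentences]
    return response_sentences, formatted_references


def bleu_pipeline(
    reference: str,
    response: str,
    corpus_scorer: Callable[..., float],  # hypothetical stand-in for the sacrebleu call
    kwargs: Optional[Dict[str, Any]] = None,
) -> float:
    """Steps 3-4: score the prepared inputs, then rescale from 0-100 to 0.0-1.0."""
    hypotheses, references = prepare_inputs(reference, response)
    raw_score = corpus_scorer(hypotheses, references, **(kwargs or {}))
    return raw_score / 100.0
```

With sacrebleu installed, a scorer such as `lambda hyp, refs, **kw: sacrebleu.corpus_bleu(hyp, refs, **kw).score` could be passed as `corpus_scorer`. Note that splitting on ". " leaves the final period attached to the last sentence of each text.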

Additional keyword arguments can be passed to corpus_bleu via the constructor's kwargs parameter, allowing customization of the smoothing method, tokenization, lowercasing, and other sacrebleu options.

The metric returns a MetricResult object that behaves like a float but also carries optional metadata such as reasoning traces.
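
To make "behaves like a float but carries metadata" concrete, here is a small hypothetical stand-in. `FloatLikeResult` is an invented class for illustration only; it is not the ragas MetricResult, whose internals are not shown on this page and may differ.

```python
from typing import Optional


class FloatLikeResult(float):
    """Hypothetical illustration: a float that also carries an optional reason."""

    def __new__(cls, value: float, reason: Optional[str] = None):
        obj = super().__new__(cls, value)
        obj.reason = reason  # optional metadata travelling with the score
        return obj

    @property
    def value(self) -> float:
        # Expose the plain numeric score, mirroring the `value` attribute used on this page.
        return float(self)
```

An object like this can be compared, averaged, or formatted as a number while the metadata stays available to callers that want it.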

Usage

Use BleuScore when you need a standard n-gram overlap metric for evaluating machine translation or text generation quality. It is particularly useful as a baseline metric that does not require any external AI services. The sacrebleu library must be installed separately (pip install sacrebleu).
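
For readers new to n-gram overlap, the core quantity behind BLEU is clipped (modified) n-gram precision. The sketch below computes it for one whitespace-tokenized sentence pair using only the standard library. It is a simplified illustration: the helper names are hypothetical, and sacrebleu additionally applies its own tokenization, a brevity penalty, smoothing, and a geometric mean over n-gram orders.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference: str, response: str, n: int) -> float:
    """Fraction of response n-grams found in the reference, with counts clipped."""
    ref_counts = Counter(ngrams(reference.split(), n))
    resp_counts = Counter(ngrams(response.split(), n))
    total = sum(resp_counts.values())
    if total == 0:
        return 0.0
    # Each response n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in resp_counts.items())
    return matched / total
```

Clipping is what stops a response from scoring well by repeating a common word: for the reference "the cat sat on the mat" and the response "the the the cat", only two of the three occurrences of "the" are credited.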

Code Reference

Source Location

Signature

class BleuScore(BaseMetric):
    def __init__(
        self,
        name: str = "bleu_score",
        kwargs: t.Optional[t.Dict[str, t.Any]] = None,
        **base_kwargs,
    ):

Import

from ragas.metrics.collections import BleuScore

I/O Contract

Inputs

Name      | Type           | Required | Description
reference | str            | Yes      | The reference (ground-truth) text
response  | str            | Yes      | The response text to evaluate against the reference
kwargs    | Dict[str, Any] | No       | Constructor argument: extra options forwarded to sacrebleu.corpus_bleu (e.g., smooth_method, tokenize, lowercase)

Outputs

Name   | Type         | Description
result | MetricResult | A MetricResult object with a value attribute containing the BLEU score between 0.0 and 1.0

Usage Examples

Basic Usage

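import asyncio

from ragas.metrics.collections import BleuScore


async def main():
    # No LLM or embedding model is needed; sacrebleu must be installed.
    metric = BleuScore()
    result = await metric.ascore(
        reference="The capital of France is Paris.",
        response="Paris is the capital of France.",
    )
    print(f"BLEU Score: {result.value}")


# ascore is a coroutine, so it is driven by an event loop outside a notebook.
asyncio.run(main())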

Batch Evaluation

import asyncio

from ragas.metrics.collections import BleuScore


async def main():
    metric = BleuScore()
    # Each dict holds one reference/response pair.
    results = await metric.abatch_score([
        {"reference": "The cat sat on the mat.", "response": "A cat was sitting on a mat."},
        {"reference": "It is raining outside.", "response": "It is raining outside."},
    ])
    for i, result in enumerate(results):
        print(f"Sample {i}: BLEU = {result.value}")


asyncio.run(main())

Custom sacrebleu Arguments

import asyncio

from ragas.metrics.collections import BleuScore


async def main():
    # These options are forwarded to sacrebleu.corpus_bleu. "exp" is already
    # sacrebleu's default smoothing, so "floor" is selected here instead.
    metric = BleuScore(kwargs={"smooth_method": "floor", "smooth_value": 0.1})
    result = await metric.ascore(
        reference="The quick brown fox jumps over the lazy dog.",
        response="A fast brown fox leaps over a sleepy dog.",
    )
    print(f"BLEU Score (smoothed): {result.value}")


asyncio.run(main())
