Implementation:Vibrantlabsai Ragas BleuScoreV2

Knowledge Sources	Vibrantlabsai_Ragas
Domains	Evaluation, Metrics
Last Updated	2026-02-12 00:00 GMT

Overview

BleuScore is a class-based v2 metric that computes the BLEU (Bilingual Evaluation Understudy) score between a reference text and a response text using the sacrebleu library, with automatic numeric validation and an async interface.

Description

The BleuScore metric provides a modern, class-based implementation of the BLEU score for evaluating text generation quality. It inherits from BaseMetric (which combines SimpleBaseMetric and NumericValidator) and does not require any LLM or embedding model.

The scoring algorithm works as follows:

1. The reference and response texts are split into sentences using ". " (period followed by space) as the delimiter.
2. The reference sentences are formatted as a list of single-reference lists ([[ref1], [ref2], ...]) to conform to the sacrebleu API.
3. The sacrebleu.corpus_bleu function computes the BLEU score across the sentence pairs.
4. The raw BLEU score (0-100 scale) is divided by 100 to normalize it to the 0.0-1.0 range.
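The steps above can be sketched in plain Python. This is a minimal illustration based only on the description on this page, not the Ragas source; the helper names split_sentences, format_references, normalize, and bleu_sketch are hypothetical. Only bleu_sketch needs sacrebleu, and it imports the library lazily.

```python
from typing import Any, Dict, List, Optional


def split_sentences(text: str) -> List[str]:
    """Step 1: split on the literal delimiter '. ' (period followed by a space)."""
    return text.split(". ")


def format_references(reference_sentences: List[str]) -> List[List[str]]:
    """Step 2: wrap each reference sentence in its own single-item list."""
    return [[sentence] for sentence in reference_sentences]


def normalize(raw_bleu: float) -> float:
    """Step 4: map a 0-100 BLEU score onto the 0.0-1.0 range."""
    return raw_bleu / 100


def bleu_sketch(
    reference: str,
    response: str,
    kwargs: Optional[Dict[str, Any]] = None,
) -> float:
    """Chain steps 1-4 as described above (hypothetical helper)."""
    import sacrebleu  # lazy import so the helpers above work without sacrebleu

    references = format_references(split_sentences(reference))
    hypotheses = split_sentences(response)
    # Step 3: corpus-level BLEU; extra options are forwarded unchanged
    bleu = sacrebleu.corpus_bleu(hypotheses, references, **(kwargs or {}))
    return normalize(bleu.score)


parts = split_sentences("It rained. We stayed in.")
print(parts)                     # ['It rained', 'We stayed in.']
print(format_references(parts))  # [['It rained'], ['We stayed in.']]
print(normalize(37.5))           # 0.375
```

Because the delimiter is exactly ". ", a trailing period stays attached to the last sentence, and text that ends sentences with "?" or "!" is not split at those points.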

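Additional keyword arguments can be forwarded to corpus_bleu by passing a dictionary as the kwargs constructor parameter. This allows customization of options that corpus_bleu accepts, such as the smoothing method (smooth_method), the tokenizer (tokenize), and lowercasing (lowercase).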

The metric returns a MetricResult object that behaves like a float but also carries optional metadata such as reasoning traces.
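To make the idea of a float-like result with attached metadata concrete, here is a minimal stand-in. It is not the Ragas MetricResult class; the name FloatLikeResult and its reason field are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FloatLikeResult:
    """Hypothetical stand-in; NOT the Ragas MetricResult implementation."""

    value: float
    reason: Optional[str] = None  # optional metadata carried with the score

    def __float__(self) -> float:
        # Lets float(result) return the underlying score
        return float(self.value)

    def __lt__(self, other) -> bool:
        # Allows comparisons such as result < 0.5
        return float(self) < float(other)


result = FloatLikeResult(value=0.42, reason="illustrative only")
print(float(result))   # 0.42
print(result < 0.5)    # True
print(result.reason)   # illustrative only
```

The design point is that callers who only need the number can treat the object numerically, while callers who want an explanation can read the extra attribute.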

Usage

Use BleuScore when you need a standard n-gram overlap metric for evaluating machine translation or text generation quality. It is particularly useful as a baseline metric that does not require any external AI services. The sacrebleu library must be installed separately (pip install sacrebleu).
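Since sacrebleu is a separate install, it can help to check for it before constructing the metric. The snippet below is a small sketch using only the standard library; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("sacrebleu"):
    print("sacrebleu is missing; install it with: pip install sacrebleu")
```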

Code Reference

Source Location

Repository: Vibrantlabsai_Ragas
File: src/ragas/metrics/collections/_bleu_score.py

Signature

class BleuScore(BaseMetric):
    def __init__(
        self,
        name: str = "bleu_score",
        kwargs: t.Optional[t.Dict[str, t.Any]] = None,
        **base_kwargs,
    ):

Import

from ragas.metrics.collections import BleuScore

I/O Contract

Inputs

Name	Type	Required	Description
reference	str	Yes	The reference/ground truth text
response	str	Yes	The response text to evaluate against the reference
kwargs	Dict[str, Any]	No	Constructor argument. Additional keyword arguments forwarded to sacrebleu.corpus_bleu (e.g., smooth_method, tokenize, lowercase)

Outputs

Name	Type	Description
result	MetricResult	A MetricResult object with a `value` attribute containing the BLEU score between 0.0 and 1.0

Usage Examples

Basic Usage

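import asyncio

from ragas.metrics.collections import BleuScore


async def main():
    metric = BleuScore()

    # ascore is a coroutine, so it is awaited inside an async function
    result = await metric.ascore(
        reference="The capital of France is Paris.",
        response="Paris is the capital of France.",
    )
    print(f"BLEU Score: {result.value}")


asyncio.run(main())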

Batch Evaluation

import asyncio

from ragas.metrics.collections import BleuScore


async def main():
    metric = BleuScore()

    # Each dict supplies the keyword arguments for one sample
    results = await metric.abatch_score([
        {"reference": "The cat sat on the mat.", "response": "A cat was sitting on a mat."},
        {"reference": "It is raining outside.", "response": "It is raining outside."},
    ])

    for i, result in enumerate(results):
        print(f"Sample {i}: BLEU = {result.value}")


asyncio.run(main())

Custom sacrebleu Arguments

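import asyncio

from ragas.metrics.collections import BleuScore


async def main():
    # Pass additional arguments to sacrebleu.corpus_bleu.
    # Note: "exp" is also sacrebleu's default smoothing method; it is used
    # here only to show how options are forwarded.
    metric = BleuScore(kwargs={"smooth_method": "exp"})

    result = await metric.ascore(
        reference="The quick brown fox jumps over the lazy dog.",
        response="A fast brown fox leaps over a sleepy dog.",
    )
    print(f"BLEU Score (smoothed): {result.value}")


asyncio.run(main())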

Related Pages

Environment:Vibrantlabsai_Ragas_Python_3_9_Core_Environment
