Implementation:Vibrantlabsai Ragas BleuScore

Knowledge Sources	Vibrantlabsai_Ragas
Domains	Evaluation, Metrics
Last Updated	2026-02-12 00:00 GMT

Overview

BleuScore computes the BLEU (Bilingual Evaluation Understudy) score between a generated response and a reference answer using the sacrebleu library.

Description

The BleuScore metric evaluates the quality of a generated response by computing its BLEU score against a reference answer. BLEU is a well-established metric originally designed for machine translation evaluation that measures n-gram overlap between a candidate text and one or more reference texts.
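To make "n-gram overlap" concrete, the following is a minimal, simplified sketch of a single-reference BLEU computation. It is an illustration only: the names `ngrams` and `simple_bleu` are hypothetical, the tokenization is plain whitespace splitting, and no smoothing is applied, so its output will generally differ from what sacrebleu reports.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on whitespace tokens, no smoothing.

    Illustration only; this is not sacrebleu's implementation.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

With these definitions, an identical candidate and reference score 1.0, and a pair with no shared tokens scores 0.0.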

The implementation uses the sacrebleu library's corpus_bleu function. Both texts are split into sentences on the delimiter ". " (a period followed by a space), and the BLEU score is computed at the corpus level over the resulting sentence lists. The raw sacrebleu score, which ranges from 0 to 100, is divided by 100 to produce a value between 0.0 and 1.0.

This metric does not require an LLM or embedding model; it is a purely statistical text comparison metric. Its only requirement is the sacrebleu package, which must be installed separately (pip install sacrebleu).

Additional keyword arguments can be supplied through the kwargs dictionary. They are forwarded to the underlying corpus_bleu function, for example to set the smoothing method (smooth_method) or to enable effective order (use_effective_order).

Usage

Use this metric when you want a fast, deterministic, reference-based evaluation of text similarity at the n-gram level. It is useful as a baseline metric or when LLM-based evaluation is not desired. Note that BLEU measures exact n-gram overlap, so a paraphrase that shares few n-grams with the reference can score low even when its meaning matches.

Code Reference

Source Location

Repository: Vibrantlabsai_Ragas
File: src/ragas/metrics/_bleu_score.py

Signature

@dataclass
class BleuScore(SingleTurnMetric):
    name: str = "bleu_score"
    kwargs: t.Dict[str, t.Any] = field(default_factory=dict)

Import

from ragas.metrics import BleuScore

I/O Contract

Inputs

Name	Type	Required	Description
reference	str	Yes	The ground truth reference answer
response	str	Yes	The generated response to evaluate
kwargs	dict	No	Additional keyword arguments passed to sacrebleu's corpus_bleu function

Outputs

Name	Type	Description
score	float	BLEU score normalized to the range 0.0 to 1.0

Dependencies

This metric requires the sacrebleu package:

pip install sacrebleu

The dependency check is performed in __post_init__, and a descriptive ImportError is raised if the package is not available.
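The pattern described here can be sketched as follows. This is a hedged illustration rather than the Ragas source: the class `OptionalDependencyMetric` and its `module_name`, `function_name`, and `scorer` fields are hypothetical, introduced only to show an import check inside `__post_init__` that raises a descriptive ImportError.

```python
import importlib
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class OptionalDependencyMetric:
    """Hypothetical stand-in for a metric that needs an optional package."""

    module_name: str = "sacrebleu"
    function_name: str = "corpus_bleu"
    kwargs: Dict[str, Any] = field(default_factory=dict)
    scorer: Optional[Callable[..., Any]] = field(default=None, init=False)

    def __post_init__(self):
        try:
            module = importlib.import_module(self.module_name)
        except ImportError as exc:
            # Re-raise with an actionable installation hint.
            raise ImportError(
                f"{self.module_name} is required for this metric. "
                f"Install it with `pip install {self.module_name}`."
            ) from exc
        # Keep a reference to the scoring function for later calls.
        self.scorer = getattr(module, self.function_name)
```

Failing at construction time means a missing package is reported when the metric is created, not partway through an evaluation run.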

Internal Components

Sentence Splitting

Both the reference and response texts are split into sentences using the simple delimiter ". " (a period followed by a space):

reference_sentences = reference.split(". ")
response_sentences = response.split(". ")
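Because the delimiter is a literal two-character string, the split has a few edge cases worth knowing. The short sketch below uses a hypothetical helper, `split_sentences`, that wraps the same `split(". ")` call to show them.

```python
def split_sentences(text):
    """Split on the literal two-character delimiter ". "."""
    return text.split(". ")


# The final sentence keeps its trailing period; earlier ones lose theirs.
print(split_sentences("The sky is blue. Grass is green."))
# ['The sky is blue', 'Grass is green.']

# Abbreviations followed by a space are treated as sentence boundaries.
print(split_sentences("Dr. Smith arrived. He was late."))
# ['Dr', 'Smith arrived', 'He was late.']

# Other terminators ("?" and "!") do not split the text.
print(split_sentences("Is it raining? Yes! It is."))
# ['Is it raining? Yes! It is.']
```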

Score Computation

The sacrebleu corpus_bleu function is called with the response sentences as hypotheses and the reference sentences as individual references:

reference = [[reference] for reference in reference_sentences]
response = response_sentences
score = self.corpus_bleu(response, reference, **self.kwargs).score / 100
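To show the shape of the arguments and the rescaling step without requiring sacrebleu, the sketch below repeats the same steps with an injectable scorer. The names `bleu_like_score` and `stub_scorer` are hypothetical, and the stub returns a fixed value; it only stands in for a function that, like corpus_bleu, returns an object with a `.score` attribute on a 0 to 100 scale.

```python
from types import SimpleNamespace


def bleu_like_score(reference, response, corpus_scorer, **kwargs):
    """Mirror the steps described above with an injectable scorer."""
    reference_sentences = reference.split(". ")
    response_sentences = response.split(". ")
    # Each reference sentence is wrapped in its own single-element list.
    references = [[sentence] for sentence in reference_sentences]
    raw = corpus_scorer(response_sentences, references, **kwargs).score
    return raw / 100  # rescale from 0-100 to 0.0-1.0


def stub_scorer(hypotheses, references, **kwargs):
    """Hypothetical stand-in so the example runs without sacrebleu."""
    # Record the arguments so their shape can be inspected afterwards.
    stub_scorer.seen = (hypotheses, references, kwargs)
    return SimpleNamespace(score=42.0)


score = bleu_like_score(
    "A is B. C is D.", "A is B. C is E.", stub_scorer, lowercase=True
)
hypotheses, references, forwarded = stub_scorer.seen
print(hypotheses)   # flat list of response sentences
print(references)   # list of single-element lists
print(forwarded)    # keyword arguments passed through unchanged
print(score)
```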

Usage Examples

Basic Usage

from ragas.metrics import BleuScore
from ragas import evaluate
from datasets import Dataset

data = {
    "response": ["The cat sat on the mat."],
    "reference": ["The cat is sitting on the mat."],
}
dataset = Dataset.from_dict(data)

results = evaluate(dataset, metrics=[BleuScore()])
print(results)

With Custom Parameters

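import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import BleuScore

# Forward additional options to sacrebleu's corpus_bleu
bleu = BleuScore(kwargs={"smooth_method": "floor", "smooth_value": 0.1})

sample = SingleTurnSample(
    reference="The sun is powered by nuclear fusion.",
    response="Nuclear fusion powers the sun.",
)

# Score the single sample through the metric's async entry point
score = asyncio.run(bleu.single_turn_ascore(sample))
print(score)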

Related Pages

Environment:Vibrantlabsai_Ragas_Python_3_9_Core_Environment
