Implementation:Vibrantlabsai Ragas BleuScoreV2
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
BleuScore is a class-based v2 metric that computes the BLEU (Bilingual Evaluation Understudy) score between a reference text and a response text using the sacrebleu library. It validates its inputs automatically and exposes an async scoring interface.
Description
The BleuScore metric provides a modern, class-based implementation of the BLEU score for evaluating text generation quality. It inherits from BaseMetric (which combines SimpleBaseMetric and NumericValidator) and does not require any LLM or embedding model.
The scoring algorithm works as follows:
- The reference and response texts are split into sentences using ". " (period followed by space) as the delimiter.
- The reference sentences are formatted as a list of single-reference lists ([[ref1], [ref2], ...]) to conform to the sacrebleu API.
- The sacrebleu.corpus_bleu function computes the BLEU score across the sentence pairs.
- The raw BLEU score (0-100 scale) is divided by 100 to normalize it to the 0.0-1.0 range.
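The steps above can be sketched in a few lines of Python. This is a minimal illustration based only on the description in this section, not a copy of the library's source: the helper names prepare_bleu_inputs, normalize_bleu, and bleu_sketch are hypothetical. The sacrebleu call is wrapped in a function that is defined but not invoked here, so the preprocessing helpers run without sacrebleu installed.

```python
from typing import Any, Dict, List, Optional, Tuple


def prepare_bleu_inputs(
    reference: str, response: str
) -> Tuple[List[str], List[List[str]]]:
    """Split both texts on ". " and wrap each reference sentence in its own list."""
    reference_sentences = reference.split(". ")
    response_sentences = response.split(". ")
    # One single-element list per reference sentence: [[ref1], [ref2], ...]
    references = [[sentence] for sentence in reference_sentences]
    return response_sentences, references


def normalize_bleu(raw_score: float) -> float:
    """Map a 0-100 BLEU score onto the 0.0-1.0 range."""
    return raw_score / 100


def bleu_sketch(
    reference: str, response: str, kwargs: Optional[Dict[str, Any]] = None
) -> float:
    """Hypothetical end-to-end sketch; requires sacrebleu to be installed."""
    # Imported lazily so the helpers above stay usable without sacrebleu.
    from sacrebleu import corpus_bleu

    hypotheses, references = prepare_bleu_inputs(reference, response)
    raw = corpus_bleu(hypotheses, references, **(kwargs or {})).score
    return normalize_bleu(raw)


hypotheses, references = prepare_bleu_inputs(
    "The cat sat. It purred.", "A cat sat. It purred loudly."
)
print(hypotheses)            # ['A cat sat', 'It purred loudly.']
print(references)            # [['The cat sat'], ['It purred.']]
print(normalize_bleu(42.0))  # 0.42
```

Note that splitting on ". " leaves the final period attached to the last sentence and does not treat "?" or "!" as sentence boundaries, so texts that use other terminators are scored as a single sentence.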
Additional keyword arguments can be passed to corpus_bleu via the kwargs constructor parameter, allowing customization of the smoothing method, tokenization, lowercasing, and other sacrebleu options.
The metric returns a MetricResult object that behaves like a float but also carries optional metadata such as reasoning traces.
Usage
Use BleuScore when you need a standard n-gram overlap metric for evaluating machine translation or text generation quality. It is particularly useful as a baseline metric that does not require any external AI services. The sacrebleu library must be installed separately (pip install sacrebleu).
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/collections/_bleu_score.py
Signature
class BleuScore(BaseMetric):
    def __init__(
        self,
        name: str = "bleu_score",
        kwargs: t.Optional[t.Dict[str, t.Any]] = None,
        **base_kwargs,
    ):
Import
from ragas.metrics.collections import BleuScore
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| reference | str | Yes | The reference/ground truth text |
| response | str | Yes | The response text to evaluate against the reference |
| kwargs | Dict[str, Any] | No | Constructor argument; additional arguments passed to sacrebleu.corpus_bleu (e.g., smooth_method, tokenize, lowercase) |
Outputs
| Name | Type | Description |
|---|---|---|
| result | MetricResult | A MetricResult object with a value attribute containing the BLEU score between 0.0 and 1.0 |
Usage Examples
Basic Usage
from ragas.metrics.collections import BleuScore
metric = BleuScore()
result = await metric.ascore(
    reference="The capital of France is Paris.",
    response="Paris is the capital of France.",
)
print(f"BLEU Score: {result.value}")
Batch Evaluation
from ragas.metrics.collections import BleuScore
metric = BleuScore()
results = await metric.abatch_score([
    {"reference": "The cat sat on the mat.", "response": "A cat was sitting on a mat."},
    {"reference": "It is raining outside.", "response": "It is raining outside."},
])
for i, result in enumerate(results):
    print(f"Sample {i}: BLEU = {result.value}")
Custom sacrebleu Arguments
from ragas.metrics.collections import BleuScore
# Pass additional arguments to sacrebleu.corpus_bleu
metric = BleuScore(kwargs={"smooth_method": "exp"})
result = await metric.ascore(
    reference="The quick brown fox jumps over the lazy dog.",
    response="A fast brown fox leaps over a sleepy dog.",
)
print(f"BLEU Score (smoothed): {result.value}")