Implementation:Microsoft LoRA NLG Eval Script
Overview
NLG_Eval_Script is the evaluation wrapper that computes automatic NLG metrics (BLEU, METEOR, TER, chrF++, BERTScore, BLEURT) on decoded prediction and reference files. It orchestrates multiple external metric implementations (Perl scripts, Java JARs, Python libraries) through a unified Python interface.
Type
Wrapper Doc
Source
examples/NLG/eval/eval.py (lines 270-364)
CLI Signature
python eval/eval.py \
-R <reference_path> -H <hypothesis_path> \
-nr <num_refs> -m bleu,meteor,ter,chrf++,bert,bleurt \
[-lng <language>] [-nc <ncorder>] [-nw <nworder>] [-b <beta>]
Argument reference:
| Argument | Type | Default | Description |
|---|---|---|---|
| -R / --reference | str | required | Path to reference file(s). For multi-reference, this is the base path (files reference0, reference1, ... are expected). |
| -H / --hypothesis | str | required | Path to hypothesis (prediction) file |
| -nr / --num_refs | int | 4 | Number of reference files |
| -m / --metrics | str | bleu,meteor,ter,chrf++,bert,bleurt | Comma-separated list of metrics to compute |
| -lng / --language | str | en | Evaluation language |
| -nc / --ncorder | int | 6 | chrF character n-gram order |
| -nw / --nworder | int | 2 | chrF word n-gram order |
| -b / --beta | float | 2.0 | chrF beta parameter |
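The argument table can be read as an argparse definition. The following is a minimal sketch built only from the flags, types, and defaults listed above; the function name build_parser and the help strings are assumptions for illustration and are not copied from eval.py.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the argument table; not the code in eval.py."""
    parser = argparse.ArgumentParser(description="Compute automatic NLG metrics")
    parser.add_argument("-R", "--reference", type=str, required=True,
                        help="path (or base path) of the reference file(s)")
    parser.add_argument("-H", "--hypothesis", type=str, required=True,
                        help="path of the hypothesis (prediction) file")
    parser.add_argument("-nr", "--num_refs", type=int, default=4,
                        help="number of reference files")
    parser.add_argument("-m", "--metrics", type=str,
                        default="bleu,meteor,ter,chrf++,bert,bleurt",
                        help="comma-separated list of metrics")
    parser.add_argument("-lng", "--language", type=str, default="en",
                        help="evaluation language")
    parser.add_argument("-nc", "--ncorder", type=int, default=6,
                        help="chrF character n-gram order")
    parser.add_argument("-nw", "--nworder", type=int, default=2,
                        help="chrF word n-gram order")
    parser.add_argument("-b", "--beta", type=float, default=2.0,
                        help="chrF beta parameter")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["-R", "refs", "-H", "hyps.txt", "-m", "bleu,ter"])
print(args.num_refs, args.metrics.split(","))
```

Only -R and -H must be supplied; every other option falls back to the default shown in the table.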
Key Internal Function
run()
def run(refs_path, hyps_path, num_refs, lng='en',
metrics='bleu,meteor,chrf++,ter,bert,bleurt',
ncorder=6, nworder=2, beta=2):
Returns: dict containing metric scores. Possible keys:
| Key | Type | Description |
|---|---|---|
| 'bleu' | float | Multi-BLEU score (Perl script) |
| 'bleu_nltk' | float | NLTK corpus BLEU with smoothing |
| 'meteor' | float | METEOR score (Java JAR) |
| 'chrf++' | float | chrF++ total F-score |
| 'ter' | float | Translation Edit Rate |
| 'bert_precision' | float | BERTScore precision |
| 'bert_recall' | float | BERTScore recall |
| 'bert_f1' | float | BERTScore F1 |
| 'bleurt' | float | BLEURT score (English only) |
The function first calls parse(refs_path, hyps_path, num_refs, lng) to load and tokenize all references and hypotheses, then dispatches to individual metric functions based on the requested metrics list.
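The dispatch step described above can be sketched as a lookup from metric name to scorer. This is an illustration, not the code in eval.py: run_sketch, the scorers mapping, and the choice to raise on an unknown name are all assumptions, and the 0.0 values are placeholders rather than real scores.

```python
def run_sketch(metrics, scorers):
    """Call each requested scorer and merge its scores into one dict.

    `metrics` is a comma-separated string such as "bleu,ter".
    `scorers` maps a metric name to a zero-argument callable that
    returns a dict of score keys (hypothetical interface).
    """
    requested = [name.strip().lower() for name in metrics.split(",") if name.strip()]
    results = {}
    for name in requested:
        if name not in scorers:
            # Assumption of this sketch: unknown names are an error.
            raise ValueError(f"unknown metric: {name}")
        results.update(scorers[name]())
    return results


# Placeholder scorers. A single metric name may contribute several keys,
# as 'bleu' and 'bert' do in the return-value table.
placeholder_scorers = {
    "bleu": lambda: {"bleu": 0.0, "bleu_nltk": 0.0},
    "ter": lambda: {"ter": 0.0},
    "bert": lambda: {"bert_precision": 0.0, "bert_recall": 0.0, "bert_f1": 0.0},
}

print(sorted(run_sketch("bleu,ter", placeholder_scorers)))
```

Keeping the scorers in a mapping means a metric that was not requested is never invoked, which matters when a scorer starts an external process or loads a large model.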
Individual Metric Functions
- bleu_score(refs_path, hyps_path, num_refs) -- Calls perl metrics/multi-bleu-detok.perl via subprocess.
- bleu_nltk(references, hypothesis) -- Computes NLTK corpus BLEU with Method 3 smoothing.
- meteor_score(references, hypothesis, num_refs, lng) -- Calls java -jar metrics/meteor-1.5/meteor-1.5.jar via subprocess.
- chrF_score(references, hypothesis, num_refs, nworder, ncorder, beta) -- Uses the internal computeChrF function.
- ter_score(references, hypothesis, num_refs) -- Uses the pyter library to compute TER for each hypothesis-reference pair, taking the minimum TER across references.
- bert_score_(references, hypothesis, lng) -- Uses the bert_score library's score() function.
- bleurt(references, hypothesis, num_refs, checkpoint) -- Uses the BLEURT scorer with a pretrained checkpoint (metrics/bleurt/bleurt-base-128).
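The "minimum across references" rule used for TER can be illustrated with a simplified stand-in. The sketch below does not call pyter: it uses a plain word-level Levenshtein distance divided by the reference length, which omits the block shifts that full TER allows. The function names and whitespace tokenization are assumptions, and it scores a single hypothesis only.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance (insert, delete, substitute; no shifts)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, hyp_word in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref_tokens, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def min_edit_rate(hypothesis, references):
    """Lowest edit rate of one hypothesis over all non-empty references."""
    hyp_tokens = hypothesis.split()
    rates = []
    for reference in references:
        ref_tokens = reference.split()
        if ref_tokens:  # skip empty references to avoid division by zero
            rates.append(word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens))
    if not rates:
        raise ValueError("at least one non-empty reference is required")
    return min(rates)


print(min_edit_rate("the cat sat", ["the cat sat on the mat", "a cat sat"]))
```

Taking the minimum credits the hypothesis against whichever reference it is closest to, so a valid paraphrase is not penalized because the other references are worded differently.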
External Dependencies
| Dependency | Type | Purpose |
|---|---|---|
| nltk | Python package | BLEU computation, word tokenization |
| pyter | Python package | TER (Translation Edit Rate) computation |
| bert_score | Python package | BERTScore computation |
| metrics/bleurt/ | Python package (local) | BLEURT learned metric |
| metrics/multi-bleu-detok.perl | Perl script | Standard multi-reference BLEU |
| metrics/meteor-1.5/meteor-1.5.jar | Java JAR | METEOR metric computation |
| razdel | Python package | Russian tokenization (for lng=ru) |
| tabulate | Python package | Pretty-printing results table |
These external tools can be downloaded using bash eval/download_evalscript.sh.
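Because the metrics depend on a mix of Python packages, interpreters, and downloaded files, a preflight check before a long evaluation run can be useful. The sketch below is not part of the repository: missing_dependencies is a hypothetical helper, the import names are assumed to match the package names in the table, and the relative paths are assumed to resolve against the directory passed as base_dir.

```python
import importlib.util
import os
import shutil

# Names taken from the dependency table; import names are assumed to match.
PYTHON_PACKAGES = ["nltk", "pyter", "bert_score", "razdel", "tabulate"]
EXECUTABLES = ["perl", "java"]
LOCAL_FILES = [
    "metrics/multi-bleu-detok.perl",
    "metrics/meteor-1.5/meteor-1.5.jar",
]


def missing_dependencies(base_dir="."):
    """Return a list of human-readable entries for everything not found."""
    missing = []
    for package in PYTHON_PACKAGES:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(package) is None:
            missing.append(f"python package: {package}")
    for executable in EXECUTABLES:
        # shutil.which returns None when the program is not on PATH.
        if shutil.which(executable) is None:
            missing.append(f"executable: {executable}")
    for relative_path in LOCAL_FILES:
        if not os.path.isfile(os.path.join(base_dir, relative_path)):
            missing.append(f"file: {relative_path}")
    return missing


for entry in missing_dependencies():
    print("missing", entry)
```

An empty list means every entry was found; it does not confirm that the installed versions are the ones the script expects.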
Input / Output
| Direction | Description |
|---|---|
| Input | Reference file(s) at the path given by -R and a hypothesis file at the path given by -H |
| Output | Dictionary of metric scores (printed as a formatted table via tabulate, also available as a return value from run()) |
Example
# Evaluate E2E predictions with BLEU, METEOR, and TER
python eval/eval.py \
-R output/e2e/references \
-H output/e2e/predictions.txt \
-nr 4 \
-m bleu,meteor,ter
# Example output:
#   BLEU   BLEU NLTK    METEOR    TER
# ------  ----------  --------  -----
#   68.2        0.69      0.46   0.41
Metadata
| Field | Value |
|---|---|
| Source | microsoft/LoRA |
| Type | Wrapper Doc |
| Last Updated | 2026-02-10 |