Implementation:Microsoft LoRA NLG Eval Script

Overview

NLG_Eval_Script is the evaluation wrapper that computes automatic NLG metrics (BLEU, METEOR, TER, chrF++, BERTScore, BLEURT) on decoded prediction and reference files. It orchestrates multiple external metric implementations (Perl scripts, Java JARs, Python libraries) through a unified Python interface.

Type

Wrapper Doc

Source

examples/NLG/eval/eval.py (lines 270-364)

CLI Signature

python eval/eval.py \
    -R <reference_path> -H <hypothesis_path> \
    -nr <num_refs> -m bleu,meteor,ter,chrf++,bert,bleurt \
    [-lng <language>] [-nc <ncorder>] [-nw <nworder>] [-b <beta>]

Argument reference:

Argument	Type	Default	Description
`-R / --reference`	str	required	Path to reference file(s). For multi-reference, this is the base path (files `reference0`, `reference1`, ... are expected).
`-H / --hypothesis`	str	required	Path to hypothesis (prediction) file
`-nr / --num_refs`	int	4	Number of reference files
`-m / --metrics`	str	bleu,meteor,ter,chrf++,bert,bleurt	Comma-separated list of metrics to compute
`-lng / --language`	str	en	Evaluation language
`-nc / --ncorder`	int	6	chrF character n-gram order
`-nw / --nworder`	int	2	chrF word n-gram order
`-b / --beta`	float	2.0	chrF beta parameter
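
The argument table above can be mirrored with the standard-library `argparse` module. The following is a minimal sketch, not the script's actual parser: the function name `build_parser` and the help strings are assumptions made for illustration, while the flags, types, and defaults are taken from the table.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="NLG evaluation (illustrative sketch)")
    parser.add_argument("-R", "--reference", type=str, required=True,
                        help="reference file, or base path for numbered reference files")
    parser.add_argument("-H", "--hypothesis", type=str, required=True,
                        help="hypothesis (prediction) file")
    parser.add_argument("-nr", "--num_refs", type=int, default=4,
                        help="number of reference files")
    parser.add_argument("-m", "--metrics", type=str,
                        default="bleu,meteor,ter,chrf++,bert,bleurt",
                        help="comma-separated list of metrics")
    parser.add_argument("-lng", "--language", type=str, default="en",
                        help="evaluation language")
    parser.add_argument("-nc", "--ncorder", type=int, default=6,
                        help="chrF character n-gram order")
    parser.add_argument("-nw", "--nworder", type=int, default=2,
                        help="chrF word n-gram order")
    parser.add_argument("-b", "--beta", type=float, default=2.0,
                        help="chrF beta parameter")
    return parser


# Only the two required arguments are given; everything else falls back to defaults.
args = build_parser().parse_args(["-R", "refs/reference", "-H", "predictions.txt"])
requested_metrics = args.metrics.split(",")
```

Passing only `-R` and `-H` therefore requests all six metrics with four reference files.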

Key Internal Function

run()

def run(refs_path, hyps_path, num_refs, lng='en',
        metrics='bleu,meteor,chrf++,ter,bert,bleurt',
        ncorder=6, nworder=2, beta=2):

Returns: dict containing metric scores. Possible keys:

Key	Type	Description
`'bleu'`	float	Multi-BLEU score (Perl script)
`'bleu_nltk'`	float	NLTK corpus BLEU with smoothing
`'meteor'`	float	METEOR score (Java JAR)
`'chrf++'`	float	chrF++ total F-score
`'ter'`	float	Translation Edit Rate
`'bert_precision'`	float	BERTScore precision
`'bert_recall'`	float	BERTScore recall
`'bert_f1'`	float	BERTScore F1
`'bleurt'`	float	BLEURT score (English only)

The function first calls parse(refs_path, hyps_path, num_refs, lng) to load and tokenize all references and hypotheses, then dispatches to individual metric functions based on the requested metrics list.
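
The dispatch step described above can be pictured as a lookup from metric name to scoring callable. This is a sketch under stated assumptions, not the code in eval.py: `dispatch_metrics` and `placeholder_scorers` are hypothetical names, the scorers return dummy zeros instead of calling the real metric implementations, and the choice to have `bleu` contribute both `bleu` and `bleu_nltk` is an assumption based on the result keys listed earlier.

```python
def dispatch_metrics(metrics, scorers):
    """Run each requested scorer and merge the per-metric results into one dict.

    `metrics` is a comma-separated string such as "bleu,ter".
    `scorers` maps a metric name to a zero-argument callable returning a dict.
    """
    requested = [name.strip() for name in metrics.split(",") if name.strip()]
    unknown = [name for name in requested if name not in scorers]
    if unknown:
        raise ValueError(f"Unknown metric(s): {', '.join(unknown)}")

    scores = {}
    for name in requested:
        # One requested metric may contribute several result keys.
        scores.update(scorers[name]())
    return scores


# Hypothetical stand-ins: placeholder values only, no real metric is computed.
placeholder_scorers = {
    "bleu": lambda: {"bleu": 0.0, "bleu_nltk": 0.0},
    "ter": lambda: {"ter": 0.0},
    "bert": lambda: {"bert_precision": 0.0, "bert_recall": 0.0, "bert_f1": 0.0},
}

scores = dispatch_metrics("bleu,bert", placeholder_scorers)
```

Returning a dict from every scorer keeps the merge step uniform even when a metric such as BERTScore reports precision, recall, and F1 separately.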

Individual Metric Functions

bleu_score(refs_path, hyps_path, num_refs) -- Calls perl metrics/multi-bleu-detok.perl via subprocess.
bleu_nltk(references, hypothesis) -- Computes NLTK corpus BLEU with Method 3 smoothing.
meteor_score(references, hypothesis, num_refs, lng) -- Calls java -jar metrics/meteor-1.5/meteor-1.5.jar via subprocess.
chrF_score(references, hypothesis, num_refs, nworder, ncorder, beta) -- Uses internal computeChrF function.
ter_score(references, hypothesis, num_refs) -- Uses the pyter library to compute TER for each hypothesis-reference pair, taking the minimum TER across references.
bert_score_(references, hypothesis, lng) -- Uses the bert_score library's score() function.
bleurt(references, hypothesis, num_refs, checkpoint) -- Uses the BLEURT scorer with a pretrained checkpoint (metrics/bleurt/bleurt-base-128).
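
To illustrate the "minimum across references" rule used by ter_score, the sketch below swaps in a simplified edit rate: plain word-level Levenshtein distance divided by the reference length. Full TER also counts block shifts, which this stand-in omits, and it does not call pyter. The names `word_edit_distance` and `min_edit_rate` are hypothetical.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(ref_tokens) + 1))
    for i, hyp_word in enumerate(hyp_tokens, start=1):
        current = [i]
        for j, ref_word in enumerate(ref_tokens, start=1):
            substitution = previous[j - 1] + (0 if hyp_word == ref_word else 1)
            current.append(min(previous[j] + 1,      # delete a hypothesis word
                               current[j - 1] + 1,   # insert a reference word
                               substitution))
        previous = current
    return previous[-1]


def min_edit_rate(hypothesis, references):
    """Lowest edit rate of one hypothesis against any non-empty reference."""
    hyp_tokens = hypothesis.split()
    rates = []
    for reference in references:
        ref_tokens = reference.split()
        if not ref_tokens:
            continue  # skip empty references to avoid dividing by zero
        rates.append(word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens))
    if not rates:
        raise ValueError("at least one non-empty reference is required")
    return min(rates)


# The first reference matches exactly, so it determines the minimum.
best_rate = min_edit_rate("the cat sat", ["the cat sat", "a dog ran"])
```

Taking the minimum means a hypothesis is judged against whichever reference it is closest to, so additional references can only lower (improve) its score.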

External Dependencies

Dependency	Type	Purpose
nltk	Python package	BLEU computation, word tokenization
pyter	Python package	TER (Translation Edit Rate) computation
bert_score	Python package	BERTScore computation
metrics/bleurt/	Python package (local)	BLEURT learned metric
metrics/multi-bleu-detok.perl	Perl script	Standard multi-reference BLEU
metrics/meteor-1.5/meteor-1.5.jar	Java JAR	METEOR metric computation
razdel	Python package	Russian tokenization (for `lng=ru`)
tabulate	Python package	Pretty-printing results table

These external tools can be downloaded using bash eval/download_evalscript.sh.

Input / Output

Direction	Description
Input	Reference text file(s) (a single file for E2E, or numbered files `reference0`...`reference{N-1}` for WebNLG/DART), plus a hypothesis text file (one prediction per line)
Output	Dictionary of metric scores (printed as a formatted table via `tabulate`, also available as a return value from `run()`)
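
The numbered-reference layout can be read with a few lines of standard-library Python. This is a sketch, not the script's parse() function: the helper names are hypothetical, and it assumes that the reference index is appended directly to the base path and that every reference file is line-aligned with the hypothesis file.

```python
import os
import tempfile


def load_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel_files(refs_base, hyps_path, num_refs):
    """Return (hypotheses, references grouped per hypothesis)."""
    hypotheses = load_lines(hyps_path)
    per_file = []
    for index in range(num_refs):
        lines = load_lines(f"{refs_base}{index}")  # e.g. reference0, reference1
        if len(lines) != len(hypotheses):
            raise ValueError(f"{refs_base}{index} is not line-aligned with {hyps_path}")
        per_file.append(lines)
    # Transpose so that grouped[i] holds every reference for hypotheses[i].
    grouped = [list(group) for group in zip(*per_file)]
    return hypotheses, grouped


# Small demonstration on temporary files with made-up sentences.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "reference")
    for index, lines in enumerate([["a b", "c d"], ["a b c", "c d e"]]):
        with open(f"{base}{index}", "w", encoding="utf-8") as handle:
            handle.write("\n".join(lines) + "\n")
    hyps_file = os.path.join(tmp, "predictions.txt")
    with open(hyps_file, "w", encoding="utf-8") as handle:
        handle.write("a b\nc d e\n")
    hypotheses, grouped = load_parallel_files(base, hyps_file, num_refs=2)
```

Grouping references per hypothesis is convenient for metrics that score each prediction against all of its references at once.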

Example

# Evaluate E2E predictions with BLEU, METEOR, and TER
python eval/eval.py \
    -R output/e2e/references \
    -H output/e2e/predictions.txt \
    -nr 4 \
    -m bleu,meteor,ter

# Example output:
#   BLEU    BLEU NLTK    METEOR    TER
# ------  ----------  --------  -----
#  68.2        0.69      0.46    0.41

Metadata

Field	Value
Source	microsoft/LoRA
Type	Wrapper Doc
Last Updated	2026-02-10
