Implementation:Microsoft LoRA NLG Eval Script

From Leeroopedia
Revision as of 15:43, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_LoRA_NLG_Eval_Script.md)


Overview

NLG_Eval_Script is the evaluation wrapper that computes automatic NLG metrics (BLEU, METEOR, TER, chrF++, BERTScore, BLEURT) on decoded prediction and reference files. It orchestrates multiple external metric implementations (Perl scripts, Java JARs, Python libraries) through a unified Python interface.

Type

Wrapper Doc

Source

  • examples/NLG/eval/eval.py (lines 270-364)

CLI Signature

python eval/eval.py \
    -R <reference_path> -H <hypothesis_path> \
    -nr <num_refs> -m bleu,meteor,ter,chrf++,bert,bleurt \
    [-lng <language>] [-nc <ncorder>] [-nw <nworder>] [-b <beta>]

Argument reference:

| Argument | Type | Default | Description |
|---|---|---|---|
| -R / --reference | str | required | Path to reference file(s). For multi-reference, this is the base path (files reference0, reference1, ... are expected). |
| -H / --hypothesis | str | required | Path to hypothesis (prediction) file |
| -nr / --num_refs | int | 4 | Number of reference files |
| -m / --metrics | str | bleu,meteor,ter,chrf++,bert,bleurt | Comma-separated list of metrics to compute |
| -lng / --language | str | en | Evaluation language |
| -nc / --ncorder | int | 6 | chrF character n-gram order |
| -nw / --nworder | int | 2 | chrF word n-gram order |
| -b / --beta | float | 2.0 | chrF beta parameter |
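
The flags and defaults listed above can be mirrored with a small argparse parser. This is a minimal sketch built only from the argument reference on this page; it is not copied from eval.py, and the function name build_parser and the help strings are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags and defaults (illustrative only)."""
    p = argparse.ArgumentParser(description="NLG evaluation CLI (sketch)")
    p.add_argument("-R", "--reference", required=True,
                   help="reference file, or base path for numbered reference files")
    p.add_argument("-H", "--hypothesis", required=True,
                   help="hypothesis (prediction) file")
    p.add_argument("-nr", "--num_refs", type=int, default=4)
    p.add_argument("-m", "--metrics", default="bleu,meteor,ter,chrf++,bert,bleurt")
    p.add_argument("-lng", "--language", default="en")
    p.add_argument("-nc", "--ncorder", type=int, default=6)
    p.add_argument("-nw", "--nworder", type=int, default=2)
    p.add_argument("-b", "--beta", type=float, default=2.0)
    return p


# Parse an example command line; the paths are placeholders.
args = build_parser().parse_args(
    ["-R", "refs/reference", "-H", "hyps.txt", "-m", "bleu,ter"]
)
print(args.num_refs, args.metrics.split(","), args.beta)
```

Unspecified options fall back to the documented defaults, so only -R and -H must be given.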

Key Internal Function

run()

def run(refs_path, hyps_path, num_refs, lng='en',
        metrics='bleu,meteor,chrf++,ter,bert,bleurt',
        ncorder=6, nworder=2, beta=2):
    ...  # body omitted; see examples/NLG/eval/eval.py

Returns: dict containing metric scores. Possible keys:

| Key | Type | Description |
|---|---|---|
| 'bleu' | float | Multi-BLEU score (Perl script) |
| 'bleu_nltk' | float | NLTK corpus BLEU with smoothing |
| 'meteor' | float | METEOR score (Java JAR) |
| 'chrf++' | float | chrF++ total F-score |
| 'ter' | float | Translation Edit Rate |
| 'bert_precision' | float | BERTScore precision |
| 'bert_recall' | float | BERTScore recall |
| 'bert_f1' | float | BERTScore F1 |
| 'bleurt' | float | BLEURT score (English only) |

The function first calls parse(refs_path, hyps_path, num_refs, lng) to load and tokenize all references and hypotheses, then dispatches to individual metric functions based on the requested metrics list.
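
The dispatch step described above can be pictured as a lookup from metric name to scoring function. The sketch below is an assumption about the general shape, not the code in eval.py: run_sketch, the scorers mapping, and the exact_match stand-in scorer are all hypothetical names introduced for illustration.

```python
def run_sketch(references, hypotheses, metrics, scorers):
    """Call one scorer per requested metric name and collect the results in a dict.

    `scorers` maps a metric name to a callable(references, hypotheses) -> float.
    """
    requested = [m.strip().lower() for m in metrics.split(",") if m.strip()]
    results = {}
    for name in requested:
        if name not in scorers:
            raise ValueError(f"unknown metric: {name}")
        results[name] = scorers[name](references, hypotheses)
    return results


def exact_match(references, hypotheses):
    """Hypothetical stand-in scorer: share of hypotheses equal to any of their references."""
    hits = sum(hyp in refs for refs, hyp in zip(references, hypotheses))
    return hits / len(hypotheses)


refs = [["a cat sat", "the cat sat"], ["hello world"]]
hyps = ["the cat sat", "hello there"]
scores = run_sketch(refs, hyps, "exact", {"exact": exact_match})
print(scores)  # {'exact': 0.5}
```

Keeping the scorers in a mapping means that an unsupported metric name fails loudly instead of being silently skipped.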

Individual Metric Functions

  • bleu_score(refs_path, hyps_path, num_refs) -- Calls perl metrics/multi-bleu-detok.perl via subprocess.
  • bleu_nltk(references, hypothesis) -- Computes NLTK corpus BLEU with Method 3 smoothing.
  • meteor_score(references, hypothesis, num_refs, lng) -- Calls java -jar metrics/meteor-1.5/meteor-1.5.jar via subprocess.
  • chrF_score(references, hypothesis, num_refs, nworder, ncorder, beta) -- Uses internal computeChrF function.
  • ter_score(references, hypothesis, num_refs) -- Uses the pyter library to compute TER for each hypothesis-reference pair, taking the minimum TER across references.
  • bert_score_(references, hypothesis, lng) -- Uses the bert_score library's score() function.
  • bleurt(references, hypothesis, num_refs, checkpoint) -- Uses the BLEURT scorer with a pretrained checkpoint (metrics/bleurt/bleurt-base-128).
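
To illustrate the "minimum across references" step used by ter_score, here is a simplified sketch. It is a stand-in, not pyter: it uses plain word-level edit distance (insertions, deletions, substitutions) and omits the block shifts that real TER allows. The function names are hypothetical.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance, computed one row at a time."""
    prev = list(range(len(ref_tokens) + 1))
    for i, h in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, r in enumerate(ref_tokens, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def min_edit_rate(hypothesis, references):
    """Edit rate against each reference (edits / reference length); keep the lowest."""
    hyp = hypothesis.split()
    rates = []
    for ref in references:
        ref_tokens = ref.split()
        rates.append(word_edit_distance(hyp, ref_tokens) / max(len(ref_tokens), 1))
    return min(rates)


best = min_edit_rate(
    "the cat sat on mat",
    ["the cat sat on the mat", "a cat is on a mat"],
)
print(round(best, 4))  # 0.1667
```

The first reference needs one edit over six words and the second needs three, so the first reference determines the score.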

External Dependencies

| Dependency | Type | Purpose |
|---|---|---|
| nltk | Python package | BLEU computation, word tokenization |
| pyter | Python package | TER (Translation Edit Rate) computation |
| bert_score | Python package | BERTScore computation |
| metrics/bleurt/ | Python package (local) | BLEURT learned metric |
| metrics/multi-bleu-detok.perl | Perl script | Standard multi-reference BLEU |
| metrics/meteor-1.5/meteor-1.5.jar | Java JAR | METEOR metric computation |
| razdel | Python package | Russian tokenization (for lng=ru) |
| tabulate | Python package | Pretty-printing results table |

These external tools can be downloaded using bash eval/download_evalscript.sh.

Input / Output

Input:
  • Reference text file(s) (single file for E2E, or numbered files reference0...reference{N-1} for WebNLG/DART)
  • Hypothesis text file (one prediction per line)

Output: Dictionary of metric scores (printed as a formatted table via tabulate, also available as a return value from run())
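
The numbered-reference layout can be read as in the sketch below, which groups line i of every reference file into the reference set for example i. It assumes that the numbered files are named by appending the index to the base path and that all files have the same number of lines; the function names are hypothetical and this is not the parse() function from eval.py.

```python
import tempfile
from pathlib import Path


def load_multi_references(base_path, num_refs):
    """Read base_path0 ... base_path{num_refs-1} and group the lines per example."""
    columns = []
    for i in range(num_refs):
        text = Path(f"{base_path}{i}").read_text(encoding="utf-8")
        columns.append(text.splitlines())
    # Transpose: one list of references per example instead of one list per file.
    return [list(group) for group in zip(*columns)]


def load_hypotheses(path):
    """Read one prediction per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()


# Small demonstration with two temporary reference files.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "reference"
    Path(f"{base}0").write_text("a b\nc d\n", encoding="utf-8")
    Path(f"{base}1").write_text("a b .\nc d .\n", encoding="utf-8")
    grouped = load_multi_references(base, 2)

print(grouped)  # [['a b', 'a b .'], ['c d', 'c d .']]
```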

Example

# Evaluate E2E predictions with BLEU, METEOR, and TER
# (with -nr 4, -R is a base path: files references0 ... references3 are expected)
python eval/eval.py \
    -R output/e2e/references \
    -H output/e2e/predictions.txt \
    -nr 4 \
    -m bleu,meteor,ter

# Example output:
#   BLEU    BLEU NLTK    METEOR    TER
# ------  -----------  --------  -----
#   68.2         0.69      0.46   0.41

Metadata

| Field | Value |
|---|---|
| Source | microsoft/LoRA |
| Type | Wrapper Doc |
| Last Updated | 2026-02-10 |
