Principle:Microsoft LoRA NLG Evaluation Metrics

Overview

NLG Evaluation Metrics is the principle of assessing natural language generation quality using a comprehensive suite of automatic metrics. The evaluation framework computes multiple complementary scores -- spanning n-gram overlap, edit distance, character-level matching, and neural embedding similarity -- to provide a holistic view of generation quality from different linguistic perspectives.

Description

BLEU (Bilingual Evaluation Understudy)

BLEU measures n-gram precision between the hypothesis and reference texts. For each n from 1 to 4, it computes the fraction of hypothesis n-grams that also appear in the references, clipping each n-gram's count at the maximum number of times it occurs in any single reference. The four precisions are combined using a geometric mean, and a brevity penalty lowers the score of outputs shorter than the references.
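The computation can be illustrated with a minimal pure-Python sketch. It is not the Perl or NLTK implementation: it scores a single pre-tokenized sentence rather than a corpus, applies no smoothing, and the function names (ngrams, sentence_bleu_sketch) are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu_sketch(hypothesis, references, max_n=4):
    """Unsmoothed BLEU for one tokenized hypothesis (illustrative only)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        # Clip each hypothesis n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: compare against the reference length closest to the hypothesis.
    hyp_len = len(hypothesis)
    ref_len = min((len(r) for r in references), key=lambda n: (abs(n - hyp_len), n))
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

The early return on a zero precision shows why smoothing matters for short outputs: a sentence with no matching 4-gram would otherwise score 0 regardless of its lower-order overlap.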

The evaluation framework computes two BLEU variants:

Multi-BLEU -- Perl-based implementation (multi-bleu-detok.perl) following the original BLEU paper. This is the standard reporting metric.
NLTK BLEU -- Python-based corpus BLEU from the NLTK library with Method 3 smoothing, used as a secondary reference.

METEOR (Metric for Evaluation of Translation with Explicit ORdering)

METEOR computes a harmonic mean of unigram precision and recall between hypothesis and reference, with recall weighted more heavily. Unlike BLEU, METEOR considers:

Exact matches -- Direct word overlap.
Stemmed matches -- Words sharing the same stem.
Synonym matches -- Words that are synonyms via WordNet.
Word order -- A penalty for word order differences (fragmentation penalty).

METEOR typically correlates better with human judgments than BLEU alone, particularly for single-reference evaluation.
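The scoring formula can be sketched in a few lines of Python. This is a simplified illustration, not the reference implementation: it uses exact matches only (no stemming or WordNet synonyms) and a greedy left-to-right alignment, whereas real METEOR selects the alignment with the fewest chunks. The function name meteor_sketch is hypothetical, and the parameter values (alpha=0.9, beta=3, gamma=0.5) are assumed to follow the commonly used defaults of the original formulation.

```python
def meteor_sketch(hypothesis, reference, alpha=0.9, beta=3.0, gamma=0.5):
    """Exact-match METEOR approximation on tokenized input (illustrative only)."""
    used = [False] * len(reference)
    alignment = []  # (hypothesis index, reference index) pairs, in hypothesis order
    for i, word in enumerate(hypothesis):
        # Greedy choice: take the first unused reference position with the same word.
        for j, ref_word in enumerate(reference):
            if not used[j] and word == ref_word:
                used[j] = True
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(hypothesis)
    recall = matches / len(reference)
    # Weighted harmonic mean; alpha close to 1 weights recall more heavily.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both texts.
    chunks = 1
    for (i_prev, j_prev), (i, j) in zip(alignment, alignment[1:]):
        if i != i_prev + 1 or j != j_prev + 1:
            chunks += 1

    # Fragmentation penalty grows as the matches split into more chunks.
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)
```

With these settings a reordered hypothesis keeps the same precision and recall as the original ordering but scores lower, because its matches fall into more chunks.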

TER (Translation Edit Rate)

TER measures the minimum number of edits (insertions, deletions, substitutions, and shifts of contiguous word sequences) needed to transform the hypothesis into the reference, normalized by the reference length. Lower TER values indicate better quality, and the score can exceed 1 when the hypothesis needs more edits than the reference has words. Unlike BLEU, which is precision-based, TER measures edit-based similarity and penalizes word order errors directly, since each shift of a contiguous block counts as one edit.
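As a rough illustration, the sketch below computes a word-level edit distance normalized by the reference length. It deliberately omits the shift operation, so it is plain word error rate and only an upper bound on true TER. The function name ter_no_shift_sketch is hypothetical.

```python
def ter_no_shift_sketch(hypothesis, reference):
    """Word-level edit distance / reference length, without shifts (illustrative only)."""
    if not reference:
        raise ValueError("reference must contain at least one token")

    # prev[j] = edits needed to turn the hypothesis prefix seen so far
    # into the first j reference tokens.
    prev = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        curr = [i]  # deleting all i hypothesis tokens yields an empty reference prefix
        for j, ref_word in enumerate(reference, start=1):
            substitution_cost = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,                      # delete the hypothesis token
                curr[j - 1] + 1,                  # insert the reference token
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1] / len(reference)
```

For example, the hypothesis "the cat sat" against the reference "the cat sat down" needs one insertion over four reference words, giving 0.25.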

chrF++ (Character n-gram F-score)

chrF++ extends character-level n-gram matching with optional word-level n-grams. It computes:

Character n-gram precision and recall (default order: 6).
Word n-gram precision and recall (default order: 2).
F-score using a configurable beta parameter (default: 2, weighting recall more heavily).

Because most of its signal comes from character n-grams, chrF++ is robust to morphological variation and needs no language-specific tokenizer, making it useful for morphologically rich languages and as a complement to word-level metrics.
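The three components listed above can be sketched as follows. This is a simplified approximation rather than the reference implementation: it assumes whitespace is dropped before extracting character n-grams, splits words on whitespace only, averages precision and recall over all n-gram orders before applying the F-beta formula, and returns a value in [0, 1] instead of a percentage. The names _ngram_counts and chrf_pp_sketch are hypothetical.

```python
from collections import Counter


def _ngram_counts(items, n):
    """Count n-grams over a sequence of characters or words."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def chrf_pp_sketch(hypothesis, reference, char_order=6, word_order=2, beta=2.0):
    """Approximate chrF++ for two raw strings (illustrative only)."""
    units = [
        # Character n-grams, with whitespace removed (an assumption of this sketch).
        (list("".join(hypothesis.split())), list("".join(reference.split())), char_order),
        # Word n-grams, using plain whitespace splitting.
        (hypothesis.split(), reference.split(), word_order),
    ]
    precisions, recalls = [], []
    for hyp_items, ref_items, max_order in units:
        for n in range(1, max_order + 1):
            hyp_counts = _ngram_counts(hyp_items, n)
            ref_counts = _ngram_counts(ref_items, n)
            hyp_total = sum(hyp_counts.values())
            ref_total = sum(ref_counts.values())
            if hyp_total == 0 or ref_total == 0:
                continue  # skip orders longer than either text
            # Counter intersection keeps the minimum count of each shared n-gram.
            matched = sum((hyp_counts & ref_counts).values())
            precisions.append(matched / hyp_total)
            recalls.append(matched / ref_total)

    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    beta_sq = beta ** 2
    # beta = 2 weights recall more heavily than precision.
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

A pair such as "the cats sat" and "the cat sat" shares no word bigram containing the inflected word but still shares most character n-grams, which is why the score stays well above zero.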

BERTScore

BERTScore uses contextual embeddings from a pretrained BERT model to compute token-level similarity between hypothesis and reference. For each token in the hypothesis, the most similar token in the reference is found (and vice versa), producing:

Precision -- Average maximum cosine similarity for hypothesis tokens.
Recall -- Average maximum cosine similarity for reference tokens.
F1 -- Harmonic mean of precision and recall.

BERTScore captures semantic similarity that n-gram metrics miss, such as paraphrase recognition and synonym handling.
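The greedy matching step can be sketched independently of any model. In the example below the token vectors are hypothetical toy inputs standing in for contextual BERT embeddings; the sketch also leaves out optional refinements such as IDF weighting and baseline rescaling. The function names cosine and greedy_match_scores are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(hyp_vectors, ref_vectors):
    """Return (precision, recall, f1) from greedy token matching (illustrative only)."""
    # sim[i][j] = similarity of hypothesis token i and reference token j.
    sim = [[cosine(h, r) for r in ref_vectors] for h in hyp_vectors]

    # Precision: each hypothesis token takes its best match in the reference.
    precision = sum(max(row) for row in sim) / len(hyp_vectors)
    # Recall: each reference token takes its best match in the hypothesis.
    recall = sum(
        max(sim[i][j] for i in range(len(hyp_vectors)))
        for j in range(len(ref_vectors))
    ) / len(ref_vectors)

    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1


# Toy 2-dimensional "embeddings" (hypothetical values, not model output).
hyp = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
precision, recall, f1 = greedy_match_scores(hyp, ref)
```

Here every hypothesis token has an exact counterpart in the reference, so precision is 1, while the third reference token is only partially covered, which pulls recall and F1 below 1.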

BLEURT (Bilingual Evaluation Understudy with Representations from Transformers)

BLEURT is a learned evaluation metric built by fine-tuning a BERT model on human quality judgments. It takes a (reference, hypothesis) pair as input and outputs a scalar quality score. Because it is trained to mimic human evaluation behavior, BLEURT can capture quality distinctions that rule-based metrics miss.

Theoretical Basis

No single metric perfectly captures all dimensions of NLG quality. The multi-metric approach addresses this by combining:

Surface-level overlap (BLEU, chrF++) -- Measures lexical fidelity.
Edit-based distance (TER) -- Measures structural similarity.
Semantic matching (METEOR, BERTScore) -- Captures meaning preservation.
Learned quality (BLEURT) -- Approximates human judgment.

How well each individual metric correlates with human judgments varies by task and domain. Reporting multiple metrics lets readers check whether an improvement is consistent across evaluation dimensions rather than confined to a single score.

Metadata

Field	Value
Source	microsoft/LoRA
Domains	Evaluation, NLG
Type	External Tool Doc
Last Updated	2026-02-10
