Principle:Microsoft LoRA NLG Evaluation Metrics

From Leeroopedia
Revision as of 17:55, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Microsoft_LoRA_NLG_Evaluation_Metrics.md)


Overview

NLG Evaluation Metrics is the principle of assessing natural language generation quality using a comprehensive suite of automatic metrics. The evaluation framework computes multiple complementary scores -- spanning n-gram overlap, edit distance, character-level matching, and neural embedding similarity -- to provide a holistic view of generation quality from different linguistic perspectives.

Description

BLEU (Bilingual Evaluation Understudy)

BLEU measures modified n-gram precision between the hypothesis and reference texts. For each order n = 1, 2, 3, 4, it computes the fraction of hypothesis n-grams that also appear in the references, clipping each n-gram's count to the maximum number of times it occurs in any single reference, and combines the four precisions with a geometric mean. A brevity penalty discourages outputs shorter than the references.
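The computation described above can be sketched in pure Python. This is a minimal, unsmoothed illustration that assumes whitespace tokenization; the function names `ngrams` and `corpus_bleu` are hypothetical, and the sketch is not the code of either BLEU implementation used by the framework, so its scores will not necessarily match theirs.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references_list, max_n=4):
    """Unsmoothed corpus-level BLEU on whitespace-tokenized strings.

    hypotheses: list of hypothesis strings.
    references_list: one list of reference strings per hypothesis.
    """
    matched = [0] * max_n  # clipped n-gram matches per order
    total = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, refs in zip(hypotheses, references_list):
        hyp_tokens = hyp.split()
        ref_tokens = [r.split() for r in refs]
        hyp_len += len(hyp_tokens)
        # Effective reference length: the reference closest in length
        # to the hypothesis (ties go to the shorter reference).
        ref_len += min((abs(len(r) - len(hyp_tokens)), len(r)) for r in ref_tokens)[1]

        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            # Clip each n-gram by its maximum count in any single reference.
            max_ref = Counter()
            for r in ref_tokens:
                max_ref |= ngrams(r, n)
            matched[n - 1] += sum((hyp_counts & max_ref).values())
            total[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

A hypothesis that is a strict prefix of its reference keeps perfect n-gram precision in this sketch, so any drop in its score comes from the brevity penalty alone.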

The evaluation framework computes two BLEU variants:

  • Multi-BLEU -- Perl-based implementation (multi-bleu-detok.perl) following the original BLEU paper. This is the standard reporting metric.
  • NLTK BLEU -- Python-based corpus BLEU from the NLTK library with Method 3 smoothing, used as a secondary reference.

METEOR (Metric for Evaluation of Translation with Explicit ORdering)

METEOR computes a harmonic mean of unigram precision and recall between hypothesis and reference, with recall weighted more heavily. Unlike BLEU, METEOR considers:

  • Exact matches -- Direct word overlap.
  • Stemmed matches -- Words sharing the same stem.
  • Synonym matches -- Words that are synonyms via WordNet.
  • Word order -- A penalty for word order differences (fragmentation penalty).

METEOR typically correlates better with human judgments than BLEU alone, particularly for single-reference evaluation.
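The way the recall-weighted mean and the fragmentation penalty interact can be sketched as follows. This is a simplified illustration, not the framework's METEOR: it uses exact matches only (no stemming or WordNet synonyms), a greedy alignment, and the constants of the original METEOR formulation (recall weighted 9:1, penalty 0.5 × (chunks / matches)³). The name `meteor_exact` is hypothetical.

```python
def meteor_exact(hypothesis, reference):
    """Simplified METEOR-style score using exact word matches only."""
    hyp = hypothesis.split()
    ref = reference.split()

    # Greedy one-to-one alignment (a simplification): each hypothesis token
    # takes the first unused reference position holding the same word.
    used = set()
    alignment = []  # (hypothesis index, reference index) pairs
    for i, word in enumerate(hyp):
        for j, ref_word in enumerate(ref):
            if j not in used and word == ref_word:
                used.add(j)
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(hyp)
    recall = matches / len(ref)
    # Harmonic mean with recall weighted nine times more than precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)

    # A chunk is a run of matches that are adjacent and in the same order
    # in both texts; fewer chunks means better-preserved word order.
    chunks = 1
    for (i_prev, j_prev), (i_cur, j_cur) in zip(alignment, alignment[1:]):
        if i_cur != i_prev + 1 or j_cur != j_prev + 1:
            chunks += 1

    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)
```

In this sketch, reordering the words of a hypothesis leaves precision and recall unchanged but increases the chunk count, so only the fragmentation penalty lowers the score.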

TER (Translation Edit Rate)

TER measures the minimum number of edits (insertions, deletions, substitutions, and shifts of contiguous word sequences) needed to transform the hypothesis into the reference, normalized by the reference length. Lower TER values indicate better quality. Unlike BLEU, which is precision-based, TER captures edit-based similarity. Because a shift moves a whole block of words for the cost of one edit, TER penalizes reordering less harshly than a plain word-level edit distance would.
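The edit-counting core of this metric can be sketched with a word-level edit distance. The sketch below deliberately omits the shift operation, so it computes a word error rate normalized by reference length and serves as an upper bound on TER rather than TER itself. The name `word_edit_rate` is hypothetical.

```python
def word_edit_rate(hypothesis, reference):
    """Word-level edit distance (no shifts) divided by reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edits needed to turn the hypothesis tokens seen so
    # far (before the current one) into the first j reference tokens.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,             # delete h from the hypothesis
                cur[j - 1] + 1,          # insert r into the hypothesis
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)
```

A hypothesis that differs from its reference only by a moved block of words is charged several word edits here, whereas full TER would count the move as a single shift.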

chrF++ (Character n-gram F-score)

chrF++ extends character-level n-gram matching with optional word-level n-grams. It computes:

  • Character n-gram precision and recall (default order: 6).
  • Word n-gram precision and recall (default order: 2).
  • F-score using a configurable beta parameter (default: 2, weighting recall more heavily).

chrF++ is robust to morphological variation and does not require tokenization, making it useful for morphologically rich languages and as a complement to word-level metrics.
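The three components listed above can be combined as in the following sketch. It is a simplified illustration rather than a reference implementation: it strips spaces before extracting character n-grams, averages precision and recall uniformly over all n-gram orders, and skips orders longer than either text. Scores from established chrF++ tools may differ. The names `chrf_pp` and `_ngram_counts` are hypothetical.

```python
from collections import Counter


def _ngram_counts(items, n):
    """Count the n-grams (as tuples) in a sequence of characters or words."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def chrf_pp(hypothesis, reference, char_order=6, word_order=2, beta=2.0):
    """Simplified chrF++-style F-score in the range [0, 1]."""
    units = [
        # Character n-grams over the text with spaces removed (assumption).
        (list(hypothesis.replace(" ", "")), list(reference.replace(" ", "")), char_order),
        # Word n-grams over whitespace-separated tokens.
        (hypothesis.split(), reference.split(), word_order),
    ]
    precisions, recalls = [], []
    for hyp_items, ref_items, max_order in units:
        for n in range(1, max_order + 1):
            hyp_counts = _ngram_counts(hyp_items, n)
            ref_counts = _ngram_counts(ref_items, n)
            if not hyp_counts or not ref_counts:
                continue  # this order is longer than one of the texts
            overlap = sum((hyp_counts & ref_counts).values())
            precisions.append(overlap / sum(hyp_counts.values()))
            recalls.append(overlap / sum(ref_counts.values()))

    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Inflected variants such as "walking dogs" and "walked dog" share no whole words, yet they still receive a non-zero score here through shared character n-grams, which illustrates the robustness to morphological variation.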

BERTScore

BERTScore uses contextual embeddings from a pretrained BERT model to compute token-level similarity between hypothesis and reference. For each token in the hypothesis, the most similar token in the reference is found (and vice versa), producing:

  • Precision -- Average maximum cosine similarity for hypothesis tokens.
  • Recall -- Average maximum cosine similarity for reference tokens.
  • F1 -- Harmonic mean of precision and recall.

BERTScore captures semantic similarity that n-gram metrics miss, such as paraphrase recognition and synonym handling.
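The greedy matching step can be illustrated without a neural model by operating on token vectors directly. The sketch below assumes the embeddings have already been computed and are non-zero; it omits refinements such as importance weighting or score rescaling that BERTScore implementations may apply. The names `cosine` and `greedy_match_scores` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(hyp_vectors, ref_vectors):
    """Return (precision, recall, f1) from greedy cosine matching.

    hyp_vectors / ref_vectors: one embedding (list of floats) per token.
    """
    # Precision: each hypothesis token is matched to its most similar
    # reference token, and the best similarities are averaged.
    precision = sum(
        max(cosine(h, r) for r in ref_vectors) for h in hyp_vectors
    ) / len(hyp_vectors)
    # Recall: the same procedure from the reference side.
    recall = sum(
        max(cosine(r, h) for h in hyp_vectors) for r in ref_vectors
    ) / len(ref_vectors)
    if precision + recall <= 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With real contextual embeddings, a synonym in the hypothesis would still find a highly similar reference token, which is how this matching rewards paraphrases that exact n-gram overlap would miss.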

BLEURT (Bilingual Evaluation Understudy with Representations from Transformers)

BLEURT is a learned evaluation metric: a BERT model fine-tuned on human quality judgments. It takes a (reference, hypothesis) pair as input and outputs a scalar quality score. Because it is trained to mimic human evaluation behavior, BLEURT can capture quality distinctions that rule-based metrics miss.

Theoretical Basis

No single metric perfectly captures all dimensions of NLG quality. The multi-metric approach addresses this by combining:

  • Surface-level overlap (BLEU, chrF++) -- Measures lexical fidelity.
  • Edit-based distance (TER) -- Measures structural similarity.
  • Semantic matching (METEOR, BERTScore) -- Captures meaning preservation.
  • Learned quality (BLEURT) -- Approximates human judgment.

The correlation between individual metrics and human judgments varies by task and domain. Reporting multiple metrics lets readers check whether an improvement is consistent across evaluation dimensions, keeping in mind that TER is the one metric here where lower is better.
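A consistency check of this kind can be sketched as a small helper that compares two systems metric by metric. The metric names used as dictionary keys, the `LOWER_IS_BETTER` set, and both function names are assumptions made for illustration; they are not part of the evaluation framework.

```python
# Assumption: every metric is "higher is better" except TER.
LOWER_IS_BETTER = {"TER"}


def compare_systems(baseline, candidate):
    """Return a per-metric verdict for the candidate relative to the baseline.

    baseline / candidate: dicts mapping metric name to score.
    Only metrics present in both dicts are compared.
    """
    verdicts = {}
    for name in baseline.keys() & candidate.keys():
        delta = candidate[name] - baseline[name]
        if name in LOWER_IS_BETTER:
            delta = -delta  # a drop in TER counts as an improvement
        if delta > 0:
            verdicts[name] = "better"
        elif delta < 0:
            verdicts[name] = "worse"
        else:
            verdicts[name] = "tie"
    return verdicts


def is_consistent_improvement(baseline, candidate):
    """True only if the candidate is better on every shared metric."""
    verdicts = compare_systems(baseline, candidate)
    return bool(verdicts) and all(v == "better" for v in verdicts.values())
```

A candidate that gains on overlap metrics but loses on a semantic or learned metric would fail this check, which is the kind of disagreement that multi-metric reporting is meant to surface.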

Metadata

Field          Value
Source         microsoft/LoRA
Domains        Evaluation, NLG
Type           External Tool Doc
Last Updated   2026-02-10
