
Implementation:Hiyouga LLaMA Factory SFT Metric

From Leeroopedia


Knowledge Sources
Domains: Evaluation Metrics, Supervised Fine-Tuning, NLP
Last Updated: 2026-02-06 19:00 GMT

Overview

This module provides evaluation metrics for supervised fine-tuning, including token-level accuracy, ROUGE scores, and BLEU-4.

Description

The module defines three components. eval_logit_processor reduces logits to argmax token IDs, so evaluation keeps one ID per position instead of a full vocabulary-sized score vector. ComputeAccuracy compares predicted token IDs against label IDs, skipping positions marked with IGNORE_INDEX, and computes the mean accuracy. ComputeSimilarity decodes predictions and labels, tokenizes the text with jieba, and computes ROUGE-1/2/L scores via rouge-chinese and BLEU-4 via NLTK. Both metric dataclasses support HuggingFace's batch_eval_metrics pattern by accumulating scores across batches and aggregating them through the _dump method.
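The masking step behind token-level accuracy can be illustrated with a minimal, pure-Python sketch. It is not the module's implementation: the helper name masked_token_accuracy is hypothetical, the value -100 for IGNORE_INDEX is an assumption, and predictions and labels are assumed to be already aligned position by position.

```python
IGNORE_INDEX = -100  # assumed sentinel value for masked label positions


def masked_token_accuracy(pred_ids, label_ids, ignore_index=IGNORE_INDEX):
    """Fraction of positions where pred == label, skipping ignored labels."""
    # Keep only the (prediction, label) pairs whose label is not masked.
    kept = [(p, l) for p, l in zip(pred_ids, label_ids) if l != ignore_index]
    if not kept:
        return 0.0  # no scorable tokens in this sequence
    return sum(p == l for p, l in kept) / len(kept)


preds = [5, 9, 2, 7, 7]
labels = [IGNORE_INDEX, 9, 2, 4, IGNORE_INDEX]
accuracy = masked_token_accuracy(preds, labels)  # 2 of 3 scored tokens match
```

Only the three unmasked positions are scored here, so the first and last predictions have no effect on the result.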

Usage

Use ComputeAccuracy for non-generative evaluation, where token-level predictions are compared against ground truth. Use ComputeSimilarity for generative evaluation with predict_with_generate, where decoded text is compared using ROUGE and BLEU metrics. Pass eval_logit_processor as the trainer's preprocess_logits_for_metrics argument to reduce GPU memory use during evaluation.
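To see why reducing logits to argmax IDs saves memory, the following sketch performs the same reduction on nested Python lists and compares element counts. It is an illustration only: argmax_last_dim is a hypothetical helper, and the batch, sequence, and vocabulary sizes are made-up values, not figures from LLaMA Factory.

```python
def argmax_last_dim(logits):
    """Map nested lists shaped (batch, seq_len, vocab) to IDs shaped (batch, seq_len)."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in sequence]
        for sequence in logits
    ]


logits = [
    [[0.1, 2.0, -1.0], [0.5, 0.2, 0.9]],
    [[3.0, 0.0, 0.0], [0.0, 0.0, 4.0]],
]
token_ids = argmax_last_dim(logits)

# Illustrative sizes (assumptions): values held before and after the reduction.
batch_size, seq_len, vocab_size = 8, 1024, 32000
logit_values = batch_size * seq_len * vocab_size
id_values = batch_size * seq_len
reduction_factor = logit_values // id_values  # equals vocab_size
```

The number of stored values shrinks by a factor equal to the vocabulary size, which is why the reduction is applied before predictions are gathered for metric computation.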

Code Reference

Source Location

Signature

def eval_logit_processor(
    logits: "torch.Tensor",
    labels: "torch.Tensor",
) -> "torch.Tensor"

@dataclass
class ComputeAccuracy:
    def __call__(
        self,
        eval_preds: "EvalPrediction",
        compute_result: bool = True,
    ) -> Optional[dict[str, float]]

@dataclass
class ComputeSimilarity:
    tokenizer: "PreTrainedTokenizer"

    def __call__(
        self,
        eval_preds: "EvalPrediction",
        compute_result: bool = True,
    ) -> Optional[dict[str, float]]

Import

from llamafactory.train.sft.metric import eval_logit_processor, ComputeAccuracy, ComputeSimilarity

I/O Contract

Inputs

Name | Type | Required | Description
logits (eval_logit_processor) | torch.Tensor | Yes | Model logits of shape (batch_size, seq_len, vocab_size) or a list/tuple thereof
labels (eval_logit_processor) | torch.Tensor | Yes | Label tensor (unused but required by HF API)
eval_preds (ComputeAccuracy) | EvalPrediction | Yes | Contains predictions (token IDs) and label_ids with IGNORE_INDEX for padding
eval_preds (ComputeSimilarity) | EvalPrediction | Yes | Contains predictions and label_ids as token ID arrays for decoding
tokenizer (ComputeSimilarity) | PreTrainedTokenizer | Yes | Tokenizer for decoding predictions and labels into text
compute_result | bool | No | If True (default), returns aggregated metrics; if False, accumulates for batch_eval_metrics
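The compute_result flag can be pictured with a small stand-in class. This is a sketch of the accumulation pattern only, under the assumption that scores are simply collected and averaged; RunningAccuracy is a hypothetical name and is not part of LLaMA Factory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningAccuracy:  # hypothetical stand-in, not the library class
    def __post_init__(self):
        self.scores = []  # per-sample scores collected so far

    def _dump(self) -> dict:
        # Aggregate everything seen so far, then reset for the next evaluation run.
        mean = sum(self.scores) / len(self.scores) if self.scores else 0.0
        self.scores = []
        return {"accuracy": mean}

    def __call__(self, batch_scores, compute_result: bool = True) -> Optional[dict]:
        self.scores.extend(batch_scores)
        if compute_result:
            return self._dump()
        return None  # intermediate batch: keep accumulating


metric = RunningAccuracy()
first = metric([1.0, 0.0], compute_result=False)  # intermediate batch
final = metric([1.0, 1.0], compute_result=True)   # last batch triggers aggregation
```

Intermediate calls return nothing, and only the final call yields the aggregated dictionary, after which the internal state is cleared.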

Outputs

Name | Type | Description
eval_logit_processor result | torch.Tensor | Argmax token IDs of shape (batch_size, seq_len)
ComputeAccuracy result | Optional[dict[str, float]] | Dictionary with "accuracy" key (mean over all non-padding tokens)
ComputeSimilarity result | Optional[dict[str, float]] | Dictionary with "rouge-1", "rouge-2", "rouge-l", "bleu-4" keys (scores in percentage)
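As a rough illustration of what a percentage-scaled "rouge-1" value represents, the sketch below computes a unigram-overlap F1 score with clipped counts. It is a simplified textbook definition written for this page, not the rouge-chinese implementation, whose details may differ; rouge1_f_percent is a hypothetical helper and whitespace splitting stands in for jieba tokenization.

```python
from collections import Counter


def rouge1_f_percent(hypothesis_tokens, reference_tokens):
    """Unigram-overlap F1 between two token lists, scaled to the 0-100 range."""
    if not hypothesis_tokens or not reference_tokens:
        return 0.0
    # Counter intersection keeps, per token, the smaller of the two counts.
    overlap = sum((Counter(hypothesis_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hypothesis_tokens)
    recall = overlap / len(reference_tokens)
    return round(200 * precision * recall / (precision + recall), 4)


score = rouge1_f_percent("the cat sat".split(), "the cat sat down".split())
```

In this example precision is 1.0 and recall is 0.75, so the F1 score lands between the two, expressed as a percentage rather than a fraction.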

Usage Examples

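# Using ComputeAccuracy for token-level evaluation
from llamafactory.train.sft.metric import ComputeAccuracy, eval_logit_processor
from llamafactory.train.sft.trainer import CustomSeq2SeqTrainer

metric_module = {
    "compute_metrics": ComputeAccuracy(),
    # Reduce logits to argmax token IDs before they are gathered for metrics
    "preprocess_logits_for_metrics": eval_logit_processor,
}

trainer = CustomSeq2SeqTrainer(
    ...,  # remaining trainer arguments are elided in the original
    **metric_module,
)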

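# Using ComputeSimilarity for generative evaluation
from llamafactory.train.sft.metric import ComputeSimilarity
from llamafactory.train.sft.trainer import CustomSeq2SeqTrainer

metric_module = {
    # `tokenizer` is the PreTrainedTokenizer already loaded for the model
    "compute_metrics": ComputeSimilarity(tokenizer=tokenizer),
}

trainer = CustomSeq2SeqTrainer(
    ...,  # remaining trainer arguments are elided in the original
    **metric_module,
)
# Illustrative metric output (scores are percentages):
# {"rouge-1": 45.2, "rouge-2": 22.1, "rouge-l": 40.3, "bleu-4": 18.7}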
