Implementation:Hiyouga LLaMA Factory SFT Metric

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Evaluation Metrics, Supervised Fine-Tuning, NLP
Last Updated	2026-02-06 19:00 GMT

Overview

This module provides evaluation metrics for supervised fine-tuning, including token-level accuracy, ROUGE scores, and BLEU-4.

Description

The module defines three components. eval_logit_processor reduces logits to argmax predictions, which saves memory during evaluation. ComputeAccuracy compares predicted token IDs against label IDs, skipping positions set to IGNORE_INDEX (such as padding), and reports the mean accuracy. ComputeSimilarity tokenizes decoded predictions and labels with jieba, then computes ROUGE-1/2/L scores via rouge-chinese and BLEU-4 via NLTK. Both metric dataclasses support HuggingFace's batch_eval_metrics pattern: they accumulate scores incrementally across batches and aggregate them through the _dump method.
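
As a rough illustration of the first two components, the sketch below uses NumPy in place of torch to reduce logits to argmax token IDs and then score them against labels while skipping ignored positions. The function names argmax_predictions and masked_token_accuracy are hypothetical, the value -100 for IGNORE_INDEX is an assumption, and any shifting of predictions relative to labels that the real module may apply is left out.

```python
import numpy as np

IGNORE_INDEX = -100  # assumed sentinel for positions excluded from scoring


def argmax_predictions(logits: np.ndarray) -> np.ndarray:
    """Collapse (batch, seq_len, vocab) logits to (batch, seq_len) token IDs."""
    return np.argmax(logits, axis=-1)


def masked_token_accuracy(preds: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of non-ignored positions where the prediction equals the label."""
    mask = labels != IGNORE_INDEX
    if not mask.any():
        return 0.0  # nothing to score in this batch
    return float(np.mean(preds[mask] == labels[mask]))


# One sequence of three positions over a vocabulary of three tokens.
logits = np.array([[[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.2, 0.3, 0.5]]])
labels = np.array([[1, 2, IGNORE_INDEX]])  # the last position is ignored

preds = argmax_predictions(logits)
accuracy = masked_token_accuracy(preds, labels)
```

Keeping only the argmax shrinks the tensor retained per batch by a factor of vocab_size, which is where the memory saving comes from.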

Usage

Use ComputeAccuracy for non-generative evaluation, where token-level predictions are compared against ground truth. Use ComputeSimilarity for generative evaluation with predict_with_generate, where decoded text is compared using ROUGE and BLEU metrics. Pass eval_logit_processor as preprocess_logits_for_metrics to reduce GPU memory usage during evaluation.
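
The choice between the two metrics can be pictured as a small dispatch on the evaluation mode. The sketch below is only an illustration: select_metric_module and the compute_accuracy flag are hypothetical names, the precedence of predict_with_generate is an assumption, and strings stand in for the real callables so the snippet runs without the library installed.

```python
def select_metric_module(predict_with_generate: bool, compute_accuracy: bool) -> dict[str, str]:
    """Return the trainer keyword names mapped to the component that would fill them."""
    metric_module: dict[str, str] = {}
    if predict_with_generate:
        # Generative evaluation: decoded text is scored with ROUGE and BLEU.
        metric_module["compute_metrics"] = "ComputeSimilarity"
    elif compute_accuracy:
        # Non-generative evaluation: argmax token IDs are scored against labels.
        metric_module["compute_metrics"] = "ComputeAccuracy"
        metric_module["preprocess_logits_for_metrics"] = "eval_logit_processor"
    return metric_module


generative = select_metric_module(predict_with_generate=True, compute_accuracy=False)
token_level = select_metric_module(predict_with_generate=False, compute_accuracy=True)
```

Note that eval_logit_processor appears only in the token-level branch, since generative evaluation already yields token IDs rather than logits.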

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/train/sft/metric.py
Lines: 1-134

Signature

def eval_logit_processor(
    logits: "torch.Tensor",
    labels: "torch.Tensor",
) -> "torch.Tensor"

@dataclass
class ComputeAccuracy:
    def __call__(
        self,
        eval_preds: "EvalPrediction",
        compute_result: bool = True,
    ) -> Optional[dict[str, float]]

@dataclass
class ComputeSimilarity:
    tokenizer: "PreTrainedTokenizer"

    def __call__(
        self,
        eval_preds: "EvalPrediction",
        compute_result: bool = True,
    ) -> Optional[dict[str, float]]

Import

from llamafactory.train.sft.metric import eval_logit_processor, ComputeAccuracy, ComputeSimilarity

I/O Contract

Inputs

Name	Type	Required	Description
logits (eval_logit_processor)	torch.Tensor	Yes	Model logits of shape (batch_size, seq_len, vocab_size) or a list/tuple thereof
labels (eval_logit_processor)	torch.Tensor	Yes	Label tensor (unused but required by HF API)
eval_preds (ComputeAccuracy)	EvalPrediction	Yes	Contains predictions (token IDs) and label_ids with IGNORE_INDEX for padding
eval_preds (ComputeSimilarity)	EvalPrediction	Yes	Contains predictions and label_ids as token ID arrays for decoding
tokenizer (ComputeSimilarity)	PreTrainedTokenizer	Yes	Tokenizer for decoding predictions and labels into text
compute_result	bool	No	If True (default), returns aggregated metrics; if False, accumulates for batch_eval_metrics
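
The effect of compute_result can be sketched with a small stand-in class. RunningAccuracy below is hypothetical and takes precomputed scores rather than an EvalPrediction; it only mimics the accumulate-then-dump behaviour described above and is not the module's actual code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunningAccuracy:
    """Hypothetical stand-in that mimics the accumulate-then-dump pattern."""

    scores: list[float] = field(default_factory=list)

    def _dump(self) -> Optional[dict[str, float]]:
        # Average everything gathered so far, then reset for the next evaluation run.
        result = {"accuracy": sum(self.scores) / len(self.scores)} if self.scores else None
        self.scores = []
        return result

    def __call__(self, batch_scores: list[float], compute_result: bool = True) -> Optional[dict[str, float]]:
        self.scores.extend(batch_scores)
        # Only a call with compute_result=True triggers aggregation.
        return self._dump() if compute_result else None


metric = RunningAccuracy()
first = metric([1.0, 0.0], compute_result=False)  # intermediate batch: accumulate only
final = metric([1.0, 1.0], compute_result=True)   # last batch: aggregate and reset
```

With batch_eval_metrics enabled, the trainer is expected to call the metric once per batch and set compute_result=True only on the last one, so intermediate calls return None.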

Outputs

Name	Type	Description
eval_logit_processor result	torch.Tensor	Argmax token IDs of shape (batch_size, seq_len)
ComputeAccuracy result	Optional[dict[str, float]]	Dictionary with "accuracy" key (mean over all non-padding tokens)
ComputeSimilarity result	Optional[dict[str, float]]	Dictionary with "rouge-1", "rouge-2", "rouge-l", "bleu-4" keys (scores in percentage)
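
To make the "scores in percentage" convention concrete, the sketch below computes a unigram-overlap F1 on pre-tokenized text and scales it to the 0-100 range. It is a simplified illustration of what a ROUGE-1 F-score measures, not the rouge-chinese implementation the module relies on; the function name unigram_f1_percent, the whitespace tokenization (in place of jieba), and the rounding to four decimals are all assumptions.

```python
from collections import Counter


def unigram_f1_percent(hypothesis: list[str], reference: list[str]) -> float:
    """Unigram-overlap F1 between two token lists, scaled to a percentage."""
    if not hypothesis or not reference:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(hypothesis) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hypothesis)
    recall = overlap / len(reference)
    return round(100 * 2 * precision * recall / (precision + recall), 4)


score = unigram_f1_percent("the cat sat".split(), "the cat sat down".split())
```

Here all three hypothesis tokens appear in the four-token reference, so precision is 1.0 and recall is 0.75.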

Usage Examples

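# Using ComputeAccuracy for token-level evaluation
from llamafactory.train.sft.metric import ComputeAccuracy, eval_logit_processor
from llamafactory.train.sft.trainer import CustomSeq2SeqTrainer  # assumed import path

# Keyword arguments forwarded to the trainer.
metric_module = {
    "compute_metrics": ComputeAccuracy(),
    "preprocess_logits_for_metrics": eval_logit_processor,  # keep argmax IDs, not full logits
}

trainer = CustomSeq2SeqTrainer(
    ...,
    **metric_module,
)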

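# Using ComputeSimilarity for generative evaluation
from llamafactory.train.sft.metric import ComputeSimilarity
from llamafactory.train.sft.trainer import CustomSeq2SeqTrainer  # assumed import path

# `tokenizer` is a PreTrainedTokenizer loaded beforehand; it decodes token IDs into text.
metric_module = {
    "compute_metrics": ComputeSimilarity(tokenizer=tokenizer),
}

trainer = CustomSeq2SeqTrainer(
    ...,
    **metric_module,
)
# ComputeSimilarity yields a dict with these keys (values below are illustrative):
# {"rouge-1": 45.2, "rouge-2": 22.1, "rouge-l": 40.3, "bleu-4": 18.7}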

Related Pages

Hiyouga_LLaMA_Factory_SFT_Trainer - CustomSeq2SeqTrainer that uses these metrics
Hiyouga_LLaMA_Factory_SFT_Workflow - Workflow that configures metric selection based on training arguments
Hiyouga_LLaMA_Factory_RM_Metric - Analogous ComputeAccuracy used for reward model evaluation
