Implementation:Hiyouga LLaMA Factory SFT Metric
| Knowledge Sources | |
|---|---|
| Domains | Evaluation Metrics, Supervised Fine-Tuning, NLP |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
This module provides evaluation metrics for supervised fine-tuning, including token-level accuracy, ROUGE scores, and BLEU-4.
Description
The module defines three components:
- eval_logit_processor reduces logits to argmax predictions, so the evaluation loop does not have to keep full vocabulary-sized logits in memory.
- ComputeAccuracy compares predicted token IDs against label IDs (ignoring IGNORE_INDEX padding) and computes mean accuracy.
- ComputeSimilarity tokenizes decoded predictions and labels with jieba, then computes ROUGE-1/2/L scores via rouge-chinese and BLEU-4 via NLTK.
Both metric dataclasses support HuggingFace's batch_eval_metrics pattern: scores accumulate across batches and are aggregated through the _dump method.
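To make the masking step concrete, here is a minimal sketch of token-level accuracy that skips ignored positions. It is not the module's code: the helper name masked_token_accuracy is hypothetical, plain lists stand in for arrays, the value -100 for IGNORE_INDEX is an assumption, and any alignment between predictions and labels is assumed to have happened beforehand.

```python
IGNORE_INDEX = -100  # assumed sentinel value marking positions to skip


def masked_token_accuracy(pred_ids, label_ids, ignore_index=IGNORE_INDEX):
    """Fraction of non-ignored positions where the prediction equals the label."""
    # Keep only (prediction, label) pairs whose label is not the sentinel.
    kept = [(p, l) for p, l in zip(pred_ids, label_ids) if l != ignore_index]
    if not kept:
        return 0.0  # nothing to score; the real module may behave differently
    return sum(p == l for p, l in kept) / len(kept)


preds = [5, 9, 2, 7, 0]
labels = [-100, 9, 2, 4, -100]
# Three positions are scored (9/9, 2/2, 7/4); two of them match.
print(masked_token_accuracy(preds, labels))
```

Only the labels decide which positions count, so whatever the model predicts at ignored positions has no effect on the score.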
Usage
Use ComputeAccuracy for non-generative evaluation, where token-level predictions are compared against ground truth. Use ComputeSimilarity for generative evaluation with predict_with_generate, where decoded text is compared using ROUGE and BLEU metrics. Pass eval_logit_processor as preprocess_logits_for_metrics to reduce GPU memory use during evaluation.
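The memory saving from the argmax reduction can be illustrated with a small sketch. Nested Python lists stand in for tensors here, and argmax_last_dim is a hypothetical helper rather than part of the module; the point is only that each position keeps one token ID instead of vocab_size scores.

```python
def argmax_last_dim(logits):
    """Map nested lists shaped (batch, seq_len, vocab) to (batch, seq_len) token IDs."""
    return [
        # For each position, pick the index of the highest score.
        [max(range(len(scores)), key=scores.__getitem__) for scores in sequence]
        for sequence in logits
    ]


# Toy logits: batch_size=2, seq_len=2, vocab_size=3.
logits = [
    [[0.1, 2.0, -1.0], [0.3, 0.2, 0.9]],
    [[1.5, 0.0, 0.4], [-2.0, -0.5, -3.0]],
]
token_ids = argmax_last_dim(logits)
print(token_ids)  # [[1, 2], [0, 1]]
```

With a realistic vocabulary of tens of thousands of entries, the same reduction shrinks what the evaluation loop has to retain per position by that factor.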
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/train/sft/metric.py
- Lines: 1-134
Signature
def eval_logit_processor(
    logits: "torch.Tensor",
    labels: "torch.Tensor",
) -> "torch.Tensor"

@dataclass
class ComputeAccuracy:
    def __call__(
        self,
        eval_preds: "EvalPrediction",
        compute_result: bool = True,
    ) -> Optional[dict[str, float]]

@dataclass
class ComputeSimilarity:
    tokenizer: "PreTrainedTokenizer"

    def __call__(
        self,
        eval_preds: "EvalPrediction",
        compute_result: bool = True,
    ) -> Optional[dict[str, float]]
Import
from llamafactory.train.sft.metric import eval_logit_processor, ComputeAccuracy, ComputeSimilarity
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| logits (eval_logit_processor) | torch.Tensor | Yes | Model logits of shape (batch_size, seq_len, vocab_size) or a list/tuple thereof |
| labels (eval_logit_processor) | torch.Tensor | Yes | Label tensor (unused, but required by the HuggingFace preprocess_logits_for_metrics signature) |
| eval_preds (ComputeAccuracy) | EvalPrediction | Yes | Contains predictions (token IDs) and label_ids with IGNORE_INDEX for padding |
| eval_preds (ComputeSimilarity) | EvalPrediction | Yes | Contains predictions and label_ids as token ID arrays for decoding |
| tokenizer (ComputeSimilarity) | PreTrainedTokenizer | Yes | Tokenizer for decoding predictions and labels into text |
| compute_result | bool | No | If True (default), returns aggregated metrics; if False, accumulates for batch_eval_metrics |
Outputs
| Name | Type | Description |
|---|---|---|
| eval_logit_processor result | torch.Tensor | Argmax token IDs of shape (batch_size, seq_len) |
| ComputeAccuracy result | Optional[dict[str, float]] | Dictionary with "accuracy" key (mean over all non-padding tokens) |
| ComputeSimilarity result | Optional[dict[str, float]] | Dictionary with "rouge-1", "rouge-2", "rouge-l", "bleu-4" keys (scores expressed as percentages, 0-100) |
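The compute_result contract explains why both return types are Optional. The sketch below shows one way such an accumulator can behave. RunningMean is a hypothetical stand-in, not the module's class, and it takes a plain list of per-sample scores instead of an EvalPrediction.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunningMean:
    """Hypothetical accumulator mirroring the compute_result contract."""

    scores: list = field(default_factory=list)

    def _dump(self) -> Optional[dict]:
        # Aggregate what has been collected, then reset for the next evaluation.
        result = {"accuracy": sum(self.scores) / len(self.scores)} if self.scores else None
        self.scores = []
        return result

    def __call__(self, batch_scores, compute_result: bool = True) -> Optional[dict]:
        self.scores.extend(batch_scores)
        if compute_result:
            return self._dump()
        return None  # intermediate batch: nothing to report yet


metric = RunningMean()
first = metric([1.0, 0.5], compute_result=False)  # intermediate batch
final = metric([0.0, 0.5], compute_result=True)   # last batch
print(first, final)  # None {'accuracy': 0.5}
```

Intermediate calls return None and only the final call returns the dictionary. Resetting the state after the final call lets the same metric object be reused for a later evaluation run.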
Usage Examples
# Using ComputeAccuracy for token-level evaluation
from llamafactory.train.sft.metric import ComputeAccuracy, eval_logit_processor
from llamafactory.train.sft.trainer import CustomSeq2SeqTrainer

metric_module = {
    "compute_metrics": ComputeAccuracy(),
    "preprocess_logits_for_metrics": eval_logit_processor,
}
trainer = CustomSeq2SeqTrainer(
    ...,  # remaining trainer arguments, elided here
    **metric_module,
)

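# Using ComputeSimilarity for generative evaluation
# (assumes `tokenizer` is a PreTrainedTokenizer loaded elsewhere)
from llamafactory.train.sft.metric import ComputeSimilarity
from llamafactory.train.sft.trainer import CustomSeq2SeqTrainer

metric_module = {
    "compute_metrics": ComputeSimilarity(tokenizer=tokenizer),
}
trainer = CustomSeq2SeqTrainer(
    ...,  # remaining trainer arguments, elided here
    **metric_module,
)
# During evaluation, ComputeSimilarity returns a dict such as (illustrative values):
# {"rouge-1": 45.2, "rouge-2": 22.1, "rouge-l": 40.3, "bleu-4": 18.7}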
Related Pages
- Hiyouga_LLaMA_Factory_SFT_Trainer - CustomSeq2SeqTrainer that uses these metrics
- Hiyouga_LLaMA_Factory_SFT_Workflow - Workflow that configures metric selection based on training arguments
- Hiyouga_LLaMA_Factory_RM_Metric - Analogous ComputeAccuracy used for reward model evaluation