Implementation:Hiyouga LLaMA Factory Evaluator
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmarking |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
MMLU-style multiple-choice benchmark evaluator that loads a language model, performs batched few-shot inference across subjects, and reports per-category accuracy scores.
Description
The Evaluator class initializes by parsing evaluation arguments, loading the tokenizer and model, and preparing the evaluation template and the token IDs of the four choices (A/B/C/D). The eval method iterates over every subject in the benchmark dataset and formats few-shot examples with EvalTemplate. For each batch, batch_inference reads the model's logits at the last prompt token, keeps only the entries for the four choice token IDs, applies softmax, and takes the highest-probability choice as the predicted answer. It runs under @torch.inference_mode() so that no autograd state is tracked. Correctness is accumulated per category in numpy arrays. Results are saved as a JSON file and a human-readable log file. The module-level run_eval function is a thin entry point that builds an Evaluator from command-line arguments and calls eval.
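The choice-selection step described above can be sketched without any model. The snippet below is a minimal illustration in plain Python, not the code in evaluator.py: predict_choice, the toy logits, and the token IDs 2, 3, 5, 7 are all hypothetical stand-ins, and the real method works on batched torch tensors.

```python
import math

CHOICES = ["A", "B", "C", "D"]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict_choice(vocab_logits, choice_token_ids):
    """Return the answer letter with the highest probability.

    vocab_logits: one logit per vocabulary token, taken at the position
        where the model would emit its answer.
    choice_token_ids: token IDs for "A", "B", "C", "D", in that order.
    """
    # Restrict the full-vocabulary logits to the four choice tokens.
    choice_logits = [vocab_logits[token_id] for token_id in choice_token_ids]
    probs = softmax(choice_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CHOICES[best], probs


# Toy vocabulary of 8 tokens; IDs 2, 3, 5, 7 are assumed to encode A-D.
toy_logits = [0.1, -1.0, 0.3, 2.5, 0.0, 1.2, -0.7, 0.9]
letter, probs = predict_choice(toy_logits, [2, 3, 5, 7])
print(letter)
```

Restricting the softmax to the four choice tokens means the prediction is always a valid letter, even when the model's most likely next token overall is something else.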
Usage
Use this evaluator to assess model performance on standardized multiple-choice benchmarks such as MMLU and CMMLU. It is invoked via the CLI evaluation command or the run_eval function, typically after training to measure the quality of the fine-tuned model.
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/eval/evaluator.py
- Lines: 1-158
Signature
class Evaluator:
    def __init__(self, args: Optional[dict[str, Any]] = None) -> None
    @torch.inference_mode()
    def batch_inference(self, batch_input: dict[str, "torch.Tensor"]) -> list[str]
    def eval(self) -> None
    def _save_results(
        self,
        category_corrects: dict[str, "NDArray"],
        results: dict[str, dict[int, str]]
    ) -> None

def run_eval() -> None
Import
from llamafactory.eval.evaluator import Evaluator, run_eval
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | Optional[dict[str, Any]] | No | Optional dictionary of arguments; if None, arguments are parsed from the command line via get_eval_args |
| eval_args.task | str | Yes | Evaluation task identifier in the format {task}_{split} (e.g., "mmlu_test") |
| eval_args.task_dir | str | Yes | Directory or path containing the benchmark dataset files and mapping.json |
| eval_args.n_shot | int | Yes | Number of few-shot examples to include from the training split |
| eval_args.batch_size | int | Yes | Number of examples per inference batch |
| eval_args.save_dir | Optional[str] | No | Directory to save results; if None, results are only printed to stdout |
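To make the task and n_shot inputs concrete, here is a minimal sketch of how a {task}_{split} identifier can be split and how a few-shot prompt can be assembled from training examples. The helper names (split_task, format_question, build_prompt) and the prompt layout are assumptions made for illustration; the actual wording is produced by EvalTemplate and may differ.

```python
def split_task(task):
    """Split an identifier such as "mmlu_test" into ("mmlu", "test")."""
    name, _, split = task.partition("_")
    if not name or not split:
        raise ValueError(f"expected '<task>_<split>', got {task!r}")
    return name, split


def format_question(example, with_answer):
    """Render one multiple-choice question (hypothetical layout)."""
    lines = [example["question"]]
    lines += [f"{letter}. {example[letter]}" for letter in "ABCD"]
    answer = f" {example['answer']}" if with_answer else ""
    lines.append("Answer:" + answer)
    return "\n".join(lines)


def build_prompt(target, support_set, n_shot):
    """Prefix the target question with up to n_shot solved examples."""
    shots = support_set[:n_shot]  # slicing clamps to the examples available
    blocks = [format_question(ex, with_answer=True) for ex in shots]
    blocks.append(format_question(target, with_answer=False))
    return "\n\n".join(blocks)


train = [
    {"question": "2 + 2 = ?", "A": "3", "B": "4", "C": "5", "D": "6", "answer": "B"},
    {"question": "3 * 3 = ?", "A": "9", "B": "6", "C": "12", "D": "8", "answer": "A"},
]
target = {"question": "10 / 2 = ?", "A": "2", "B": "4", "C": "5", "D": "8", "answer": "C"}

task_name, split = split_task("mmlu_test")
prompt = build_prompt(target, train, n_shot=5)
print(task_name, split)
print(prompt)
```

The prompt ends with an open "Answer:" so that the next token the model produces is expected to be one of the choice letters.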
Outputs
| Name | Type | Description |
|---|---|---|
| results.json | JSON file | Per-subject predictions mapping example index to predicted choice letter |
| results.log | text file | Per-category accuracy scores formatted as percentages |
| stdout | text | Printed score information with per-category accuracy |
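The two output files can be pictured with a small standalone sketch. It assumes correctness is stored as lists of 0/1 flags per category and that the log holds one "name: percentage" line per category; format_scores, save_results, the sample category names, and the exact column layout are illustrative assumptions rather than the behavior of _save_results.

```python
import json
import tempfile
from pathlib import Path


def format_scores(category_corrects):
    """Turn per-category 0/1 correctness flags into a text report."""
    lines = []
    for name, flags in category_corrects.items():
        if flags:  # skip empty categories to avoid dividing by zero
            accuracy = 100 * sum(flags) / len(flags)
            lines.append(f"{name:>15}: {accuracy:.2f}")
    return "\n".join(lines)


def save_results(save_dir, category_corrects, results):
    """Write predictions to results.json and scores to results.log."""
    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "results.json").write_text(json.dumps(results, indent=2), encoding="utf-8")
    (out / "results.log").write_text(format_scores(category_corrects), encoding="utf-8")


# Sample data: flags per category, and predictions per subject and index.
category_corrects = {"Average": [1, 0, 1, 1], "STEM": [1, 0], "Other": [1, 1]}
results = {"abstract_algebra": {0: "B", 1: "C"}, "anatomy": {0: "A", 1: "D"}}

with tempfile.TemporaryDirectory() as tmp:
    save_results(tmp, category_corrects, results)
    report = (Path(tmp) / "results.log").read_text(encoding="utf-8")
    saved = json.loads((Path(tmp) / "results.json").read_text(encoding="utf-8"))

print(report)
```

Note that JSON object keys are always strings, so integer example indices come back as "0", "1", and so on when the file is reloaded.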
Usage Examples
from llamafactory.eval.evaluator import Evaluator, run_eval

# Option 1: run evaluation with explicit arguments
evaluator = Evaluator(args={
    "model_name_or_path": "meta-llama/Llama-2-7b-hf",
    "task": "mmlu_test",
    "task_dir": "data/eval",
    "template": "default",
    "n_shot": 5,
    "batch_size": 8,
    "save_dir": "results/mmlu",
    "lang": "en",
})
evaluator.eval()

# Option 2: run from the CLI entry point (parses command-line args)
run_eval()
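After a run with save_dir set, the saved predictions can be inspected offline. The sketch below assumes results.json has the shape described in the Outputs table (subject, then example index, then predicted letter); tally_predictions is a hypothetical helper, and an inline JSON string stands in for the file so the snippet runs on its own.

```python
import json
from collections import Counter


def tally_predictions(results):
    """Count how often each choice letter was predicted, per subject."""
    return {subject: Counter(preds.values()) for subject, preds in results.items()}


# Stand-in for the file contents. With a real run, load the file instead:
#   with open("results/mmlu/results.json", encoding="utf-8") as f:
#       results = json.load(f)
raw = '{"anatomy": {"0": "A", "1": "D", "2": "A"}, "virology": {"0": "C"}}'
results = json.loads(raw)

tally = tally_predictions(results)
for subject, counts in tally.items():
    print(subject, dict(counts))
```

A heavily skewed tally (for example, nearly every prediction being the same letter) can be a quick sign that the template or tokenizer setup does not match the model.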
Related Pages
- Hiyouga_LLaMA_Factory_Data_Args - Provides data-related arguments used during evaluation template construction
- Hiyouga_LLaMA_Factory_Generating_Args - Generation parameters that may apply during evaluation inference
- Hiyouga_LLaMA_Factory_Logging - Logging framework used throughout the evaluation process