Implementation:Hiyouga LLaMA Factory Evaluator

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Evaluation, Benchmarking
Last Updated	2026-02-06 19:00 GMT

Overview

MMLU-style multiple-choice benchmark evaluator that loads a language model, performs batched few-shot inference across subjects, and reports per-category accuracy scores.

Description

The Evaluator class initializes by parsing the evaluation arguments, loading the tokenizer and model, and preparing the evaluation template and the token IDs of the four choice letters (A/B/C/D). The eval method iterates over every subject in the benchmark dataset, formats few-shot examples with EvalTemplate, and runs inference in batches. The batch_inference method, which runs under @torch.inference_mode() so that no autograd state is tracked, extracts the logits for the choice tokens, applies softmax over those four values, and returns the highest-probability letter as the predicted answer. Per-category correctness is accumulated in numpy arrays. When save_dir is set, results are saved as a JSON file and a human-readable log file. The module-level run_eval function provides a convenient entry point.
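The choice-selection step described above can be illustrated without a model. The following is a minimal sketch in plain Python rather than torch. It assumes right-padded inputs, so the last real token of each sequence sits at index sum(attention_mask) - 1. The names predict_choices and softmax, and the nested-list layout of the logits, are illustrative and are not taken from the repository.

```python
import math

CHOICES = ["A", "B", "C", "D"]


def softmax(scores):
    """Numerically stable softmax over a short list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_choices(logits, attention_mask, choice_token_ids):
    """Pick one letter per sequence.

    logits: nested list shaped [batch][seq_len][vocab_size]
    attention_mask: nested list shaped [batch][seq_len], 1 for real tokens
    choice_token_ids: vocabulary IDs of "A", "B", "C", "D" (in that order)
    """
    predictions = []
    for seq_logits, mask in zip(logits, attention_mask):
        last = sum(mask) - 1  # assumption: right padding
        # Restrict the vocabulary to the four choice tokens before softmax.
        choice_scores = [seq_logits[last][tid] for tid in choice_token_ids]
        probs = softmax(choice_scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        predictions.append(CHOICES[best])
    return predictions
```

Restricting the softmax to the four choice tokens means a high logit on any other token cannot change the prediction. Only the relative order of the A/B/C/D scores matters.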

Usage

Use this evaluator to assess model performance on standardized multiple-choice benchmarks such as MMLU and CMMLU. It is invoked via the CLI evaluation command or the run_eval function, typically after training to measure the quality of the fine-tuned model.

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/eval/evaluator.py
Lines: 1-158

Signature

class Evaluator:
    def __init__(self, args: Optional[dict[str, Any]] = None) -> None
    @torch.inference_mode()
    def batch_inference(self, batch_input: dict[str, "torch.Tensor"]) -> list[str]
    def eval(self) -> None
    def _save_results(
        self,
        category_corrects: dict[str, "NDArray"],
        results: dict[str, dict[int, str]]
    ) -> None

def run_eval() -> None

Import

from llamafactory.eval.evaluator import Evaluator, run_eval

I/O Contract

Inputs

Name	Type	Required	Description
args	`Optional[dict[str, Any]]`	No	Optional dictionary of arguments; if None, arguments are parsed from the command line via get_eval_args
eval_args.task	`str`	Yes	Evaluation task identifier in the format {task}_{split} (e.g., "mmlu_test")
eval_args.task_dir	`str`	Yes	Directory or path containing the benchmark dataset files and mapping.json
eval_args.n_shot	`int`	Yes	Number of few-shot examples to include from the training split
eval_args.batch_size	`int`	Yes	Number of examples per inference batch
eval_args.save_dir	`Optional[str]`	No	Directory to save results; if None, results are only printed to stdout
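The task identifier combines the benchmark name and the dataset split in one string. A minimal sketch of separating the two, assuming the split is whatever follows the final underscore. parse_task is a hypothetical helper for illustration and is not a function from the repository.

```python
def parse_task(task: str) -> tuple[str, str]:
    """Split an identifier such as "mmlu_test" into (benchmark, split)."""
    # Assumption: the split name is the part after the last underscore.
    name, sep, split = task.rpartition("_")
    if not sep or not name or not split:
        raise ValueError(f"expected '<task>_<split>', got {task!r}")
    return name, split
```

Validating the identifier up front turns a malformed value such as "mmlu" into a clear error before any dataset files are opened.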

Outputs

Name	Type	Description
results.json	`JSON file`	Per-subject predictions mapping example index to predicted choice letter
results.log	`text file`	Per-category accuracy scores formatted as percentages
stdout	`text`	Printed score information with per-category accuracy
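The per-category scores in results.log and on stdout are accuracy percentages derived from per-example correctness flags. A minimal sketch of that bookkeeping follows. It uses plain Python lists in place of the numpy arrays the evaluator uses, and it assumes an overall "Average" bucket and a right-aligned 15-character name column. The helper names record and format_scores are hypothetical.

```python
def record(category_corrects, category, corrects):
    """Append one subject's correctness flags to its category and to the overall bucket."""
    category_corrects.setdefault(category, []).extend(corrects)
    # Assumption: an "Average" bucket collects every example across categories.
    category_corrects.setdefault("Average", []).extend(corrects)


def format_scores(category_corrects):
    """Render one 'name: percent' line per non-empty category."""
    lines = []
    for name, flags in category_corrects.items():
        if not flags:
            continue  # skip categories with no evaluated examples
        accuracy = 100 * sum(flags) / len(flags)
        lines.append(f"{name:>15}: {accuracy:.2f}")
    return "\n".join(lines)
```

Because every example is also appended to the overall bucket, that average is weighted by example count rather than being a mean of the category means.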

Usage Examples

from llamafactory.eval.evaluator import Evaluator

# Run evaluation with explicit arguments
evaluator = Evaluator(args={
    "model_name_or_path": "meta-llama/Llama-2-7b-hf",
    "task": "mmlu_test",
    "task_dir": "data/eval",
    "template": "default",
    "n_shot": 5,
    "batch_size": 8,
    "save_dir": "results/mmlu",
    "lang": "en",
})
evaluator.eval()

# Run from CLI entry point (parses command-line args)
from llamafactory.eval.evaluator import run_eval
run_eval()

Related Pages

Hiyouga_LLaMA_Factory_Data_Args - Provides data-related arguments used during evaluation template construction
Hiyouga_LLaMA_Factory_Generating_Args - Generation parameters that may apply during evaluation inference
Hiyouga_LLaMA_Factory_Logging - Logging framework used throughout the evaluation process
