Implementation:Hiyouga LLaMA Factory Evaluator


Knowledge Sources
Domains: Evaluation, Benchmarking
Last Updated: 2026-02-06 19:00 GMT

Overview

MMLU-style multiple-choice benchmark evaluator that loads a language model, performs batched few-shot inference across subjects, and reports per-category accuracy scores.

Description

The Evaluator class initializes by parsing the evaluation arguments, loading the tokenizer and model, and preparing the evaluation template and the token IDs of the four choice letters (A/B/C/D).

The eval method iterates over every subject in the benchmark dataset, formats few-shot examples with EvalTemplate, and runs batched inference. For each prompt, batch_inference reads the logits of the four choice tokens, applies softmax over them, and returns the highest-probability letter as the predicted answer. It is decorated with @torch.inference_mode() so that no gradient state is tracked during inference. eval accumulates per-category correctness in numpy arrays, and the results are saved as a JSON file of predictions and a human-readable log file of accuracy scores. The module-level run_eval function is a convenience entry point that builds an Evaluator and calls eval.
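
The prediction step described above can be illustrated with a minimal sketch. It is not the library's code: it uses plain Python lists in place of torch tensors so it runs without dependencies, and the vocabulary size, token IDs, and the helper name predict_choice are all hypothetical.

```python
import math

CHOICES = ["A", "B", "C", "D"]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict_choice(next_token_logits, choice_token_ids):
    """Return the answer letter whose token has the highest probability.

    next_token_logits: vocabulary-sized list of logits for one prompt.
    choice_token_ids: token IDs for "A", "B", "C", "D" (hypothetical values).
    """
    # Restrict the distribution to the four choice tokens only.
    choice_logits = [next_token_logits[i] for i in choice_token_ids]
    probs = softmax(choice_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CHOICES[best], probs


# Toy vocabulary of 10 tokens; IDs 3, 5, 7, 9 stand in for A, B, C, D.
logits = [0.1, -1.0, 0.3, 1.2, 0.0, 2.5, -0.4, 0.9, 0.2, 1.1]
letter, probs = predict_choice(logits, [3, 5, 7, 9])
print(letter)
```

Restricting the softmax to the choice tokens means the prediction is always one of the four letters, even if the model would otherwise prefer some unrelated token.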

Usage

Use this evaluator to assess model performance on standardized multiple-choice benchmarks such as MMLU and CMMLU. It is invoked via the CLI evaluation command or the run_eval function, typically after training to measure the quality of the fine-tuned model.

Code Reference

Source Location

Signature

class Evaluator:
    def __init__(self, args: Optional[dict[str, Any]] = None) -> None
    @torch.inference_mode()
    def batch_inference(self, batch_input: dict[str, "torch.Tensor"]) -> list[str]
    def eval(self) -> None
    def _save_results(
        self,
        category_corrects: dict[str, "NDArray"],
        results: dict[str, dict[int, str]]
    ) -> None

def run_eval() -> None

Import

from llamafactory.eval.evaluator import Evaluator, run_eval

I/O Contract

Inputs

Name | Type | Required | Description
--- | --- | --- | ---
args | Optional[dict[str, Any]] | No | Optional dictionary of arguments; if None, arguments are parsed from the command line via get_eval_args
eval_args.task | str | Yes | Evaluation task identifier in the format {task}_{split} (e.g., "mmlu_test")
eval_args.task_dir | str | Yes | Directory containing the benchmark dataset files and mapping.json
eval_args.n_shot | int | Yes | Number of few-shot examples to include from the training split
eval_args.batch_size | int | Yes | Number of examples per inference batch
eval_args.save_dir | Optional[str] | No | Directory to save results; if None, results are only printed to stdout
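
As a rough illustration of how the task, n_shot, and batch_size inputs interact, the sketch below splits a task identifier, builds a few-shot prompt, and groups prompts into batches. Everything here is an assumption made for illustration: the helper names, the example field names, and the prompt layout are hypothetical and do not reproduce EvalTemplate's actual format.

```python
def parse_task(task):
    """Split an identifier such as "mmlu_test" into ("mmlu", "test")."""
    name, split = task.split("_", 1)
    return name, split


def format_example(example, with_answer):
    """Render one question with its four options (hypothetical layout)."""
    lines = [example["question"]]
    lines += [f"{letter}. {example[letter]}" for letter in "ABCD"]
    lines.append("Answer:" + (f" {example['answer']}" if with_answer else ""))
    return "\n".join(lines)


def build_prompt(support, target, n_shot):
    """Prefix the target question with the first n_shot solved examples."""
    blocks = [format_example(ex, with_answer=True) for ex in support[:n_shot]]
    blocks.append(format_example(target, with_answer=False))
    return "\n\n".join(blocks)


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up data standing in for the training split and one test question.
support = [
    {"question": "2 + 2 = ?", "A": "3", "B": "4", "C": "5", "D": "6", "answer": "B"},
    {"question": "Capital of France?", "A": "Paris", "B": "Rome", "C": "Oslo", "D": "Bern", "answer": "A"},
]
target = {"question": "Largest planet?", "A": "Mars", "B": "Venus", "C": "Jupiter", "D": "Earth", "answer": "C"}

benchmark, split = parse_task("mmlu_test")
prompt = build_prompt(support, target, n_shot=2)
batches = list(batched(list(range(5)), batch_size=2))
print(benchmark, split, len(batches))
```

The target question is left with an empty "Answer:" line so that the model's next token is the answer letter.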

Outputs

Name | Type | Description
--- | --- | ---
results.json | JSON file | Per-subject predictions mapping example index to predicted choice letter
results.log | text file | Per-category accuracy scores formatted as percentages
stdout | text | Printed score information with per-category accuracy
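
The shape of these outputs can be sketched as follows. This is an illustration under stated assumptions, not the evaluator's code: it uses plain lists instead of numpy arrays, and the category names, the "Average" bucket, the column width, and the helper names are hypothetical.

```python
import json


def update_corrects(category_corrects, category, predictions, labels):
    """Record per-example hits for one category and for the overall average."""
    hits = [pred == label for pred, label in zip(predictions, labels)]
    category_corrects.setdefault(category, []).extend(hits)
    category_corrects.setdefault("Average", []).extend(hits)
    return hits


def format_scores(category_corrects):
    """Render each category's accuracy as a percentage with two decimals."""
    return "\n".join(
        f"{name:>15}: {100 * sum(hits) / len(hits):.2f}"
        for name, hits in category_corrects.items()
        if hits
    )


corrects = {}
results = {}

# Made-up predictions and labels for two subjects in two categories.
subjects = [
    ("abstract_algebra", "STEM", ["A", "C", "B", "D"], ["A", "B", "B", "D"]),
    ("philosophy", "Humanities", ["B", "B"], ["B", "A"]),
]
for subject, category, preds, labels in subjects:
    update_corrects(corrects, category, preds, labels)
    # Map example index to predicted letter, as in results.json.
    results[subject] = {index: pred for index, pred in enumerate(preds)}

report = format_scores(corrects)   # text in the style of results.log
payload = json.dumps(results, indent=2)
print(report)
```

Note that JSON object keys are always strings, so integer example indices come back as "0", "1", and so on when the file is read again.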

Usage Examples

from llamafactory.eval.evaluator import Evaluator, run_eval

# Option 1: run evaluation with explicit arguments
evaluator = Evaluator(args={
    "model_name_or_path": "meta-llama/Llama-2-7b-hf",
    "task": "mmlu_test",        # {task}_{split}
    "task_dir": "data/eval",    # contains the dataset files and mapping.json
    "template": "default",
    "n_shot": 5,                # few-shot examples taken from the training split
    "batch_size": 8,
    "save_dir": "results/mmlu", # results.json and results.log are written here
    "lang": "en",
})
evaluator.eval()

# Option 2 (use instead of Option 1, not in addition to it):
# run from the entry point, which parses command-line args
run_eval()
