Implementation: unslothai/unsloth evaluate_model_aime (AIME)
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool from the Unsloth test utilities for evaluating mathematical reasoning on AIME benchmark datasets with Pass@K sampling.
Description
evaluate_model_aime runs batched inference on combined AIME 2024 + AIME 2025-I + AIME 2025-II datasets. For each problem, it generates K completions using vLLM SamplingParams, extracts numerical answers via regex, and computes Pass@K accuracy. Results include per-dataset breakdowns and token usage statistics.
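The Pass@K computation can be sketched as follows. The estimator below is the standard unbiased Pass@K formula; the answer-extraction regex is purely illustrative (the actual extraction logic lives in tests/utils/aime_eval.py):

```python
import re
from math import comb

def extract_answer(text: str):
    """Pull the last 1-3 digit integer from a completion; AIME answers
    are integers 0-999. (Illustrative regex, not the real extraction.)"""
    matches = re.findall(r"\d{1,3}", text)
    return matches[-1] if matches else None

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn and c = correct samples among them."""
    if n - c < k:
        return 1.0  # a correct sample is guaranteed in every size-k subset draw
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n_sampling=8, a problem counts toward Pass@8 if any of the 8 completions extracts to the correct answer.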
Usage
Call after GRPO training to measure reasoning improvement. Requires a vLLM-capable model (loaded with fast_inference=True) or a model path for vLLM offline inference.
Code Reference
Source Location
- Repository: unsloth
- File: tests/utils/aime_eval.py
- Lines: L195-545
Signature
def evaluate_model_aime(
    model,
    tokenizer,
    model_type = "base",
    lora_request = None,
    temperature = 0.3,
    n_sampling = 8,
    max_tokens = 32768,
    top_p = 0.95,
    seed = 0,
) -> dict:
    """
    Evaluate model on combined AIME dataset with official configuration.

    Args:
        model: vLLM LLM instance or model path string.
        tokenizer: HuggingFace tokenizer.
        model_type (str): Label for logging ("base", "sft", "grpo").
        lora_request: Optional vLLM LoRA adapter request.
        temperature (float): Sampling temperature. Default 0.3.
        n_sampling (int): Pass@K samples per question. Default 8.
        max_tokens (int): Max generation tokens. Default 32768.
        top_p (float): Top-p sampling. Default 0.95.
        seed (int): Random seed. Default 0.

    Returns:
        Dict with Pass@K accuracy, per-source breakdowns, token usage.
    """
Import
from tests.utils.aime_eval import evaluate_model_aime
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | LLM or str | Yes | vLLM model instance or model path |
| tokenizer | PreTrainedTokenizer | Yes | HuggingFace tokenizer |
| model_type | str | No | Label for logging (default: "base") |
| n_sampling | int | No | Pass@K sample count (default: 8) |
| temperature | float | No | Sampling temperature (default: 0.3) |
| max_tokens | int | No | Max generation length (default: 32768) |
Outputs
| Name | Type | Description |
|---|---|---|
| results | dict | Pass@K accuracy, per-source-dataset breakdowns (AIME 2024, 2025-I, 2025-II), token usage statistics |
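An illustrative shape for the returned dict follows. The key names here are assumptions inferred from the contract above, not the exact schema; consult aime_eval.py for the real keys:

```python
# Hypothetical result structure -- key names are illustrative assumptions.
results = {
    "accuracy": 0.4,  # overall Pass@K across all 60 problems
    "per_source": {
        "aime_2024":    {"correct": 12, "total": 30},
        "aime_2025_I":  {"correct": 6,  "total": 15},
        "aime_2025_II": {"correct": 6,  "total": 15},
    },
    "total_completion_tokens": 1_234_567,
}

# The overall accuracy should be consistent with the per-source counts:
overall = (
    sum(s["correct"] for s in results["per_source"].values())
    / sum(s["total"] for s in results["per_source"].values())
)
```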
Usage Examples
Evaluate After GRPO Training
from tests.utils.aime_eval import evaluate_model_aime

# model loaded with fast_inference=True
results = evaluate_model_aime(
    model=model,
    tokenizer=tokenizer,
    model_type="grpo",
    n_sampling=8,
    temperature=0.3,
)
print(f"Pass@8 Accuracy: {results['accuracy']}")