Implementation: unslothai/unsloth evaluate_model_aime (AIME)
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool from the Unsloth test utilities for evaluating mathematical reasoning on AIME benchmark datasets with Pass@K sampling.
Description
evaluate_model_aime runs batched inference on combined AIME 2024 + AIME 2025-I + AIME 2025-II datasets. For each problem, it generates K completions using vLLM SamplingParams, extracts numerical answers via regex, and computes Pass@K accuracy. Results include per-dataset breakdowns and token usage statistics.
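The Pass@K computation can be sketched as follows. The estimator below is the standard unbiased Pass@K formula; the answer-extraction regex is purely illustrative (the actual extraction logic lives in tests/utils/aime_eval.py):

```python
import re
from math import comb

def extract_answer(text: str):
    """Pull the last 1-3 digit integer from a completion; AIME answers
    are integers 0-999. (Illustrative regex, not the real extraction.)"""
    matches = re.findall(r"\d{1,3}", text)
    return matches[-1] if matches else None

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn and c = correct samples among them."""
    if n - c < k:
        return 1.0  # a correct sample is guaranteed in every size-k subset draw
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n_sampling=8, a problem counts toward Pass@8 if any of the 8 completions extracts to the correct answer.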
Usage
Call after GRPO training to measure reasoning improvement. Requires a vLLM-capable model (loaded with fast_inference=True) or a model path for vLLM offline inference.
Code Reference
Source Location
- Repository: unsloth
- File: tests/utils/aime_eval.py
- Lines: L195-545
Signature
def evaluate_model_aime(
    model,
    tokenizer,
    model_type = "base",
    lora_request = None,
    temperature = 0.3,
    n_sampling = 8,
    max_tokens = 32768,
    top_p = 0.95,
    seed = 0,
) -> dict:
    """
    Evaluate model on combined AIME dataset with official configuration.

    Args:
        model: vLLM LLM instance or model path string.
        tokenizer: HuggingFace tokenizer.
        model_type (str): Label for logging ("base", "sft", "grpo").
        lora_request: Optional vLLM LoRA adapter request.
        temperature (float): Sampling temperature. Default 0.3.
        n_sampling (int): Pass@K samples per question. Default 8.
        max_tokens (int): Max generation tokens. Default 32768.
        top_p (float): Top-p sampling. Default 0.95.
        seed (int): Random seed. Default 0.

    Returns:
        Dict with Pass@K accuracy, per-source breakdowns, token usage.
    """
Import
from tests.utils.aime_eval import evaluate_model_aime
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | LLM or str | Yes | vLLM model instance or model path |
| tokenizer | PreTrainedTokenizer | Yes | HuggingFace tokenizer |
| model_type | str | No | Label for logging (default: "base") |
| n_sampling | int | No | Pass@K sample count (default: 8) |
| temperature | float | No | Sampling temperature (default: 0.3) |
| max_tokens | int | No | Max generation length (default: 32768) |
Outputs
| Name | Type | Description |
|---|---|---|
| results | dict | Pass@K accuracy, per-source-dataset breakdowns (AIME 2024, 2025-I, 2025-II), token usage statistics |
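An illustrative shape for the returned dict follows. The key names here are assumptions inferred from the contract above, not the exact schema; consult aime_eval.py for the real keys:

```python
# Hypothetical result structure -- key names are illustrative assumptions.
results = {
    "accuracy": 0.4,  # overall Pass@K across all 60 problems
    "per_source": {
        "aime_2024":    {"correct": 12, "total": 30},
        "aime_2025_I":  {"correct": 6,  "total": 15},
        "aime_2025_II": {"correct": 6,  "total": 15},
    },
    "total_completion_tokens": 1_234_567,
}

# The overall accuracy should be consistent with the per-source counts:
overall = (
    sum(s["correct"] for s in results["per_source"].values())
    / sum(s["total"] for s in results["per_source"].values())
)
```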
Usage Examples
Evaluate After GRPO Training
from tests.utils.aime_eval import evaluate_model_aime

# model loaded with fast_inference=True
results = evaluate_model_aime(
    model=model,
    tokenizer=tokenizer,
    model_type="grpo",
    n_sampling=8,
    temperature=0.3,
)
print(f"Pass@8 Accuracy: {results['accuracy']}")