
Implementation:Unslothai Unsloth Evaluate Model AIME

From Leeroopedia


Knowledge Sources
Domains: Evaluation, NLP, Reinforcement_Learning
Last Updated: 2026-02-07 00:00 GMT

Overview

A concrete tool from the Unsloth test utilities for evaluating mathematical reasoning on the AIME benchmark datasets with Pass@K sampling.

Description

evaluate_model_aime runs batched inference on combined AIME 2024 + AIME 2025-I + AIME 2025-II datasets. For each problem, it generates K completions using vLLM SamplingParams, extracts numerical answers via regex, and computes Pass@K accuracy. Results include per-dataset breakdowns and token usage statistics.
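
The extract-and-score step described above can be sketched as follows. This is an illustrative sketch, not the exact code in aime_eval.py: the regex patterns and helper names are assumptions. It relies on the fact that AIME answers are integers from 0 to 999, prefers the last \boxed{...} in a completion, and marks a problem as solved if any of its K samples matches the gold answer.

```python
import re


def extract_answer(completion: str):
    """Pull a candidate AIME answer (an integer 0-999) from a completion.

    Illustrative heuristic: prefer the last \\boxed{...}; otherwise fall
    back to the last standalone 1-3 digit integer in the text.
    """
    boxed = re.findall(r"\\boxed\{(\d{1,3})\}", completion)
    if boxed:
        return boxed[-1]
    numbers = re.findall(r"\b\d{1,3}\b", completion)
    return numbers[-1] if numbers else None


def pass_at_k_accuracy(samples_per_problem, gold_answers):
    """Fraction of problems where any of the K samples is correct."""
    solved = 0
    for samples, gold in zip(samples_per_problem, gold_answers):
        if any(extract_answer(s) == gold for s in samples):
            solved += 1
    return solved / len(gold_answers)
```

With K samples per problem, a problem counts as solved if at least one sample extracts to the gold answer, which is the "any-of-K" reading of Pass@K.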

Usage

Call this function after GRPO training to measure the improvement in reasoning. It requires either a vLLM-capable model (loaded with fast_inference=True) or a model path string for vLLM offline inference.

Code Reference

Source Location

  • Repository: unsloth
  • File: tests/utils/aime_eval.py
  • Lines: L195-545

Signature

def evaluate_model_aime(
    model,
    tokenizer,
    model_type = "base",
    lora_request = None,
    temperature = 0.3,
    n_sampling = 8,
    max_tokens = 32768,
    top_p = 0.95,
    seed = 0,
) -> dict:
    """
    Evaluate model on combined AIME dataset with official configuration.

    Args:
        model: vLLM LLM instance or model path string.
        tokenizer: HuggingFace tokenizer.
        model_type (str): Label for logging ("base", "sft", "grpo").
        lora_request: Optional vLLM LoRA adapter request.
        temperature (float): Sampling temperature. Default 0.3.
        n_sampling (int): Pass@K samples per question. Default 8.
        max_tokens (int): Max generation tokens. Default 32768.
        top_p (float): Top-p sampling. Default 0.95.
        seed (int): Random seed. Default 0.

    Returns:
        Dict with Pass@K accuracy, per-source breakdowns, token usage.
    """

Import

from tests.utils.aime_eval import evaluate_model_aime

I/O Contract

Inputs

Name         | Type                | Required | Description
model        | LLM or str          | Yes      | vLLM model instance or model path
tokenizer    | PreTrainedTokenizer | Yes      | HuggingFace tokenizer
model_type   | str                 | No       | Label for logging (default: "base")
lora_request | LoRARequest or None | No       | Optional vLLM LoRA adapter request (default: None)
temperature  | float               | No       | Sampling temperature (default: 0.3)
n_sampling   | int                 | No       | Pass@K sample count (default: 8)
max_tokens   | int                 | No       | Max generation length (default: 32768)
top_p        | float               | No       | Top-p sampling cutoff (default: 0.95)
seed         | int                 | No       | Random seed (default: 0)

Outputs

Name    | Type | Description
results | dict | Pass@K accuracy, per-source-dataset breakdowns (AIME 2024, 2025-I, 2025-II), token usage statistics

Usage Examples

Evaluate After GRPO Training

from tests.utils.aime_eval import evaluate_model_aime

# model loaded with fast_inference=True
results = evaluate_model_aime(
    model=model,
    tokenizer=tokenizer,
    model_type="grpo",
    n_sampling=8,
    temperature=0.3,
)

print(f"Pass@8 Accuracy: {results['accuracy']}")
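
The per-source breakdowns can be inspected in the same way. The key names below (per_source, total_completion_tokens) and the numbers are illustrative assumptions about the shape of the returned dict, not guaranteed by aime_eval.py; check the actual keys returned by your Unsloth version.

```python
# Hypothetical shape of the returned dict; keys and values are assumptions.
results = {
    "accuracy": 0.40,
    "per_source": {
        "AIME 2024": {"accuracy": 0.43},
        "AIME 2025-I": {"accuracy": 0.40},
        "AIME 2025-II": {"accuracy": 0.37},
    },
    "total_completion_tokens": 1_250_000,
}

# Print one Pass@8 line per source dataset.
for source, stats in results["per_source"].items():
    print(f"{source}: Pass@8 = {stats['accuracy']:.2%}")
```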

Related Pages

Implements Principle

Requires Environment
