# Principle: Unsloth AIME Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
An evaluation methodology that measures mathematical reasoning capability using American Invitational Mathematics Examination (AIME) competition problems with Pass@K sampling.
## Description
AIME evaluation provides a standardized benchmark for assessing how well a language model can perform multi-step mathematical reasoning. The evaluation uses real AIME competition problems (2024, 2025-I, 2025-II) and measures Pass@K accuracy: for each problem, K completions are sampled, and the problem is considered solved if any of the K samples produces the correct answer.
Key aspects:
- **Pass@K Sampling**: Multiple completions per problem reduce variance and capture the model's best-case capability.
- **Answer Extraction**: Regex-based extraction of numerical answers from model completions (AIME answers are always integers 000-999).
- **Per-Dataset Breakdown**: Results are broken down by source dataset (AIME 2024, 2025-I, 2025-II) for detailed analysis.
- **vLLM Integration**: Uses vLLM SamplingParams for efficient batched generation during evaluation.
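The answer-extraction step can be sketched with the standard-library `re` module. This is a minimal illustration, not the project's actual extraction code: the `extract_answer` name and the exact patterns (a `\boxed{...}` span first, then the last standalone 1-3 digit integer) are assumptions.

```python
import re

def extract_answer(completion: str):
    """Extract an AIME answer (integer 000-999) from a model completion.

    Illustrative sketch: tries a \\boxed{...} span first, then falls
    back to the last standalone 1-3 digit integer in the text.
    """
    m = re.search(r"\\boxed\{(\d{1,3})\}", completion)
    if m:
        return m.group(1).zfill(3)
    nums = re.findall(r"\b(\d{1,3})\b", completion)
    return nums[-1].zfill(3) if nums else None

print(extract_answer(r"The answer is \boxed{73}."))  # prints 073
print(extract_answer("so the result is 204"))        # prints 204
```

Zero-padding to three digits keeps comparisons against AIME ground-truth answers consistent regardless of how the model formats its output.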
## Usage
Use this principle to evaluate reasoning models after GRPO or SFT training on mathematical datasets. It is particularly useful as a checkpoint metric during RL training for tracking reasoning improvement over time.
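When tracking checkpoints, the per-dataset breakdown mentioned above reduces to simple aggregation of per-problem solved flags. A minimal sketch, assuming results arrive as `(dataset_name, solved)` pairs (the function name and input shape are illustrative):

```python
from collections import defaultdict

def pass_at_k_by_dataset(results):
    """Aggregate (dataset_name, solved) pairs into Pass@K per dataset.

    Example input: [("aime_2024", True), ("aime_2025_I", False), ...]
    Returns a dict mapping dataset name -> fraction of problems solved.
    """
    totals = defaultdict(lambda: [0, 0])  # dataset -> [solved, total]
    for name, solved in results:
        totals[name][0] += int(solved)
        totals[name][1] += 1
    return {name: s / t for name, (s, t) in totals.items()}
```

Reporting these per-dataset numbers side by side across checkpoints makes it easy to see whether improvement is broad or driven by one exam year.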
## Theoretical Basis
Pass@K estimates the probability that at least one of K samples is correct:
$$\text{Pass@K} = 1 - \frac{\binom{n-c}{K}}{\binom{n}{K}}$$
where $n$ is the total number of samples per problem and $c$ is the number of correct samples. In practice, with exactly K samples per problem:
```python
# Abstract Pass@K evaluation loop (pseudocode)
solved = 0
for problem in aime_problems:
    completions = model.generate(problem.prompt, n=K, temperature=0.3)
    answers = [extract_answer(c) for c in completions]
    if any(a == problem.ground_truth for a in answers):
        solved += 1

# Pass@K = fraction of problems with at least one correct sample
pass_at_k = solved / len(aime_problems)
```
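When more than K samples per problem are available, the unbiased estimator from the formula above can be computed directly with binomial coefficients via `math.comb`. A minimal sketch (the `pass_at_k` function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate: probability that at least one of k
    samples drawn (without replacement) from n total samples, of which
    c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k draw
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=4 samples of which c=2 are correct, Pass@2 is 1 - C(2,2)/C(4,2) = 5/6.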