# Principle: Unsloth AIME Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
An evaluation methodology that measures mathematical reasoning capability using American Invitational Mathematics Examination (AIME) competition problems with Pass@K sampling.
## Description
AIME evaluation provides a standardized benchmark for assessing how well a language model can perform multi-step mathematical reasoning. The evaluation uses real AIME competition problems (2024, 2025-I, 2025-II) and measures Pass@K accuracy: for each problem, K completions are sampled, and the problem is considered solved if any of the K samples produces the correct answer.
Key aspects:
- **Pass@K Sampling**: Multiple completions per problem reduce variance and capture the model's best-case capability.
- **Answer Extraction**: Regex-based extraction of numerical answers from model completions (AIME answers are always integers 000-999).
- **Per-Dataset Breakdown**: Results are broken down by source dataset (AIME 2024, 2025-I, 2025-II) for detailed analysis.
- **vLLM Integration**: Uses vLLM SamplingParams for efficient batched generation during evaluation.
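The answer-extraction step can be sketched with the standard-library `re` module. This is a minimal illustration, not the project's actual extraction code: the `extract_answer` name and the exact patterns (a `\boxed{...}` span first, then the last standalone 1-3 digit integer) are assumptions.

```python
import re

def extract_answer(completion: str):
    """Extract an AIME answer (integer 000-999) from a model completion.

    Illustrative sketch: tries a \\boxed{...} span first, then falls
    back to the last standalone 1-3 digit integer in the text.
    """
    m = re.search(r"\\boxed\{(\d{1,3})\}", completion)
    if m:
        return m.group(1).zfill(3)
    nums = re.findall(r"\b(\d{1,3})\b", completion)
    return nums[-1].zfill(3) if nums else None

print(extract_answer(r"The answer is \boxed{73}."))  # prints 073
print(extract_answer("so the result is 204"))        # prints 204
```

Zero-padding to three digits keeps comparisons against AIME ground-truth answers consistent regardless of how the model formats its output.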
## Usage
Use this principle to evaluate reasoning models after GRPO or SFT training on mathematical datasets. It is particularly useful as a checkpoint metric during RL training for tracking reasoning improvement over time.
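When tracking checkpoints, the per-dataset breakdown mentioned above reduces to simple aggregation of per-problem solved flags. A minimal sketch, assuming results arrive as `(dataset_name, solved)` pairs (the function name and input shape are illustrative):

```python
from collections import defaultdict

def pass_at_k_by_dataset(results):
    """Aggregate (dataset_name, solved) pairs into Pass@K per dataset.

    Example input: [("aime_2024", True), ("aime_2025_I", False), ...]
    Returns a dict mapping dataset name -> fraction of problems solved.
    """
    totals = defaultdict(lambda: [0, 0])  # dataset -> [solved, total]
    for name, solved in results:
        totals[name][0] += int(solved)
        totals[name][1] += 1
    return {name: s / t for name, (s, t) in totals.items()}
```

Reporting these per-dataset numbers side by side across checkpoints makes it easy to see whether improvement is broad or driven by one exam year.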
## Theoretical Basis
Pass@K estimates the probability that at least one of K samples is correct:
$$\text{Pass@K} = 1 - \frac{\binom{n-c}{K}}{\binom{n}{K}}$$
where $n$ is the total number of samples per problem and $c$ is the number of correct samples. In practice, with exactly K samples per problem:
```python
# Abstract Pass@K evaluation loop (pseudocode)
solved = 0
for problem in aime_problems:
    completions = model.generate(problem.prompt, n=K, temperature=0.3)
    answers = [extract_answer(c) for c in completions]
    if any(a == problem.ground_truth for a in answers):
        solved += 1

# Pass@K = fraction of problems with at least one correct sample
pass_at_k = solved / len(aime_problems)
```
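When more than K samples per problem are available, the unbiased estimator from the formula above can be computed directly with binomial coefficients via `math.comb`. A minimal sketch (the `pass_at_k` function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate: probability that at least one of k
    samples drawn (without replacement) from n total samples, of which
    c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k draw
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=4 samples of which c=2 are correct, Pass@2 is 1 - C(2,2)/C(4,2) = 5/6.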