
Principle:Unslothai Unsloth AIME Evaluation

From Leeroopedia


Knowledge Sources
Domains Evaluation, NLP, Reinforcement_Learning
Last Updated 2026-02-07 00:00 GMT

Overview

An evaluation methodology that measures mathematical reasoning capability using American Invitational Mathematics Examination (AIME) competition problems with Pass@K sampling.

Description

AIME evaluation provides a standardized benchmark for assessing how well a language model can perform multi-step mathematical reasoning. The evaluation uses real AIME competition problems (2024, 2025-I, 2025-II) and measures Pass@K accuracy: for each problem, K completions are sampled, and the problem is considered solved if any of the K samples produces the correct answer.

Key aspects:

  1. Pass@K Sampling: Multiple completions per problem reduce variance and capture the model's best-case capability.
  2. Answer Extraction: Regex-based extraction of numerical answers from model completions (AIME answers are always integers 000-999).
  3. Per-Dataset Breakdown: Results are broken down by source dataset (AIME 2024, 2025-I, 2025-II) for detailed analysis.
  4. vLLM Integration: Uses vLLM SamplingParams for efficient batched generation during evaluation.
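The answer-extraction step above can be sketched as a small helper. This is a minimal illustration, not Unsloth's actual implementation: the `Answer: N` completion convention and the function name are assumptions; it only encodes the constraint from the list above that AIME answers are integers in 000-999.

```python
import re


def extract_answer(completion: str):
    """Extract the final integer answer from a model completion.

    Hypothetical sketch: assumes the model is prompted to end with
    'Answer: N'. AIME answers are integers in the range 000-999,
    so anything outside 1-3 digits is rejected.
    """
    match = re.search(r"Answer:\s*(\d{1,3})\b", completion)
    if match is None:
        return None
    value = int(match.group(1))
    return value if 0 <= value <= 999 else None


print(extract_answer("The three cases sum to 204. Answer: 204"))  # -> 204
print(extract_answer("I could not finish this problem."))         # -> None
```

Leading zeros are handled by the `int()` conversion (e.g. `Answer: 042` yields `42`), which matters when comparing against ground-truth answers stored as integers.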

Usage

Use this principle to evaluate reasoning models after GRPO or SFT training on mathematical datasets. Particularly useful as a checkpoint metric during RL training to track reasoning improvement over time.

Theoretical Basis

Pass@K estimates the probability that at least one of K samples is correct:

\text{Pass@K} = 1 - \frac{\binom{n-c}{K}}{\binom{n}{K}}

where n is the total number of samples drawn per problem and c is the number of those samples that are correct. In practice, with exactly K samples per problem:

# Abstract Pass@K evaluation (K samples per problem)
num_solved = 0
for problem in aime_problems:
    completions = model.generate(problem, n=K, temperature=0.3)
    answers = [extract_answer(c) for c in completions]
    if any(a == problem.ground_truth for a in answers):
        num_solved += 1

# Pass@K = fraction of problems with at least one correct sample
pass_at_k = num_solved / len(aime_problems)
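The unbiased estimator from the formula above can also be computed directly when more than K samples are drawn per problem. A minimal sketch, assuming the standard combinatorial form (the function name is illustrative):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: probability that at least one of k
    draws (without replacement) from n samples, c of which are correct,
    is a correct sample.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k must
        # contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(8, 2, 4))  # -> 0.7857142857142857 (= 1 - C(6,4)/C(8,4))
```

The early return avoids evaluating `comb(n - c, k)` with fewer items than draws, and per-problem estimates are averaged over the benchmark to get the reported Pass@K.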
