Implementation:EvolvingLMMs Lab Lmms eval aime utils

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Mathematical Reasoning, Answer Extraction, Evaluation Metrics
Last Updated	2026-02-14 00:00 GMT

Overview

Utilities for evaluating AIME (American Invitational Mathematics Examination) tasks with answer extraction and pass@k metrics.

Description

This module provides utilities for evaluating mathematical reasoning tasks, particularly AIME problems. It includes prompt templates with a configurable thinking budget (steps or tokens), answer extraction from model outputs using regex patterns and LLM-based matching, processing of LaTeX \boxed{...} answers, multi-sample evaluation metrics (coverage@k, majority@k, average@k), and integration with GPT-4o-mini for answer equivalence checking. The code handles answers given as LaTeX, numeric values, or free text.

Usage

Use this module when evaluating mathematical reasoning tasks that require extracting answers from verbose model outputs, comparing answers with tolerance for formatting differences (e.g., "2/3" vs "-2/(-3)"), computing pass@k-style metrics (cov@k, maj@k, avg@k) from multiple samples, or handling LaTeX mathematical notation in answers.

Code Reference

Source Location

Repository: EvolvingLMMs_Lab_Lmms_eval
File: lmms_eval/tasks/aime/utils.py
Lines: 1-350

Signature

# Query templates (configured via environment variables)
QUERY_TEMPLATE: str  # Configured via PROMPTSTEP, PROMPTTOKEN, PROMPTLONG, PROMPTSHORT

ANSWER_PATTERN: str = r"(?i)Answer\s*:\s*(.*)"

EXTRACTION_TEMPLATE_IDX: str  # Template for LLM-based answer matching

def doc_to_text(doc: dict) -> str

def process_docs(dataset: datasets.Dataset) -> datasets.Dataset

def process_results(doc: dict, results: List[str]) -> Dict[str, int]

def last_boxed_only_string(string: str) -> Optional[str]

def remove_boxed(s: str) -> str

def extract_answer_idx(sampler, options: List[str], attempt: str) -> str

class ChatCompletionSampler:
    def __init__(
        self,
        model: str = "gpt-4o-mini",
        system_message: str | None = None,
        temperature: float = 0.5,
        max_tokens: int = 1024,
    )

    def __call__(self, message_list) -> str

Import

from lmms_eval.tasks.aime.utils import (
    doc_to_text,
    process_docs,
    process_results,
    ChatCompletionSampler
)

I/O Contract

Inputs

Name	Type	Required	Description
doc	dict	Yes	Document with 'problem', 'solution', 'answer' keys
results	List[str]	Yes	Model-generated answers (multiple for pass@k metrics)
dataset	datasets.Dataset	Yes	HuggingFace dataset to process

Outputs

Name	Type	Description
processed_doc	dict	Standardized document with required fields
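metrics	Dict[str, int]	Evaluation metrics including exact_match, cov@k, maj@k, avg@k (annotated as int; in the usage example avg@k is a fraction, and exact_matches and extracted_answers are lists)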
query_text	str	Formatted question text with prompt template

Usage Examples

Basic Document Processing

from lmms_eval.tasks.aime.utils import doc_to_text, process_docs
import datasets

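# Load a single split (without split=, load_dataset returns a DatasetDict,
# which cannot be indexed with processed[0]); the split name "train" is an
# assumption, use whichever split your dataset provides
dataset = datasets.load_dataset("path/to/aime", split="train")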

# Process documents
processed = process_docs(dataset)

# Convert to query text
doc = processed[0]
query = doc_to_text(doc)
print(query)
# Example output (with PROMPTSTEP=10): "What is 2+2?\n\nThink for up to 10 steps."

Environment-Based Prompt Configuration

# Configure thinking steps
export PROMPTSTEP=10
python evaluate.py --tasks aime

# Or configure token budget
export PROMPTTOKEN=2000
python evaluate.py --tasks aime

# Or use long/short thinking
export PROMPTLONG=1
# or
export PROMPTSHORT=1
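
As a rough illustration of how such variables could select a prompt suffix, here is a minimal sketch. The function name build_prompt_suffix, the precedence order among the four variables, and every suffix wording other than the "steps" one shown in the earlier example output are assumptions, not the module's actual template logic.

```python
import os
from typing import Mapping, Optional


def build_prompt_suffix(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: pick a thinking-budget suffix from environment variables."""
    env = os.environ if env is None else env
    # Assumed precedence: explicit step budget, then token budget, then long/short.
    if "PROMPTSTEP" in env:
        return f"Think for up to {env['PROMPTSTEP']} steps."
    if "PROMPTTOKEN" in env:
        return f"Think for up to {env['PROMPTTOKEN']} tokens."  # assumed wording
    if "PROMPTLONG" in env:
        return "Answer after a long amount of thinking."  # assumed wording
    if "PROMPTSHORT" in env:
        return "Answer after a short amount of thinking."  # assumed wording
    return ""  # no budget configured


question = "What is 2+2?"
suffix = build_prompt_suffix({"PROMPTSTEP": "10"})
print(f"{question}\n\n{suffix}")
```

Passing a mapping instead of reading os.environ directly keeps the sketch easy to test without mutating the process environment.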

Evaluating Multiple Samples (Pass@k)

import os

# Set up GPT-4o-mini processor for answer extraction.
# Set these before importing the module, in case it reads them at import time.
os.environ["PROCESSOR"] = "gpt-4o-mini"
os.environ["OPENAI_API_KEY"] = "your-key"

from lmms_eval.tasks.aime.utils import process_results

doc = {
    "problem": "Solve for x: 2x + 4 = 10",
    "answer": "3",
    "solution": "Subtract 4: 2x = 6, divide by 2: x = 3"
}

# Multiple model responses
results = [
    "The answer is x=3",
    "I think x equals 2.9999...",
    "x = 3.0",
    "Hmm, maybe x=2?",
    # ... 60 more samples for pass@64
]

metrics = process_results(doc, results)
print(metrics)
# Output:
# {
#     'exact_match': 1,  # First result correct
#     'exact_matches': [1, 0, 1, 0, ...],
#     'cov@2': 1,  # At least one correct in first 2
#     'maj@2': 1,  # Majority vote correct
#     'avg@2': 0.5,  # 50% correct
#     'cov@4': 1,
#     'maj@4': 1,
#     'avg@4': 0.5,
#     # ... up to @64
#     'extracted_answers': ['3', '2.9999...', '3.0', '2', ...]
# }

LaTeX Answer Extraction

from lmms_eval.tasks.aime.utils import last_boxed_only_string, remove_boxed

response = r"""
Let me solve this step by step:
... lots of work ...
Therefore, the answer is $\boxed{42}$.
"""

boxed = last_boxed_only_string(response)
print(boxed)  # Output: "\\boxed{42}"

answer = remove_boxed(boxed)
print(answer)  # Output: "42"

LLM-Based Answer Matching

from lmms_eval.tasks.aime.utils import ChatCompletionSampler, extract_answer_idx

sampler = ChatCompletionSampler(model="gpt-4o-mini")

options = ["2/3", "3/2", "1/2"]
attempt = "The answer is -2 * (-1/3)"

# Returns the 1-based index as a string, or "-1" if no option matches
idx = extract_answer_idx(sampler, options, attempt)
print(idx)  # Output: "1" (matches first option after simplification)
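
Because the matcher returns its index as text, callers need to validate it before indexing into the options. The following is a minimal sketch of that validation; resolve_option is a hypothetical helper, not part of the module, and returning None for unusable indices is an assumption standing in for "leave the answer unchanged".

```python
from typing import List, Optional


def resolve_option(raw_idx: str, options: List[str]) -> Optional[str]:
    """Hypothetical helper: map a 1-based index string to an option, or None."""
    try:
        idx = int(raw_idx.strip())
    except ValueError:
        return None  # non-integer reply from the LLM
    if idx == -1:
        return None  # the matcher reported that no option matches
    if not 1 <= idx <= len(options):
        return None  # out-of-bounds index
    return options[idx - 1]  # convert from 1-based to 0-based


options = ["2/3", "3/2", "1/2"]
print(resolve_option("1", options))
```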

Implementation Details

Answer Extraction Strategy

Boxed Extraction: First tries to find \boxed{...} in the response
Pattern Matching: Falls back to regex pattern matching "Answer: ..."
Normalization: Converts numeric strings to a canonical integer form (e.g., "023" → "23")
LLM Matching: Uses GPT-4o-mini to check equivalence with known answers
Fallback: Returns raw extracted text if all else fails
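
The non-LLM part of the strategy above can be sketched as follows. This is a simplified illustration under stated assumptions: find_last_boxed, normalize_numeric, and extract_answer are hypothetical names (the module's own helpers are last_boxed_only_string and remove_boxed), the LLM matching step is omitted, and only ANSWER_PATTERN is copied from the signature listing.

```python
import re
from typing import Optional

ANSWER_PATTERN = r"(?i)Answer\s*:\s*(.*)"  # as listed in the Signature section
BOXED_OPEN = "\\boxed{"


def find_last_boxed(text: str) -> Optional[str]:
    """Hypothetical helper: contents of the last boxed expression, or None."""
    start = text.rfind(BOXED_OPEN)
    if start == -1:
        return None
    content_start = start + len(BOXED_OPEN)
    depth = 1
    # Walk forward, tracking brace depth so nested LaTeX braces are kept intact.
    for i in range(content_start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None  # unbalanced braces


def normalize_numeric(answer: str) -> str:
    """Strip leading zeros from integer-like strings, e.g. '023' becomes '23'."""
    stripped = answer.strip()
    if stripped.isascii() and stripped.isdigit():
        return str(int(stripped))
    return stripped


def extract_answer(response: str) -> str:
    boxed = find_last_boxed(response)  # step 1: boxed extraction
    if boxed is not None:
        return normalize_numeric(boxed)
    matches = re.findall(ANSWER_PATTERN, response)  # step 2: "Answer: ..." pattern
    if matches:
        return normalize_numeric(matches[-1])
    return response.strip()  # final fallback: raw text


print(extract_answer("Therefore, the answer is $\\boxed{042}$."))
```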

Pass@k Metrics

cov@k (coverage): 1 if any of first k samples is correct, else 0
maj@k (majority): 1 if most common answer in first k matches ground truth
avg@k (average): Fraction of first k samples that are correct
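
The three definitions above can be computed along these lines. This is a minimal sketch, not the module's code: multi_sample_metrics is a hypothetical name, correctness is reduced to exact string equality (the module also uses LLM-based equivalence), and ties in the majority vote go to the answer seen first, which is an assumption.

```python
from collections import Counter
from typing import Dict, List, Sequence, Union


def multi_sample_metrics(
    extracted: List[str], gold: str, ks: Sequence[int] = (2, 4)
) -> Dict[str, Union[int, float]]:
    """Hypothetical helper: cov@k, maj@k and avg@k over the first k answers."""
    correct = [int(answer == gold) for answer in extracted]
    metrics: Dict[str, Union[int, float]] = {}
    for k in ks:
        if k > len(extracted):
            break  # not enough samples for this k
        # cov@k: at least one of the first k samples is correct
        metrics[f"cov@{k}"] = int(any(correct[:k]))
        # maj@k: the most common answer among the first k equals the ground truth
        majority_answer = Counter(extracted[:k]).most_common(1)[0][0]
        metrics[f"maj@{k}"] = int(majority_answer == gold)
        # avg@k: fraction of the first k samples that are correct
        metrics[f"avg@{k}"] = sum(correct[:k]) / k
    return metrics


print(multi_sample_metrics(["3", "3", "2", "3"], gold="3"))
```

Note that cov@k corresponds to the empirical "any sample correct" notion of pass@k, while avg@k is the per-sample accuracy over the same k draws.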

EXTRACTION_TEMPLATE_IDX

This template includes 10+ few-shot examples teaching GPT-4o-mini to match answers with tolerance for:

Formatting differences (spaces, LaTeX)
Unit differences (cents vs dollars)
Trivial simplifications (2/(-3) vs -2/3)
Order differences in multi-value answers
Base notation (2516_8 vs 2516)

Error Handling

Missing solutions: Prints warning and continues
Out-of-bounds indices: Logs warning and leaves answer unchanged
Non-integer indices: Logs warning and leaves answer unchanged
OpenAI API errors: Catches BadRequestError and returns empty string
Rate limits: Retries with exponential backoff, waiting 2^trial seconds between attempts
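
The backoff behavior can be illustrated with a generic retry wrapper. This is a sketch under assumptions: call_with_backoff, its parameters, the default of five trials, and returning an empty string once retries are exhausted are all hypothetical and are not taken from ChatCompletionSampler.

```python
import time
from typing import Callable, Tuple, Type


def call_with_backoff(
    request: Callable[[], str],
    max_trials: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> str:
    """Hypothetical helper: retry `request`, sleeping 2**trial seconds after each failure."""
    for trial in range(max_trials):
        try:
            return request()
        except retry_on:
            sleep(2 ** trial)  # 1s, 2s, 4s, ...
    return ""  # assumed behavior once all trials fail


# Usage: a request that fails twice before succeeding; delays are recorded, not slept.
delays = []
attempts = {"count": 0}


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


print(call_with_backoff(flaky_request, sleep=delays.append), delays)
```

Injecting the sleep function keeps the retry logic testable without real waiting.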
