Implementation:EvolvingLMMs Lab Lmms eval aime utils
| Knowledge Sources | |
|---|---|
| Domains | Mathematical Reasoning, Answer Extraction, Evaluation Metrics |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Utilities for evaluating AIME (American Invitational Mathematics Examination) tasks with answer extraction and pass@k metrics.
Description
This module provides utilities for evaluating mathematical reasoning tasks, particularly AIME problems. It includes prompt templates with configurable thinking steps or token budgets; answer extraction from model outputs via regex patterns, LaTeX \boxed{...} parsing, and LLM-based matching; multi-sample evaluation metrics (coverage@k, majority@k, average@k); and integration with GPT-4o-mini for answer equivalence checking. The code handles several answer formats, including LaTeX, numeric values, and free-text responses.
Usage
Use this module when evaluating mathematical reasoning tasks that require extracting answers from verbose model outputs, comparing answers with tolerance for formatting differences (e.g., "2/3" vs "-2/(-3)"), computing pass@k metrics from multiple samples, or handling LaTeX mathematical notation in answers.
Code Reference
Source Location
- Repository: EvolvingLMMs_Lab_Lmms_eval
- File: lmms_eval/tasks/aime/utils.py
- Lines: 1-350
Signature
# Query templates (configured via environment variables)
QUERY_TEMPLATE: str # Configured via PROMPTSTEP, PROMPTTOKEN, PROMPTLONG, PROMPTSHORT
ANSWER_PATTERN: str = r"(?i)Answer\s*:\s*(.*)"
EXTRACTION_TEMPLATE_IDX: str # Template for LLM-based answer matching
def doc_to_text(doc: dict) -> str
def process_docs(dataset: datasets.Dataset) -> datasets.Dataset
def process_results(doc: dict, results: List[str]) -> Dict[str, int]
def last_boxed_only_string(string: str) -> Optional[str]
def remove_boxed(s: str) -> str
def extract_answer_idx(sampler, options: List[str], attempt: str) -> str
class ChatCompletionSampler:
    def __init__(
        self,
        model: str = "gpt-4o-mini",
        system_message: str | None = None,
        temperature: float = 0.5,
        max_tokens: int = 1024,
    )
    def __call__(self, message_list) -> str
Import
from lmms_eval.tasks.aime.utils import (
    doc_to_text,
    process_docs,
    process_results,
    ChatCompletionSampler,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| doc | dict | Yes | Document with 'problem', 'solution', 'answer' keys |
| results | List[str] | Yes | Model-generated answers (multiple for pass@k metrics) |
| dataset | datasets.Dataset | Yes | HuggingFace dataset to process |
Outputs
| Name | Type | Description |
|---|---|---|
| processed_doc | dict | Standardized document with required fields |
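| metrics | Dict[str, int] | Evaluation metrics including exact_match, cov@k, maj@k, avg@k. The usage example below also shows list-valued entries (exact_matches, extracted_answers) and fractional avg@k values, so not every value is an int |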
| query_text | str | Formatted question text with prompt template |
Usage Examples
Basic Document Processing
from lmms_eval.tasks.aime.utils import doc_to_text, process_docs
import datasets
# Load dataset
dataset = datasets.load_dataset("path/to/aime", split="train")  # split name is an assumption; without split= a DatasetDict is returned
# Process documents
processed = process_docs(dataset)
# Convert to query text
doc = processed[0]
query = doc_to_text(doc)
print(query)
# Example output (the exact suffix depends on the PROMPT* environment variables):
# "What is 2+2?\n\nThink for up to 10 steps."
Environment-Based Prompt Configuration
# Configure thinking steps
export PROMPTSTEP=10
python evaluate.py --tasks aime
# Or configure token budget
export PROMPTTOKEN=2000
python evaluate.py --tasks aime
# Or use long/short thinking
export PROMPTLONG=1
# or
export PROMPTSHORT=1
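How the four variables map onto QUERY_TEMPLATE is not spelled out on this page. The Python sketch below shows one plausible selection scheme. The function name `build_query_template`, the precedence order, and all prompt wording other than the "Think for up to 10 steps." example are assumptions for illustration, not the module's actual strings.

```python
import os


def build_query_template(env=None):
    """Pick a prompt suffix from the PROMPT* variables (hypothetical precedence)."""
    env = os.environ if env is None else env
    base = "{Question}"  # placeholder filled in later with str.format
    if env.get("PROMPTSTEP"):
        return base + f"\n\nThink for up to {env['PROMPTSTEP']} steps."
    if env.get("PROMPTTOKEN"):
        # Wording is an assumption; only the variable name comes from the page.
        return base + f"\n\nThink for up to {env['PROMPTTOKEN']} tokens."
    if env.get("PROMPTLONG"):
        return base + "\n\nThink for a long time."  # hypothetical wording
    if env.get("PROMPTSHORT"):
        return base + "\n\nThink briefly."  # hypothetical wording
    return base  # no variable set: plain question


template = build_query_template({"PROMPTSTEP": "10"})
print(template.format(Question="What is 2+2?"))
```

Passing the environment as a dictionary keeps the sketch testable without touching the real process environment.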
Evaluating Multiple Samples (Pass@k)
from lmms_eval.tasks.aime.utils import process_results
import os
# Set up GPT-4o-mini processor for answer extraction
os.environ["PROCESSOR"] = "gpt-4o-mini"
os.environ["OPENAI_API_KEY"] = "your-key"
doc = {
    "problem": "Solve for x: 2x + 4 = 10",
    "answer": "3",
    "solution": "Subtract 4: 2x = 6, divide by 2: x = 3"
}
# Multiple model responses
results = [
    "The answer is x=3",
    "I think x equals 2.9999...",
    "x = 3.0",
    "Hmm, maybe x=2?",
    # ... 60 more samples for pass@64
]
metrics = process_results(doc, results)
print(metrics)
# Output:
# {
#     'exact_match': 1,  # First result correct
#     'exact_matches': [1, 0, 1, 0, ...],
#     'cov@2': 1,  # At least one correct in first 2
#     'maj@2': 1,  # Majority vote correct
#     'avg@2': 0.5,  # 50% correct
#     'cov@4': 1,
#     'maj@4': 1,
#     'avg@4': 0.5,
#     # ... up to @64
#     'extracted_answers': ['3', '2.9999...', '3.0', '2', ...]
# }
LaTeX Answer Extraction
from lmms_eval.tasks.aime.utils import last_boxed_only_string, remove_boxed
response = r"""
Let me solve this step by step:
... lots of work ...
Therefore, the answer is $\boxed{42}$.
"""
boxed = last_boxed_only_string(response)
print(boxed)  # Output: \boxed{42}
answer = remove_boxed(boxed)
print(answer) # Output: "42"
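The internals of these two helpers are not shown on this page. A common way to extract the last boxed expression is to locate the final `\boxed{` and then scan forward while counting braces, so that nested LaTeX such as `\frac{2}{3}` stays intact. The sketch below illustrates that idea. The names `find_last_boxed` and `strip_boxed` are hypothetical stand-ins, and the module's own functions may behave differently on edge cases.

```python
from typing import Optional


def find_last_boxed(text: str) -> Optional[str]:
    """Return the last \\boxed{...} span in text, or None (illustrative only)."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:  # matching close of the \boxed{ brace
                return text[start : i + 1]
    return None  # braces never balanced


def strip_boxed(boxed: str) -> str:
    """Remove the \\boxed{ ... } wrapper and return the inner expression."""
    prefix = "\\boxed{"
    if not (boxed.startswith(prefix) and boxed.endswith("}")):
        raise ValueError("expected a string of the form \\boxed{...}")
    return boxed[len(prefix) : -1]


sample = r"First $\boxed{1}$, but finally $\boxed{\frac{2}{3}}$."
print(strip_boxed(find_last_boxed(sample)))
```

A regex without brace counting would stop at the first closing brace, which is why the sketch tracks nesting depth explicitly.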
LLM-Based Answer Matching
from lmms_eval.tasks.aime.utils import ChatCompletionSampler, extract_answer_idx
sampler = ChatCompletionSampler(model="gpt-4o-mini")
options = ["2/3", "3/2", "1/2"]
attempt = "The answer is -2 * (-1/3)"
# Returns the 1-based index of the matching option (as a string, per the signature), or -1 if no option matches
idx = extract_answer_idx(sampler, options, attempt)
print(idx) # Output: "1" (matches first option after simplification)
Implementation Details
Answer Extraction Strategy
- Boxed Extraction: First tries to find \boxed{...} in the response
- Pattern Matching: Falls back to regex pattern matching "Answer: ..."
- Normalization: Normalizes numeric strings, for example by dropping leading zeros ("023" → "23")
- LLM Matching: Uses GPT-4o-mini to check equivalence with known answers
- Fallback: Returns raw extracted text if all else fails
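The non-LLM part of this strategy can be sketched in a few lines of Python. This is a simplified illustration, not the module's code: `extract_answer_sketch` is a hypothetical name, the boxed step uses a plain regex that ignores nested braces, and the GPT-4o-mini matching step is left out because it needs an API call. ANSWER_PATTERN is copied from the Signature section.

```python
import re

ANSWER_PATTERN = r"(?i)Answer\s*:\s*(.*)"  # quoted from the Signature section


def extract_answer_sketch(response: str) -> str:
    """Boxed answer first, then 'Answer: ...', then the raw text as a fallback."""
    # Step 1: prefer the last \boxed{...} (simplified: no nested braces).
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        candidate = boxed[-1]
    else:
        # Step 2: fall back to the "Answer: ..." pattern.
        match = re.search(ANSWER_PATTERN, response)
        # Last resort: keep the raw response text.
        candidate = match.group(1) if match else response
    candidate = candidate.strip()
    # Step 3: normalize purely numeric strings, e.g. "023" -> "23".
    if candidate.isdecimal():
        candidate = str(int(candidate))
    return candidate


print(extract_answer_sketch(r"Therefore the result is $\boxed{023}$."))
print(extract_answer_sketch("Lots of work...\nanswer: 42"))
```

In the real pipeline the candidate would then go through the LLM matching step described above before being compared with the ground truth.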
Pass@k Metrics
- cov@k (coverage): 1 if any of first k samples is correct, else 0
- maj@k (majority): 1 if the most common answer among the first k samples matches the ground truth, else 0
- avg@k (average): Fraction of first k samples that are correct
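The three definitions above translate directly into code. The sketch below computes them from a list of already-extracted answers. It is an illustration under simplifying assumptions: `multi_sample_metrics` is a hypothetical name, correctness is decided by plain string equality rather than the LLM-based equivalence check, and ties in the majority vote go to the answer seen first.

```python
from collections import Counter
from typing import Dict, List, Sequence


def multi_sample_metrics(
    answers: List[str], truth: str, ks: Sequence[int] = (2, 4)
) -> Dict[str, float]:
    """Compute cov@k, maj@k and avg@k over the first k extracted answers."""
    metrics: Dict[str, float] = {}
    for k in ks:
        if k > len(answers):
            break  # not enough samples for this k
        first_k = answers[:k]
        correct = [int(a == truth) for a in first_k]
        # most_common(1) returns the first-seen answer when counts tie.
        majority, _ = Counter(first_k).most_common(1)[0]
        metrics[f"cov@{k}"] = int(any(correct))
        metrics[f"maj@{k}"] = int(majority == truth)
        metrics[f"avg@{k}"] = sum(correct) / k
    return metrics


print(multi_sample_metrics(["3", "3", "2", "5"], truth="3"))
```

Note that cov@k can be 1 while maj@k is 0: a single correct sample is enough for coverage, but the majority vote requires the correct answer to be the most frequent one.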
EXTRACTION_TEMPLATE_IDX
This template includes 10+ few-shot examples teaching GPT-4o-mini to match answers with tolerance for:
- Formatting differences (spaces, LaTeX)
- Unit differences (cents vs dollars)
- Trivial simplifications (2/(-3) vs -2/3)
- Order differences in multi-value answers
- Base notation (2516_8 vs 2516)
Error Handling
- Missing solutions: Prints warning and continues
- Out-of-bounds indices: Logs warning and leaves answer unchanged
- Non-integer indices: Logs warning and leaves answer unchanged
- OpenAI API errors: Catches BadRequestError and returns empty string
- Rate limits: Exponential backoff, waiting 2^trial seconds before each retry
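The retry behaviour described in the last two bullets can be sketched generically. The helper below is an assumption-laden illustration, not the sampler's actual code: `call_with_backoff`, `max_trials`, `retry_on`, and `give_up_on` are hypothetical names, and ordinary Python exceptions stand in for the OpenAI client's error types.

```python
import time
from typing import Callable, Tuple, Type


def call_with_backoff(
    fn: Callable[[], str],
    max_trials: int = 5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    give_up_on: Tuple[Type[BaseException], ...] = (),
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fn, retrying with 2**trial second delays (max_trials >= 1 assumed)."""
    for trial in range(max_trials):
        try:
            return fn()
        except give_up_on:
            return ""  # mirrors the documented "return empty string" behaviour
        except retry_on:
            if trial == max_trials - 1:
                raise  # out of retries: surface the last error
            sleep(2 ** trial)  # waits 1, 2, 4, ... seconds
    return ""


# Demo with a fake sleep so the example runs instantly.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"


delays = []
print(call_with_backoff(flaky, sleep=delays.append), delays)
```

Injecting `sleep` as a parameter is a design choice for testability; production code would normally leave it at `time.sleep`.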