Implementation:EvolvingLMMs Lab Lmms eval aime utils

From Leeroopedia
Revision as of 12:32, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/EvolvingLMMs_Lab_Lmms_eval_aime_utils.md)
Knowledge Sources
Domains: Mathematical Reasoning, Answer Extraction, Evaluation Metrics
Last Updated: 2026-02-14 00:00 GMT

Overview

Utilities for evaluating AIME (American Invitational Mathematics Examination) tasks with answer extraction and pass@k metrics.

Description

This module provides utilities for evaluating mathematical reasoning tasks, particularly AIME problems. It includes prompt templates with a configurable thinking budget (steps or tokens), answer extraction from model outputs via regex patterns and LLM-based matching, handling of LaTeX \boxed{...} answers, multi-sample evaluation metrics (coverage@k, majority@k, average@k), and integration with GPT-4o-mini for answer-equivalence checking. The code handles answers written as LaTeX, numeric values, or free text.

Usage

Use this module when evaluating mathematical reasoning tasks that require extracting answers from verbose model outputs, comparing answers with tolerance for formatting differences (e.g., "2/3" vs "-2/(-3)"), computing pass@k metrics from multiple samples, or handling LaTeX mathematical notation in answers.

Code Reference

Source Location

Signature

# Query templates (configured via environment variables)
QUERY_TEMPLATE: str  # Configured via PROMPTSTEP, PROMPTTOKEN, PROMPTLONG, PROMPTSHORT

ANSWER_PATTERN: str = r"(?i)Answer\s*:\s*(.*)"

EXTRACTION_TEMPLATE_IDX: str  # Template for LLM-based answer matching

def doc_to_text(doc: dict) -> str

def process_docs(dataset: datasets.Dataset) -> datasets.Dataset

def process_results(doc: dict, results: List[str]) -> Dict[str, int]

def last_boxed_only_string(string: str) -> Optional[str]

def remove_boxed(s: str) -> str

def extract_answer_idx(sampler, options: List[str], attempt: str) -> str

class ChatCompletionSampler:
    def __init__(
        self,
        model: str = "gpt-4o-mini",
        system_message: str | None = None,
        temperature: float = 0.5,
        max_tokens: int = 1024,
    )

    def __call__(self, message_list) -> str

Import

from lmms_eval.tasks.aime.utils import (
    doc_to_text,
    process_docs,
    process_results,
    ChatCompletionSampler
)

I/O Contract

Inputs

Name     Type              Required  Description
doc      dict              Yes       Document with 'problem', 'solution', 'answer' keys
results  List[str]         Yes       Model-generated answers (multiple for pass@k metrics)
dataset  datasets.Dataset  Yes       HuggingFace dataset to process

Outputs

Name           Type            Description
processed_doc  dict            Standardized document with required fields
metrics        Dict[str, int]  Evaluation metrics including exact_match, cov@k, maj@k, avg@k (in the example output on this page, avg@k values are floats and exact_matches / extracted_answers are lists)
query_text     str             Formatted question text with prompt template

Usage Examples

Basic Document Processing

from lmms_eval.tasks.aime.utils import doc_to_text, process_docs
import datasets

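# Load a single split so that a datasets.Dataset (not a DatasetDict) is returned.
# The split name "train" is an assumption; use whichever split the dataset provides.
dataset = datasets.load_dataset("path/to/aime", split="train")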

# Process documents
processed = process_docs(dataset)

# Convert to query text
doc = processed[0]
query = doc_to_text(doc)
print(query)
# Example output (with PROMPTSTEP=10): "What is 2+2?\n\nThink for up to 10 steps."

Environment-Based Prompt Configuration

# Configure thinking steps
export PROMPTSTEP=10
python evaluate.py --tasks aime

# Or configure token budget
export PROMPTTOKEN=2000
python evaluate.py --tasks aime

# Or use long/short thinking
export PROMPTLONG=1
# or
export PROMPTSHORT=1
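
The following is a minimal sketch of how a prompt suffix could be selected from these environment variables. The function name build_query, the precedence order, and the suffix wording for PROMPTTOKEN, PROMPTLONG and PROMPTSHORT are assumptions made for illustration; only the "Think for up to 10 steps." wording appears in the example on this page, and the module's actual QUERY_TEMPLATE may differ.

```python
import os
from typing import Mapping, Optional


def build_query(question: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: append a thinking-budget hint chosen from env vars."""
    env = os.environ if env is None else env
    # Precedence (step > token > long > short) is an assumption, not taken from the module.
    if "PROMPTSTEP" in env:
        suffix = f"Think for up to {env['PROMPTSTEP']} steps."
    elif "PROMPTTOKEN" in env:
        suffix = f"Think for up to {env['PROMPTTOKEN']} tokens."  # assumed wording
    elif "PROMPTLONG" in env:
        suffix = "Answer after a long amount of thinking."  # assumed wording
    elif "PROMPTSHORT" in env:
        suffix = "Answer after a short amount of thinking."  # assumed wording
    else:
        return question  # no budget configured: leave the question unchanged
    return f"{question}\n\n{suffix}"


print(build_query("What is 2+2?", {"PROMPTSTEP": "10"}))
```

Passing the environment as an argument keeps the sketch testable without mutating os.environ.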

Evaluating Multiple Samples (Pass@k)

from lmms_eval.tasks.aime.utils import process_results
import os

# Set up GPT-4o-mini processor for answer extraction
os.environ["PROCESSOR"] = "gpt-4o-mini"
os.environ["OPENAI_API_KEY"] = "your-key"

doc = {
    "problem": "Solve for x: 2x + 4 = 10",
    "answer": "3",
    "solution": "Subtract 4: 2x = 6, divide by 2: x = 3"
}

# Multiple model responses
results = [
    "The answer is x=3",
    "I think x equals 2.9999...",
    "x = 3.0",
    "Hmm, maybe x=2?",
    # ... 60 more samples for pass@64
]

metrics = process_results(doc, results)
print(metrics)
# Illustrative output (actual values depend on the extraction model's equivalence judgments):
# {
#     'exact_match': 1,  # First result correct
#     'exact_matches': [1, 0, 1, 0, ...],
#     'cov@2': 1,  # At least one correct in first 2
#     'maj@2': 1,  # Majority vote correct
#     'avg@2': 0.5,  # 50% correct
#     'cov@4': 1,
#     'maj@4': 1,
#     'avg@4': 0.5,
#     # ... up to @64
#     'extracted_answers': ['3', '2.9999...', '3.0', '2', ...]
# }

LaTeX Answer Extraction

from lmms_eval.tasks.aime.utils import last_boxed_only_string, remove_boxed

response = r"""
Let me solve this step by step:
... lots of work ...
Therefore, the answer is $\boxed{42}$.
"""

boxed = last_boxed_only_string(response)
print(boxed)  # Output: \boxed{42}

answer = remove_boxed(boxed)
print(answer)  # Output: "42"

LLM-Based Answer Matching

from lmms_eval.tasks.aime.utils import ChatCompletionSampler, extract_answer_idx

sampler = ChatCompletionSampler(model="gpt-4o-mini")

options = ["2/3", "3/2", "1/2"]
attempt = "The answer is -2 * (-1/3)"

# Returns the 1-based option index as a string, or -1 if no option matches
idx = extract_answer_idx(sampler, options, attempt)
print(idx)  # Output: 1 (matches the first option after simplification)
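
The Error Handling section notes that out-of-bounds and non-integer indices are logged and leave the answer unchanged. A minimal sketch of that validation step, assuming a hypothetical helper named resolve_option (not part of the module) that maps the returned index text back to an option, could look like this:

```python
import logging
from typing import List

logger = logging.getLogger(__name__)


def resolve_option(idx_text: str, options: List[str], current: str) -> str:
    """Hypothetical helper: map a 1-based index string to an option, else keep `current`."""
    try:
        idx = int(idx_text.strip())
    except ValueError:
        logger.warning("Non-integer index %r; leaving answer unchanged", idx_text)
        return current
    if idx == -1:
        return current  # the matcher reported that no option is equivalent
    if not 1 <= idx <= len(options):
        logger.warning("Index %d out of bounds; leaving answer unchanged", idx)
        return current
    return options[idx - 1]  # convert the 1-based index to a 0-based list position


print(resolve_option("1", ["2/3", "3/2", "1/2"], "-2 * (-1/3)"))
```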

Implementation Details

Answer Extraction Strategy

  1. Boxed Extraction: First tries to find \boxed{...} in the response
  2. Pattern Matching: Falls back to regex pattern matching "Answer: ..."
  3. Normalization: Converts numeric strings (e.g., "023" → "23")
  4. LLM Matching: Uses GPT-4o-mini to check equivalence with known answers
  5. Fallback: Returns raw extracted text if all else fails
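
Steps 1, 2, 3 and 5 above can be sketched without any API calls, as shown below. The helper names find_last_boxed and extract_answer are hypothetical and are not the module's own last_boxed_only_string / remove_boxed; the brace-matching approach, the use of the first "Answer:" match, and returning the stripped response as the fallback are assumptions. Step 4 (LLM matching) is omitted.

```python
import re
from typing import Optional

ANSWER_PATTERN = r"(?i)Answer\s*:\s*(.*)"  # pattern as listed in the Signature section


def find_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, honoring nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    content_start = start + len("\\boxed{")
    depth = 0
    # Begin at the opening brace so that depth becomes 1 on the first character.
    for i in range(content_start - 1, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None  # unbalanced braces


def extract_answer(response: str) -> str:
    boxed = find_last_boxed(response)  # step 1: boxed extraction
    if boxed is not None:
        candidate = boxed
    else:
        match = re.search(ANSWER_PATTERN, response)  # step 2: "Answer: ..." pattern
        candidate = match.group(1) if match else response  # step 5: fallback
    candidate = candidate.strip()
    if re.fullmatch(r"[0-9]+", candidate):  # step 3: "023" -> "23"
        candidate = str(int(candidate))
    return candidate


print(extract_answer(r"Therefore, the answer is $\boxed{023}$."))
```

Because "." does not match a newline by default, the pattern captures only the remainder of the line that contains "Answer:".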

Pass@k Metrics

  • cov@k (coverage): 1 if any of first k samples is correct, else 0
  • maj@k (majority): 1 if most common answer in first k matches ground truth
  • avg@k (average): Fraction of first k samples that are correct
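
The three definitions above can be sketched as follows. The function name sample_metrics, the use of plain string equality for correctness, the powers-of-two default for k, and skipping any k larger than the number of samples are assumptions; the module itself may decide correctness through LLM-based matching instead.

```python
from collections import Counter
from typing import Dict, List, Sequence


def sample_metrics(
    extracted: List[str], gold: str, ks: Sequence[int] = (2, 4, 8, 16, 32, 64)
) -> Dict[str, float]:
    """Hypothetical helper computing cov@k, maj@k and avg@k over the first k samples."""
    matches = [int(answer == gold) for answer in extracted]
    metrics: Dict[str, float] = {}
    for k in ks:
        if k > len(extracted):
            continue  # not enough samples for this k
        first_k = extracted[:k]
        # cov@k: 1 if any of the first k samples is correct
        metrics[f"cov@{k}"] = int(any(matches[:k]))
        # maj@k: 1 if the most common answer equals the ground truth
        # (Counter breaks ties in favor of the answer seen first)
        most_common_answer = Counter(first_k).most_common(1)[0][0]
        metrics[f"maj@{k}"] = int(most_common_answer == gold)
        # avg@k: fraction of the first k samples that are correct
        metrics[f"avg@{k}"] = sum(matches[:k]) / k
    return metrics


print(sample_metrics(["3", "2", "3", "3"], gold="3"))
```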

EXTRACTION_TEMPLATE_IDX

This template includes 10+ few-shot examples teaching GPT-4o-mini to match answers with tolerance for:

  • Formatting differences (spaces, LaTeX)
  • Unit differences (cents vs dollars)
  • Trivial simplifications (2/(-3) vs -2/3)
  • Order differences in multi-value answers
  • Base notation (2516_8 vs 2516)

Error Handling

  • Missing solutions: Prints warning and continues
  • Out-of-bounds indices: Logs warning and leaves answer unchanged
  • Non-integer indices: Logs warning and leaves answer unchanged
  • OpenAI API errors: Catches BadRequestError and returns empty string
  • Rate limits: Exponential backoff with 2^trial second delays
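
The last two behaviors can be sketched as a retry loop with 2^trial second delays. The names call_with_backoff and BadRequest are hypothetical: BadRequest is a local stand-in for a non-retryable error such as openai.BadRequestError, so that the sketch runs without the OpenAI client. Starting the trial counter at 0, the max_trials limit, and re-raising after the final attempt are assumptions.

```python
import time
from typing import Callable


class BadRequest(Exception):
    """Stand-in for a non-retryable API error (e.g. openai.BadRequestError)."""


def call_with_backoff(
    fn: Callable[[], str],
    max_trials: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Hypothetical helper: retry `fn`, waiting 2**trial seconds between attempts."""
    for trial in range(max_trials):
        try:
            return fn()
        except BadRequest:
            return ""  # malformed request: retrying will not help
        except Exception:
            if trial == max_trials - 1:
                raise  # out of attempts (assumed behavior)
            sleep(2 ** trial)  # 1 s, 2 s, 4 s, ...
    raise RuntimeError("max_trials must be at least 1")
```

Injecting the sleep function lets the delays be recorded in tests instead of actually waiting.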

Related Pages
