Principle:EvolvingLMMs Lab Lmms eval Mathematical Answer Extraction
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Mathematical Reasoning |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Mathematical answer extraction identifies and normalizes answers from verbose model outputs containing mathematical reasoning.
Description
Mathematical answer extraction addresses the challenge of locating the final answer in a lengthy model-generated reasoning chain. Models often produce extensive explanations with intermediate steps, so the actual answer is hard to find programmatically. This principle uses a multi-stage extraction strategy: first look for a LaTeX boxed answer (\boxed{...}), then pattern-match an explicit "Answer: ..." statement, then apply LLM-based semantic matching that tolerates formatting differences, and finally fall back to the raw text. The approach handles various mathematical notations (fractions, exponents, LaTeX), numeric formats (leading zeros, decimal vs. fraction), and text answers.
Usage
Apply this principle when evaluating mathematical reasoning tasks in which models generate long-form solutions, answers appear in diverse formats (LaTeX, plain text, numeric), mathematically equivalent but syntactically different answers must be accepted (e.g., "2/3" vs "0.666..."), or answers must be validated against multiple-choice options.
Theoretical Basis
Extraction Hierarchy
- LaTeX Boxed: Extract \boxed{content} as the strongest signal of a final answer
- Regex Pattern: Match "Answer: ..." patterns (case-insensitive, with whitespace tolerance)
- LLM Matching: Use a language model to check equivalence with known options
- Raw Text: Return the text as-is if no structured format is found
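The non-LLM stages of this hierarchy can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the names `extract_boxed` and `extract_answer` are hypothetical, the LLM matching stage is left out, and taking the last \boxed{...} occurrence is an assumption.

```python
import re

# Case-insensitive "Answer: ..." with optional whitespace around the colon.
ANSWER_RE = re.compile(r"answer\s*:\s*(.+)", re.IGNORECASE)
BOXED = "\\boxed{"


def extract_boxed(text):
    """Return the content of the last \\boxed{...}, or None if absent or unbalanced."""
    start = text.rfind(BOXED)
    if start == -1:
        return None
    content_start = start + len(BOXED)
    depth = 1  # we are inside the opening brace of \boxed{
    for i in range(content_start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None  # closing brace never found


def extract_answer(text):
    """Return (answer, stage), trying boxed, then regex, then raw text."""
    boxed = extract_boxed(text)
    if boxed is not None:
        return boxed.strip(), "boxed"
    matches = ANSWER_RE.findall(text)
    if matches:
        return matches[-1].strip(), "regex"  # last "Answer:" wins
    return text.strip(), "raw"
```

Returning the stage name alongside the answer is a design choice of this sketch; it makes it easy to see which signal produced a given extraction when debugging an evaluation run.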
Normalization Techniques
- Leading Zeros: "023" normalized to "23" for numeric comparisons
- Whitespace: Strip leading/trailing spaces, normalize internal spacing
- LaTeX Cleanup: Remove formatting commands while preserving content
- Unit Tolerance: Ignore unit differences (cents vs dollars, degrees vs radians)
- Simplification: Recognize algebraically equivalent forms (2/(-3) ≡ -2/3)
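A few of these normalizations can be illustrated with the Python standard library. The sketch below is an assumption-laden example, not the actual normalizer: the function names are hypothetical, only a small set of LaTeX commands is handled, only integer fractions are compared, and unit tolerance is not covered.

```python
import re
from fractions import Fraction

INT_RE = re.compile(r"[+-]?\d+")
# Integer fraction with an optionally parenthesized denominator, e.g. 2/(-3).
FRACTION_RE = re.compile(r"([+-]?\d+)\s*/\s*\(?\s*([+-]?\d+)\s*\)?")


def strip_latex(text):
    """Remove a few formatting commands while keeping their content."""
    text = re.sub(r"\\(?:text|mathrm|mathbf)\{([^{}]*)\}", r"\1", text)
    text = re.sub(r"\\[dt]?frac\{(-?\d+)\}\{(-?\d+)\}", r"\1/\2", text)
    for token in ("\\left", "\\right", "\\!", "\\,", "$"):
        text = text.replace(token, "")
    return text


def normalize(text):
    text = strip_latex(text)
    text = " ".join(text.split())  # trim ends, collapse internal whitespace
    if INT_RE.fullmatch(text):
        return str(int(text))  # drops leading zeros: "023" becomes "23"
    return text


def as_fraction(text):
    """Parse an integer or integer fraction; return None for anything else."""
    text = normalize(text)
    if INT_RE.fullmatch(text):
        return Fraction(int(text))
    m = FRACTION_RE.fullmatch(text)
    if m and int(m.group(2)) != 0:
        return Fraction(int(m.group(1)), int(m.group(2)))
    return None


def equivalent(a, b):
    fa, fb = as_fraction(a), as_fraction(b)
    if fa is not None and fb is not None:
        return fa == fb  # Fraction normalizes sign, so 2/(-3) equals -2/3
    return normalize(a) == normalize(b)
```

Anything beyond these simple cases (general algebraic simplification, unit handling) would need a symbolic library or the LLM-based stage.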
LLM-Based Matching
This stage uses a separate language model (e.g., GPT-4o-mini) with few-shot examples to determine whether an attempt matches any of the provided options. The few-shot template covers:
- Trivial simplifications (3+2x vs 2x+3)
- Formatting differences (72000 vs 72,000)
- Sign manipulation (-1 * 2/3 vs 2/(-3))
- Variable solutions (x=5 vs 5)
- Order independence ((1,-2) vs (-2,1))
- Base notation (2516_8 vs 2516)
The matcher returns the 1-based index of the matching option, or -1 if no option matches.
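The matching contract (a few-shot prompt in, a 1-based index or -1 out) can be sketched as below. Everything here is hypothetical: the prompt wording, the `FEW_SHOTS` examples, and the function names are illustrative rather than the real template, and `ask_llm` stands for any caller-supplied function that sends a prompt string to a model and returns its text reply.

```python
import re

# Illustrative (attempt, options, expected reply) triples; not the real template.
FEW_SHOTS = [
    ("3+2x", ["2x+3", "3x+2"], 1),
    ("72,000", ["7200", "72000"], 2),
    ("x=5", ["4", "5"], 2),
    ("7", ["8", "9"], -1),
]


def format_case(attempt, options):
    numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    return f"Attempt: {attempt}\nOptions:\n{numbered}\nMatch:"


def build_prompt(attempt, options):
    header = ("Reply with the 1-based number of the option that is mathematically "
              "equivalent to the attempt, or -1 if none is.")
    shots = [f"{format_case(a, o)} {reply}" for a, o, reply in FEW_SHOTS]
    return "\n\n".join([header, *shots, format_case(attempt, options)])


def parse_choice(reply, n_options):
    """Map the model's reply to a valid 1-based index, or -1."""
    m = re.search(r"-?\d+", reply)
    if not m:
        return -1
    idx = int(m.group())
    return idx if 1 <= idx <= n_options else -1


def match_option(attempt, options, ask_llm):
    if not options:
        return -1
    return parse_choice(ask_llm(build_prompt(attempt, options)), len(options))
```

Treating any out-of-range or unparseable reply as -1 keeps a misbehaving model from being scored as a match by accident.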
Error Tolerance
- Partial Answers: Extract what's available even if incomplete
- Malformed LaTeX: Handle unmatched braces by finding longest valid substring
- Multiple Answers: Take the last occurrence when multiple "Answer: ..." patterns exist
- Empty Responses: Return empty string rather than raising errors
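One way to read these tolerance rules is shown in the sketch below. It is an interpretation, not the source implementation: `longest_balanced_prefix` and `tolerant_extract` are hypothetical names, and "longest valid substring" is taken here to mean the longest brace-balanced prefix of the text following \boxed{.

```python
import re


def longest_balanced_prefix(s):
    """Longest prefix of s in which every '{' has a matching '}'."""
    depth, end = 0, 0
    for i, ch in enumerate(s):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # this brace closes something outside s
                break
        if depth == 0:
            end = i + 1
    return s[:end]


def tolerant_extract(response):
    """Extract an answer without raising, even on empty or malformed input."""
    if not response or not response.strip():
        return ""  # empty response: return empty string, do not raise
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start != -1:
        # Works for well-formed boxes and for ones missing a closing brace.
        tail = response[start + len(marker):]
        return longest_balanced_prefix(tail).strip()
    answers = re.findall(r"answer\s*:\s*(.+)", response, flags=re.IGNORECASE)
    if answers:
        return answers[-1].strip()  # last occurrence wins
    return response.strip()
```

With a well-formed box, the scan stops at the brace that closes \boxed{, so the same function serves both the normal and the malformed case.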
Integration with Evaluation
The extracted answer is compared against ground truth using:
- Exact string matching for text answers
- Numeric equality for integer answers
- LLM-based equivalence for complex mathematical expressions
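The three comparison modes can be combined into a single check, as in the following sketch. The dispatch order and the `is_correct` and `llm_equiv` names are assumptions made for illustration; `llm_equiv` stands for any caller-supplied function that takes two strings and returns a truthy value when a model judges them equivalent.

```python
import re

INT_RE = re.compile(r"[+-]?\d+")


def is_correct(extracted, truth, llm_equiv=None):
    a, b = extracted.strip(), truth.strip()
    # Integer answers: compare numerically so "023" equals "23".
    if INT_RE.fullmatch(a) and INT_RE.fullmatch(b):
        return int(a) == int(b)
    # Text answers: exact string match.
    if a == b:
        return True
    # Anything else: defer to the LLM judge if one was supplied.
    if llm_equiv is not None:
        return bool(llm_equiv(a, b))
    return False
```

For example, `is_correct("023", "23")` is decided numerically, while `is_correct("x+1", "1+x")` is false unless an `llm_equiv` judge is provided. Deciding integer pairs before calling the judge avoids a model call for the cheapest and most common case.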