Principle:Sail sg LongSpec Math Equivalence Evaluation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation, Mathematics, Symbolic_Computation |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Algorithmic principle for determining mathematical equivalence between predicted and reference answers using a multi-level comparison cascade from string matching through symbolic computation.
Description
Math Equivalence Evaluation addresses the fundamental challenge that mathematically identical expressions can have different string representations (e.g., "0.5", "1/2", "\frac{1}{2}"). The evaluation uses a three-level comparison cascade: (1) exact string match after normalization, (2) numerical comparison within a tolerance, which also handles percentage variants, and (3) symbolic equivalence via SymPy parsing (parse_latex, parse_expr) and simplification. Because SymPy computations can be expensive, the symbolic level is protected by a timeout enforced via multiprocessing. Specialized evaluation functions exist for different benchmarks: list answers (MATH), multiple-choice answers (AGIEval), symbolic equations (OCW Courses), and simple string matching (SAT).
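The timeout protection mentioned above can be illustrated with a small wrapper. This is a minimal sketch, not the project's implementation: the names `call_with_timeout` and `_worker`, the 3-second default, and the choice of the "fork" start method (which assumes a POSIX platform) are all assumptions made for illustration.

```python
import multiprocessing as mp
from queue import Empty


def _worker(result_queue, func, args):
    """Run func(*args) in the child process and send the outcome back."""
    try:
        result_queue.put(("ok", func(*args)))
    except Exception as exc:  # report failures instead of leaving the parent waiting
        result_queue.put(("error", repr(exc)))


def call_with_timeout(func, args=(), timeout=3.0, default=False):
    """Return func(*args), or `default` if it raises or exceeds `timeout` seconds."""
    ctx = mp.get_context("fork")  # assumption: fork is available (Linux, macOS)
    result_queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(result_queue, func, args))
    proc.start()
    try:
        status, value = result_queue.get(timeout=timeout)
    except Empty:  # no result arrived within the time limit
        status, value = "timeout", default
    finally:
        if proc.is_alive():
            proc.terminate()  # stop a runaway computation
        proc.join()
    return value if status == "ok" else default
```

Treating a timeout as "not equivalent" (the `default=False` choice here) keeps a single pathological expression from stalling an entire evaluation run, at the cost of occasionally marking a correct but expensive-to-verify answer as wrong.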
Usage
Apply this principle when building the correctness-checking layer of a math benchmark evaluation pipeline. It sits between the answer extraction layer and the metrics aggregation layer.
Theoretical Basis
The equivalence check follows a three-level cascade:
# Abstract algorithm (NOT real implementation)
def math_equal(prediction, reference):
    # Level 1: String equality (after normalization)
    if str(prediction) == str(reference):
        return True
    # Level 2: Numerical equality (with tolerance)
    if is_number(prediction) and is_number(reference):
        return abs(float(prediction) - float(reference)) < tolerance
    # Level 3: Symbolic equality (via SymPy)
    pred_expr = parse(prediction)  # parse_latex or parse_expr
    ref_expr = parse(reference)
    return simplify(pred_expr - ref_expr) == 0
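The abstract cascade above can be made concrete as follows. This is a hedged sketch under several assumptions: the helper names (`normalize`, `to_number`, `numeric_equal`, `symbolic_equal`), the normalization rules, the relative tolerance of 1e-4, and the treatment of percentage variants (comparing against the reference divided and multiplied by 100) are illustrative choices, not confirmed details of the original code. Only `parse_expr` is used for parsing; LaTeX input via `parse_latex` is left out. SymPy is imported optionally, so Level 3 simply returns False when it is not installed.

```python
import math

try:  # SymPy is optional in this sketch; without it Level 3 is skipped
    import sympy
    from sympy.parsing.sympy_parser import parse_expr
except ImportError:
    sympy = None


def normalize(text):
    """Assumed minimal normalization: trim whitespace and surrounding '$'."""
    return str(text).strip().strip("$").strip()


def to_number(text):
    """Parse a plain or percentage number; return None if text is not numeric."""
    text = text.replace(",", "")  # drop thousands separators
    is_percent = text.endswith("%")
    try:
        value = float(text.rstrip("%"))
    except ValueError:
        return None
    return value / 100 if is_percent else value


def numeric_equal(pred, ref, rel_tol=1e-4, include_percentage=True):
    """Level 2: compare floats, optionally accepting percentage variants of ref."""
    candidates = [ref / 100, ref, ref * 100] if include_percentage else [ref]
    return any(math.isclose(pred, c, rel_tol=rel_tol) for c in candidates)


def symbolic_equal(pred, ref):
    """Level 3: equal if the simplified difference is zero; errors count as False."""
    if sympy is None:
        return False
    try:
        difference = parse_expr(pred) - parse_expr(ref)
        return bool(sympy.simplify(difference) == 0)
    except Exception:  # unparsable input or unsupported expression
        return False


def math_equal(prediction, reference, rel_tol=1e-4):
    pred, ref = normalize(prediction), normalize(reference)
    if pred == ref:  # Level 1: string equality
        return True
    pred_num, ref_num = to_number(pred), to_number(ref)
    if pred_num is not None and ref_num is not None:
        return numeric_equal(pred_num, ref_num, rel_tol)
    return symbolic_equal(pred, ref)  # would normally run under a timeout
```

Ordering the levels from cheapest to most expensive means most answer pairs are resolved by string or float comparison, and SymPy is invoked only for the remainder.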
For OCW Courses, answers are categorized by type:
- Numeric: Unit-stripped float comparison with relative threshold
- Equation: Parse to SymPy Equality and compare
- Expression: TeX normalization (Lewkowycz et al. 2022) then symbolic comparison
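As an illustration of the Numeric case in the list above, the following sketch strips trailing units and compares the remaining floats with a relative threshold. The regular expression, the function names `strip_units` and `numeric_equal_rel`, and the 1% default threshold are assumptions chosen for the example; the actual unit handling and threshold used for OCW Courses may differ.

```python
import re

# Leading signed decimal number with an optional exponent; anything after it
# (e.g. " m/s^2") is treated as a unit and ignored.
_NUMBER = re.compile(r"^\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)")


def strip_units(answer):
    """Return the leading number of `answer` as a float, or None if there is none."""
    match = _NUMBER.match(answer)
    return float(match.group(1)) if match else None


def numeric_equal_rel(prediction, reference, rel_threshold=0.01):
    """Compare unit-stripped values using a relative error threshold."""
    pred, ref = strip_units(prediction), strip_units(reference)
    if pred is None or ref is None:
        return False
    if ref == 0:
        return pred == 0  # relative error is undefined for a zero reference
    return abs(pred - ref) / abs(ref) <= rel_threshold
```

A relative rather than absolute threshold keeps the check meaningful across answers of very different magnitudes, which is common in physics-style problems.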