Principle:Sail sg LongSpec Math Equivalence Evaluation

Knowledge Sources	DeepSeek Math, Minerva
Domains	NLP, Evaluation, Mathematics, Symbolic_Computation
Last Updated	2026-02-14 05:00 GMT

Overview

Algorithmic principle for determining mathematical equivalence between predicted and reference answers using a multi-level comparison cascade from string matching through symbolic computation.

Description

Math Equivalence Evaluation addresses the fundamental challenge that mathematically identical expressions can have different string representations (e.g., "0.5", "1/2", "\frac{1}{2}"). The evaluation uses a three-level comparison cascade: (1) exact string match after normalization, (2) numerical comparison with tolerance (handling percentage variants), and (3) symbolic equivalence via SymPy parsing (parse_latex, parse_expr) and simplification. The symbolic level includes timeout protection via multiprocessing to handle expensive SymPy computations. Specialized evaluation functions exist for different benchmarks, handling list answers (MATH), multiple-choice (AGIEval), symbolic equations (OCW Courses), and simple string match (SAT).
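
The numerical level with percentage variants can be illustrated with a minimal sketch. The helper names (`to_float`, `numeric_equal`), the `rel_tol` default, and the choice to accept the reference scaled by 1/100 and by 100 are assumptions made for illustration, not a description of the actual implementation.

```python
from math import isclose


def to_float(text):
    """Parse a numeric string, accepting thousands separators and a trailing '%'.

    Returns None when the text is not a plain number.
    """
    cleaned = str(text).strip().replace(",", "")
    is_percent = cleaned.endswith("%")
    if is_percent:
        cleaned = cleaned[:-1]
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return value / 100 if is_percent else value


def numeric_equal(prediction, reference, rel_tol=1e-4, include_percentage=True):
    """Compare two numeric answers within a relative tolerance.

    With include_percentage=True the reference is also tried as ref/100 and
    ref*100, so "50" and "0.5" are accepted as the same percentage answer.
    (Assumed behavior for this sketch.)
    """
    pred, ref = to_float(prediction), to_float(reference)
    if pred is None or ref is None:
        return False
    candidates = [ref / 100, ref, ref * 100] if include_percentage else [ref]
    return any(isclose(pred, c, rel_tol=rel_tol) for c in candidates)
```

Accepting percentage variants trades strictness for recall: it avoids penalizing "50" against "0.5", but it can also accept answers that are wrong by a factor of 100, so the flag is worth disabling for benchmarks where scale matters.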

Usage

Apply this principle when building the correctness-checking layer of a math benchmark evaluation pipeline. It sits between the answer extraction layer and the metrics aggregation layer.
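
As a rough illustration of where the correctness check sits, the sketch below wires three layers together. All names (`extract_answer`, `is_correct`, `accuracy`) and the "Answer:" marker convention are hypothetical; the placeholder checker uses exact matching only and would be replaced by the full equivalence cascade.

```python
def extract_answer(completion):
    """Hypothetical extraction layer: take the text after the last 'Answer:' marker."""
    _, found, tail = completion.rpartition("Answer:")
    return tail.strip() if found else completion.strip()


def is_correct(prediction, reference):
    """Placeholder correctness layer: exact match after whitespace stripping."""
    return prediction.strip() == reference.strip()


def accuracy(completions, references, checker=is_correct):
    """Aggregation layer: fraction of completions whose extracted answer passes the checker."""
    if not references:
        return 0.0
    hits = sum(
        checker(extract_answer(completion), reference)
        for completion, reference in zip(completions, references)
    )
    return hits / len(references)
```

Passing the checker as a parameter keeps extraction and aggregation unchanged when a benchmark-specific equivalence function is swapped in.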

Theoretical Basis

The equivalence check follows a three-level cascade:

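# Abstract algorithm (simplified sketch, NOT the real implementation)
from sympy import simplify
from sympy.parsing.sympy_parser import parse_expr

def is_number(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def math_equal(prediction, reference, tolerance=1e-4):  # tolerance value is illustrative
    # Level 1: String equality (after normalization)
    if str(prediction) == str(reference):
        return True

    # Level 2: Numerical equality (with tolerance)
    if is_number(prediction) and is_number(reference):
        return abs(float(prediction) - float(reference)) < tolerance

    # Level 3: Symbolic equality (via SymPy); unparseable input counts as a mismatch
    try:
        pred_expr = parse_expr(str(prediction))  # parse_latex for LaTeX input
        ref_expr = parse_expr(str(reference))
        return simplify(pred_expr - ref_expr) == 0
    except Exception:
        return False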
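
The Description notes that the symbolic level is protected by a timeout via multiprocessing. One possible way to do this is sketched below using only the standard library. The names `call_with_timeout`, `_worker`, and `slow_check` are hypothetical, the 3-second default is arbitrary, and the "fork" start method assumes a POSIX system.

```python
import multiprocessing as mp
import queue
import time


def _worker(results, func, args):
    """Run func in the child process and report the outcome through the queue."""
    try:
        results.put(("ok", func(*args)))
    except Exception as exc:  # a crash in the check is treated as "not equal"
        results.put(("error", repr(exc)))


def call_with_timeout(func, *args, timeout=3.0, default=False):
    """Return func(*args), or `default` if it fails or exceeds `timeout` seconds."""
    ctx = mp.get_context("fork")  # assumption: POSIX; "fork" is unavailable on Windows
    results = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(results, func, args))
    proc.start()
    try:
        status, value = results.get(timeout=timeout)
    except queue.Empty:
        status, value = "timeout", default
    finally:
        if proc.is_alive():
            proc.terminate()  # stop a runaway computation
        proc.join()
    return value if status == "ok" else default


def slow_check(delay):
    """Stand-in for an expensive symbolic comparison."""
    time.sleep(delay)
    return True
```

A call such as `call_with_timeout(math_equal, prediction, reference)` would then treat a comparison that hangs as a mismatch. A separate process is used rather than a thread because a thread stuck inside a long simplification cannot be forcibly stopped.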

For OCW Courses, answers are categorized by type:

Numeric: Unit-stripped float comparison with relative threshold
Equation: Parse to SymPy Equality and compare
Expression: TeX normalization (Lewkowycz et al. 2022) then symbolic comparison
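
The Numeric case can be sketched as follows. This is a minimal illustration, not the benchmark's code: the regular expression, the helper names, and the 1% default threshold are assumptions, and the sketch only handles answers where the number comes first and the unit follows.

```python
import re
from math import isclose

# Leading signed decimal number with an optional exponent, e.g. "-3.2e-4"
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def strip_units(answer):
    """Return the leading numeric value of an answer such as '3.2 m/s', or None."""
    match = _NUMBER.match(str(answer).strip().replace(",", ""))
    return float(match.group()) if match else None


def numeric_equal_with_units(prediction, reference, rel_threshold=0.01):
    """Compare unit-stripped values using a relative threshold (assumed 1% here)."""
    pred, ref = strip_units(prediction), strip_units(reference)
    if pred is None or ref is None:
        return False
    if ref == 0:
        # Relative error is undefined at zero; fall back to an absolute check.
        return isclose(pred, 0.0, abs_tol=rel_threshold)
    return abs(pred - ref) / abs(ref) <= rel_threshold
```

Note that stripping units this way ignores unit mismatches ("3 m" and "3 cm" compare equal), which is acceptable only when the expected unit is fixed by the question.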
