Implementation:Sail sg LongSpec Math Equivalence Engine

From Leeroopedia
Revision as of 13:49, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Sail_sg_LongSpec_Math_Equivalence_Engine.md)
Knowledge Sources
Domains: NLP, Evaluation, Mathematics
Last Updated: 2026-02-14 05:00 GMT

Overview

A concrete tool that determines whether a predicted answer and a reference answer are mathematically equivalent, using numerical, symbolic, and LaTeX-based comparison.

Description

The eval_utils module provides the core math_equal function and supporting utilities for comparing mathematical expressions. It supports three levels of comparison: (1) numerical equality with a tolerance, (2) symbolic equality via SymPy parsing (parse_latex or parse_expr) followed by simplification, and (3) structural equality for matrices and tuples. The module also parses ground-truth answers for multiple datasets (MATH, GSM8K, TabMWP, BBH, etc.), normalizes predictions, and can run the symbolic comparison under a timeout by using multiprocessing.
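As a rough illustration of the first comparison level, the sketch below checks numerical equality with a tolerance and optional percentage variants. The helper name `numeric_equal_sketch`, the comma stripping, and the `abs_tol=1e-3` default are assumptions made for this example; they are not taken from eval_utils, whose actual logic may differ.

```python
from math import isclose


def numeric_equal_sketch(prediction, reference, include_percentage=True,
                         is_close=True, abs_tol=1e-3):
    """Hypothetical helper: compare two values as floats.

    Assumption: thousands separators are stripped before conversion and
    the tolerance is absolute. Neither detail is confirmed by the source.
    """
    try:
        pred = float(str(prediction).replace(",", ""))
        ref = float(str(reference).replace(",", ""))
    except ValueError:
        # At least one side is not a plain number, so this level cannot decide.
        return False

    # With percentage handling on, the reference is also tried scaled
    # by 1/100 and by 100.
    candidates = [ref / 100, ref, ref * 100] if include_percentage else [ref]

    if is_close:
        return any(isclose(c, pred, abs_tol=abs_tol) for c in candidates)
    return any(c == pred for c in candidates)
```

Under these assumptions, `numeric_equal_sketch("50", "0.5")` succeeds because 0.5 scaled by 100 equals 50, while the same call with `include_percentage=False` does not.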

Usage

Import this module when you need to check whether a model-predicted math answer is equivalent to a ground truth answer during benchmark evaluation. It is the core equivalence engine used by the evaluation callbacks.
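A minimal sketch of how an equivalence check might be used in a benchmark loop follows. The `accuracy` function and the `exact_match` stand-in are hypothetical and not part of eval_utils; in real use the comparator passed in would be math_equal.

```python
from typing import Callable, Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str],
             is_equal: Callable[[str, str], bool]) -> float:
    """Hypothetical helper: fraction of predictions judged equal to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    correct = sum(1 for p, r in zip(predictions, references) if is_equal(p, r))
    return correct / len(predictions)


def exact_match(pred: str, ref: str) -> bool:
    """Stand-in comparator so the sketch runs without the real module."""
    return pred.strip() == ref.strip()


score = accuracy(["1", "2", "3"], ["1", "2", "4"], exact_match)
```

Passing the comparator as an argument keeps the loop independent of how equivalence is decided, so a stricter or looser check can be swapped in without changing the scoring code.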

Code Reference

Source Location

Signature

from typing import Any, Dict, Tuple, Union

def math_equal(
    prediction: Union[bool, float, str],
    reference: Union[float, str],
    include_percentage: bool = True,
    is_close: bool = True,
    timeout: bool = False,
) -> bool:
    """
    Exact match of math if and only if:
    1. numerical equal: both can convert to float and are equal
    2. symbolic equal: both can convert to sympy expression and are equal
    """

def symbolic_equal(a: str, b: str) -> bool:
    """Check symbolic equality via SymPy parse_latex/parse_expr and simplify."""

def parse_ground_truth(example: Dict[str, Any], data_name: str) -> Tuple[str, str]:
    """Parse ground truth answer from dataset examples (MATH, GSM8K, TabMWP, etc.)."""

def parse_question(example: dict, data_name: str) -> str:
    """Extract question text from dataset examples."""

def normalize_prediction(prediction: str) -> str:
    """Normalize prediction string via numerical rounding or symbolic parsing."""

Import

from data.deepseek_math_utils.eval_utils import math_equal, parse_ground_truth, symbolic_equal

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| prediction | Union[bool, float, str] | Yes | Model-predicted answer |
| reference | Union[float, str] | Yes | Ground truth reference answer |
| include_percentage | bool | No | Whether to check percentage variants (default True) |
| is_close | bool | No | Whether to use approximate comparison (default True) |
| timeout | bool | No | Whether to use timeout-protected symbolic comparison (default False) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| result | bool | True if prediction matches reference |

Usage Examples

from data.deepseek_math_utils.eval_utils import math_equal

# Numerical equality
assert math_equal("3.14", "3.14")

# Percentage handling: 50 (percent) matches 0.5
assert math_equal("50", "0.5", include_percentage=True)

# Symbolic equality (LaTeX)
assert math_equal("\\frac{1}{2}", "0.5")

# Matrix comparison
assert math_equal(
    "\\begin{pmatrix} 1 & 2 \\\\ 3 & 4 \\end{pmatrix}",
    "\\begin{pmatrix} 1 & 2 \\\\ 3 & 4 \\end{pmatrix}",
)
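For the matrix case, structural equality can be pictured as an element-by-element comparison after the LaTeX is split into rows and cells. The sketch below is a simplified illustration under that assumption; `parse_pmatrix` and `matrices_equal_sketch` are hypothetical names, only the pmatrix environment is handled, and the real module may parse and compare matrices differently.

```python
import re

# Matches a single \begin{pmatrix} ... \end{pmatrix} environment.
_PMATRIX = re.compile(r"\\begin\{pmatrix\}(.*?)\\end\{pmatrix\}", re.DOTALL)


def parse_pmatrix(latex):
    """Hypothetical helper: return a list of rows of cell strings, or None."""
    match = _PMATRIX.fullmatch(latex.strip())
    if match is None:
        return None
    # Rows are separated by a double backslash, cells by '&'.
    rows = [row.strip() for row in match.group(1).split(r"\\")]
    return [[cell.strip() for cell in row.split("&")] for row in rows if row]


def matrices_equal_sketch(prediction, reference, cell_equal=lambda a, b: a == b):
    """Hypothetical helper: same shape, and every cell pair passes cell_equal."""
    pred_rows = parse_pmatrix(prediction)
    ref_rows = parse_pmatrix(reference)
    if pred_rows is None or ref_rows is None:
        return False
    if len(pred_rows) != len(ref_rows):
        return False
    return all(
        len(p_row) == len(r_row)
        and all(cell_equal(p, r) for p, r in zip(p_row, r_row))
        for p_row, r_row in zip(pred_rows, ref_rows)
    )
```

The `cell_equal` parameter defaults to plain string equality; passing a scalar equivalence check instead would let cells such as `1` and `1.0` compare as equal.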
