Implementation:Sail sg LongSpec Math Equivalence Engine

Knowledge Sources	Sail_sg_LongSpec DeepSeek Math
Domains	NLP, Evaluation, Mathematics
Last Updated	2026-02-14 05:00 GMT

Overview

Concrete tool for determining mathematical equivalence between predicted and reference answers using numerical, symbolic, and LaTeX-based comparison.

Description

The eval_utils module provides the core math_equal function and supporting utilities for comparing mathematical expressions. It supports three levels of comparison: (1) numerical equality with tolerance, (2) symbolic equality via SymPy parsing and simplification, and (3) structural equality for matrices and tuples. It also includes ground truth parsing for multiple datasets (MATH, GSM8K, TabMWP, BBH, etc.), prediction normalization, and timeout-protected symbolic comparison using multiprocessing.
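The first comparison level, numerical equality with tolerance and optional percentage variants, can be sketched as follows. This is an illustration, not the module's code: the helper names `to_float` and `numeric_equal` and the `rel_tol` value are assumptions chosen for the example.

```python
from math import isclose


def to_float(text):
    """Parse a numeric string, ignoring thousands separators; None if not numeric."""
    try:
        return float(str(text).replace(",", ""))
    except ValueError:
        return None


def numeric_equal(prediction, reference, include_percentage=True,
                  is_close=True, rel_tol=1e-4):
    """Hypothetical stand-in for the numerical branch of an equivalence check."""
    pred, ref = to_float(prediction), to_float(reference)
    if pred is None or ref is None:
        return False  # not both numeric; a symbolic check would take over here
    # With include_percentage, also accept the reference scaled by 1/100 and 100.
    candidates = [ref / 100, ref, ref * 100] if include_percentage else [ref]
    for candidate in candidates:
        if is_close:
            if isclose(candidate, pred, rel_tol=rel_tol):  # tolerance is an assumption
                return True
        elif candidate == pred:
            return True
    return False
```

With percentage variants enabled, a prediction of "50" matches a reference of "0.5"; with them disabled, it does not.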

Usage

Import this module when you need to check whether a model-predicted math answer is equivalent to a ground truth answer during benchmark evaluation. It is the core equivalence engine used by the evaluation callbacks.
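A benchmark loop built around such an equivalence check might look like the sketch below. The `accuracy` helper and the `exact_match` placeholder comparator are hypothetical names introduced here so the example runs without the repository; in real use, `math_equal` would be passed as the comparator.

```python
from typing import Callable, Iterable, Tuple


def accuracy(pairs: Iterable[Tuple[str, str]],
             is_equal: Callable[[str, str], bool]) -> float:
    """Fraction of (prediction, reference) pairs judged equivalent."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for pred, ref in pairs if is_equal(pred, ref))
    return correct / len(pairs)


def exact_match(pred: str, ref: str) -> bool:
    """Placeholder comparator (string match only); swap in math_equal in practice."""
    return pred.strip() == ref.strip()


pairs = [("42", "42"), ("0.5", "1/2"), (" 7 ", "7")]
score = accuracy(pairs, exact_match)
```

The second pair illustrates why plain string matching is not enough: "0.5" and "1/2" are counted as a mismatch here, which is the kind of case a numerical or symbolic comparison is meant to catch.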

Code Reference

Source Location

Repository: Sail_sg_LongSpec
File: longspec/train/data/deepseek_math_utils/eval_utils.py
Lines: 1-331

Signature

def math_equal(
    prediction: Union[bool, float, str],
    reference: Union[float, str],
    include_percentage: bool = True,
    is_close: bool = True,
    timeout: bool = False,
) -> bool:
    """
    Exact match of math if and only if:
    1. numerical equal: both can convert to float and are equal
    2. symbolic equal: both can convert to sympy expression and are equal
    """

def symbolic_equal(a: str, b: str) -> bool:
    """Check symbolic equality via SymPy parse_latex/parse_expr and simplify."""

def parse_ground_truth(example: Dict[str, Any], data_name: str) -> Tuple[str, str]:
    """Parse ground truth answer from dataset examples (MATH, GSM8K, TabMWP, etc.)."""

def parse_question(example: dict, data_name: str) -> str:
    """Extract question text from dataset examples."""

def normalize_prediction(prediction: str) -> str:
    """Normalize prediction string via numerical rounding or symbolic parsing."""

Import

# Assumes longspec/train is the working directory or is on PYTHONPATH
from data.deepseek_math_utils.eval_utils import math_equal, parse_ground_truth, symbolic_equal

I/O Contract

Inputs

Name	Type	Required	Description
prediction	Union[bool, float, str]	Yes	Model-predicted answer
reference	Union[float, str]	Yes	Ground truth reference answer
include_percentage	bool	No	Whether to check percentage variants (default True)
is_close	bool	No	Whether to use approximate comparison (default True)
timeout	bool	No	Whether to use timeout-protected symbolic comparison (default False)

Outputs

Name	Type	Description
result	bool	True if the prediction is judged equivalent to the reference, False otherwise

Usage Examples

from data.deepseek_math_utils.eval_utils import math_equal

# Numerical equality
assert math_equal("3.14", "3.14")

# Percentage handling: 50 is accepted as the percent form of 0.5
assert math_equal("50", "0.5", include_percentage=True)

# Symbolic equality (LaTeX fraction vs. decimal)
assert math_equal("\\frac{1}{2}", "0.5")

# Matrix comparison
assert math_equal(
    "\\begin{pmatrix} 1 & 2 \\\\ 3 & 4 \\end{pmatrix}",
    "\\begin{pmatrix} 1 & 2 \\\\ 3 & 4 \\end{pmatrix}",
)
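Structural comparison of matrices, as in the last example above, amounts to checking that both sides have the same shape and that corresponding cells are equal. The following is a simplified sketch of that idea for pmatrix strings only. The names `split_pmatrix`, `cell_equal`, and `matrix_equal` and the tolerance are assumptions, and a full implementation would compare cells with the complete equivalence check instead of the simple numeric fallback used here.

```python
import re
from math import isclose


def split_pmatrix(text):
    """Split a LaTeX pmatrix into rows of cell strings; None if not a pmatrix."""
    match = re.fullmatch(
        r"\s*\\begin\{pmatrix\}(.*)\\end\{pmatrix\}\s*", text, flags=re.DOTALL
    )
    if match is None:
        return None
    rows = [row for row in match.group(1).split("\\\\") if row.strip()]
    return [[cell.strip() for cell in row.split("&")] for row in rows]


def cell_equal(a, b):
    """Compare two cells numerically when possible, else as plain strings."""
    try:
        return isclose(float(a), float(b), rel_tol=1e-4)  # tolerance is an assumption
    except ValueError:
        return a == b


def matrix_equal(prediction, reference):
    """True if both strings are pmatrices of equal shape with equal cells."""
    pred_rows, ref_rows = split_pmatrix(prediction), split_pmatrix(reference)
    if pred_rows is None or ref_rows is None:
        return False
    if len(pred_rows) != len(ref_rows):
        return False
    return all(
        len(p_row) == len(r_row)
        and all(cell_equal(x, y) for x, y in zip(p_row, r_row))
        for p_row, r_row in zip(pred_rows, ref_rows)
    )
```

Checking the shape before the cells keeps a 2x2 matrix from matching a 1x4 row vector that happens to contain the same numbers.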

Related Pages

Environment:Sail_sg_LongSpec_Training_Environment
