Implementation:Sail sg LongSpec Math Equivalence Engine
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation, Mathematics |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Concrete tool for determining mathematical equivalence between predicted and reference answers using numerical, symbolic, and LaTeX-based comparison.
Description
The eval_utils module provides the core math_equal function and supporting utilities for comparing mathematical expressions. It supports three levels of comparison: (1) numerical equality with tolerance, (2) symbolic equality via SymPy parsing and simplification, and (3) structural equality for matrices and tuples. It also includes ground truth parsing for multiple datasets (MATH, GSM8K, TabMWP, BBH, etc.), prediction normalization, and timeout-protected symbolic comparison using multiprocessing.
Usage
Import this module when you need to check whether a model-predicted math answer is equivalent to a ground truth answer during benchmark evaluation. It is the core equivalence engine used by the evaluation callbacks.
Code Reference
Source Location
- Repository: Sail_sg_LongSpec
- File: longspec/train/data/deepseek_math_utils/eval_utils.py
- Lines: 1-331
Signature
def math_equal(
prediction: Union[bool, float, str],
reference: Union[float, str],
include_percentage: bool = True,
is_close: bool = True,
timeout: bool = False,
) -> bool:
"""
Exact match of math if and only if:
1. numerical equal: both can convert to float and are equal
2. symbolic equal: both can convert to sympy expression and are equal
"""
def symbolic_equal(a: str, b: str) -> bool:
"""Check symbolic equality via SymPy parse_latex/parse_expr and simplify."""
def parse_ground_truth(example: Dict[str, Any], data_name: str) -> Tuple[str, str]:
"""Parse ground truth answer from dataset examples (MATH, GSM8K, TabMWP, etc.)."""
def parse_question(example: dict, data_name: str) -> str:
"""Extract question text from dataset examples."""
def normalize_prediction(prediction: str) -> str:
"""Normalize prediction string via numerical rounding or symbolic parsing."""
Import
from data.deepseek_math_utils.eval_utils import math_equal, parse_ground_truth, symbolic_equal
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prediction | Union[bool, float, str] | Yes | Model-predicted answer |
| reference | Union[float, str] | Yes | Ground truth reference answer |
| include_percentage | bool | No | Whether to check percentage variants (default True) |
| is_close | bool | No | Whether to use approximate comparison (default True) |
| timeout | bool | No | Whether to use timeout-protected symbolic comparison |
Outputs
| Name | Type | Description |
|---|---|---|
| result | bool | True if prediction matches reference |
Usage Examples
from data.deepseek_math_utils.eval_utils import math_equal
# Numerical equality
assert math_equal("3.14", "3.14") == True
# Percentage handling
assert math_equal("50", "0.5", include_percentage=True) == True
# Symbolic equality (LaTeX)
assert math_equal("\\frac{1}{2}", "0.5") == True
# Matrix comparison
assert math_equal(
"\\begin{pmatrix} 1 & 2 \\\\ 3 & 4 \\end{pmatrix}",
"\\begin{pmatrix} 1 & 2 \\\\ 3 & 4 \\end{pmatrix}"
) == True