Principle:SqueezeAILab ETS Answer Normalization And Grading

Knowledge Sources	ETS Hendrycks MATH
Domains	Evaluation, Mathematical_Reasoning, String_Normalization
Last Updated	2026-02-14 02:00 GMT

Overview

A two-stage normalization and comparison pipeline that determines whether a model-generated mathematical answer is equivalent to the ground truth.

Description

Mathematical answers can be expressed in many equivalent forms (e.g., "0.5", "1/2", "\frac{1}{2}"). The grading system must recognize these equivalences while avoiding false positives. The ETS grading pipeline uses a two-stage normalization approach:

Stage 1 — MATH-Compatible Normalization (Hendrycks):

Removes \left and \right, and rewrites \tfrac and \dfrac as \frac
Strips units, degree symbols, dollar signs, and percent signs
Fixes malformed fractions (\frac1b → \frac{1}{b})
Converts a/b to \frac{a}{b} when a and b are plain integers
Converts the answer 0.5 to \frac{1}{2} (a special case)
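
The Stage 1 steps above can be sketched in a few lines of Python. This is a minimal illustration, not the ETS or Hendrycks implementation: the function name normalize_math_sketch is hypothetical, unit stripping is omitted, and the regular expressions cover only the simplest inputs.

```python
import re


def normalize_math_sketch(answer: str) -> str:
    """Illustrative subset of Stage 1 (hypothetical helper, not the ETS code)."""
    s = answer.strip()
    # Drop sizing commands and unify fraction commands.
    s = s.replace("\\left", "").replace("\\right", "")
    s = s.replace("\\tfrac", "\\frac").replace("\\dfrac", "\\frac")
    # Strip degree symbols, dollar signs, and percent signs.
    s = s.replace("^{\\circ}", "").replace("^\\circ", "")
    s = s.replace("\\$", "").replace("$", "")
    s = s.replace("\\%", "").replace("%", "")
    # Repair brace-less fractions such as \frac1b -> \frac{1}{b}.
    s = re.sub(r"\\frac([0-9a-zA-Z])([0-9a-zA-Z])", r"\\frac{\1}{\2}", s)
    # Rewrite a plain integer ratio a/b as \frac{a}{b}.
    match = re.fullmatch(r"(-?\d+)/(\d+)", s)
    if match:
        s = "\\frac{%s}{%s}" % match.groups()
    # Special case: the answer 0.5 becomes \frac{1}{2}.
    if s == "0.5":
        s = "\\frac{1}{2}"
    return s


print(normalize_math_sketch("\\frac1b"))  # \frac{1}{b}
print(normalize_math_sketch("3/4"))       # \frac{3}{4}
```

Because every step is a string rewrite, two answers match at this stage only if they reduce to the same characters; no arithmetic is performed.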

Stage 2 — Aggressive Normalization:

Parses LaTeX to plain text via pylatexenc
Removes unit words (degree, cm, meter, mile, etc.)
Expands million/billion/trillion
Handles implicit mixed numbers (7 3/4 → 7+3/4)
Rounds floats to integers when the value is integral (e.g., 3.0 → 3)
Lowercases the result so the comparison is case-insensitive
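
A rough sketch of the Stage 2 steps follows, using only the standard library. It is an assumption-laden illustration rather than the ETS code: the LaTeX-to-text step via pylatexenc is left out, the unit list is a small sample, the rounding tolerance of 1e-7 is an assumed value, and the name normalize_aggressive_sketch is hypothetical.

```python
import re

# Small sample of unit words; the real list is longer (assumption).
UNIT_WORDS = ("degrees", "degree", "meters", "meter", "miles", "mile", "cm")
SCALE = {"million": 10**6, "billion": 10**9, "trillion": 10**12}


def normalize_aggressive_sketch(answer: str) -> str:
    """Illustrative subset of Stage 2 (hypothetical helper, not the ETS code)."""
    s = answer.lower().strip()
    # Remove unit words when they appear as whole words.
    for unit in UNIT_WORDS:
        s = re.sub(r"\b%s\b" % unit, "", s)
    s = s.strip()
    # Expand "2.5 million" into a plain number.
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(million|billion|trillion)", s)
    if match:
        s = str(float(match.group(1)) * SCALE[match.group(2)])
    # Make implicit mixed numbers explicit: "7 3/4" -> "7+3/4".
    s = re.sub(r"^(\d+) (\d+/\d+)$", r"\1+\2", s)
    # Collapse floats that are integral up to a small tolerance.
    try:
        value = float(s)
        if abs(value - round(value)) <= 1e-7:
            s = str(int(round(value)))
    except (ValueError, OverflowError):
        pass  # not a finite number; leave the string unchanged
    return s


print(normalize_aggressive_sketch("7 3/4"))        # 7+3/4
print(normalize_aggressive_sketch("2.5 Million"))  # 2500000
```

These rewrites are deliberately lossy (units and case are discarded), which is why they run only after the conservative Stage 1 comparison has failed.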

The grading logic first compares the Stage 1 normalized strings and accepts on a match. If they differ, it applies Stage 2 normalization and compares again. Tuple and interval answers are split into elements and compared element by element. Fractions must match exactly, with no simplification, so 2/4 is not accepted for 1/2. A sympy symbolic equality check is defined but currently disabled.
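
The control flow described above might look like the sketch below. The names grade_sketch and split_elements are hypothetical, the two normalizers are passed in as parameters with trivial stand-in defaults, and the bracket check that keeps (1, 2) distinct from [1, 2] is an assumption about how intervals are treated. With the symbolic check disabled, element comparison reduces to string equality.

```python
from typing import Callable, List


def split_elements(expr: str) -> List[str]:
    """Split '(a, b)' or '[a, b)' into elements; otherwise return [expr]."""
    expr = expr.strip()
    if len(expr) > 2 and expr[0] in "([" and expr[-1] in ")]":
        return [part.strip() for part in expr[1:-1].split(",")]
    return [expr]


def grade_sketch(
    given: str,
    truth: str,
    stage1: Callable[[str], str] = str.strip,
    stage2: Callable[[str], str] = lambda s: s.strip().lower(),
) -> bool:
    """Illustrative grading flow (hypothetical, not the ETS grade_answer)."""
    # Stage 1: conservative comparison.
    if stage1(given) == stage1(truth):
        return True
    # Stage 2: aggressive comparison.
    g, t = stage2(given), stage2(truth)
    if g == t:
        return True
    # Element-wise comparison for tuples and intervals.
    g_parts, t_parts = split_elements(g), split_elements(t)
    if len(g_parts) != len(t_parts):
        return False
    # Assumed rule: bracket types must agree for multi-element answers.
    if len(t_parts) > 1 and (g[0] != t[0] or g[-1] != t[-1]):
        return False
    # Exact match per element, so "2/4" does not equal "1/2".
    return all(a == b for a, b in zip(g_parts, t_parts))


print(grade_sketch("(1,2)", "(1, 2)"))  # True
print(grade_sketch("2/4", "1/2"))       # False
```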

Usage

The grade_answer() function is the primary API for determining answer correctness. It is used both in best-of-n evaluation (to check the selected answer) and in majority voting (to group equivalent answers into equivalence classes).
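
As an illustration of the majority-voting use, the sketch below groups sampled answers into equivalence classes with a pairwise comparator and returns a representative of the largest class. The function majority_vote and its is_equivalent parameter are hypothetical. In practice grade_answer() would be passed as the comparator, assuming it accepts two answer strings and returns a boolean; a toy comparator stands in for it here.

```python
from typing import Callable, List


def majority_vote(
    answers: List[str], is_equivalent: Callable[[str, str], bool]
) -> str:
    """Return a representative of the largest equivalence class (sketch)."""
    if not answers:
        raise ValueError("no answers to vote on")
    classes: List[List[str]] = []
    for answer in answers:
        for members in classes:
            # Compare against the first member as the class representative.
            if is_equivalent(answer, members[0]):
                members.append(answer)
                break
        else:
            classes.append([answer])
    # max() keeps the first maximal class, so ties go to the earliest answer.
    return max(classes, key=len)[0]


def toy_equivalent(a: str, b: str) -> bool:
    # Stand-in comparator; replace with the real grading function.
    return a.strip().lower() == b.strip().lower()


print(majority_vote(["x", "1/2", " X "], toy_equivalent))  # x
```

Comparing each answer only against one representative per class keeps the number of grading calls low, at the cost of assuming the comparator behaves like an equivalence relation.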

Theoretical Basis

Answer equivalence in mathematics is fundamentally a semantic equality problem. The two-stage approach provides a pragmatic solution:

# Abstract grading pipeline
def grade(given, truth):
    # Stage 1: Conservative normalization (Hendrycks MATH compatible)
    if normalize_math(given) == normalize_math(truth):
        return True
    # Stage 2: Aggressive normalization
    if normalize_aggressive(given) == normalize_aggressive(truth):
        return True
    # Fallback: element-wise structural comparison (tuples, intervals, fractions)
    return compare_structured(given, truth)

The multi-stage design balances precision (avoiding false positives) with recall (recognizing equivalent representations). Stage 1 is compatible with the established Hendrycks MATH evaluation protocol, while Stage 2 catches additional equivalences.

Related Pages

Implemented By

Implementation:SqueezeAILab_ETS_Grade_Answer
