Principle:SqueezeAILab ETS Answer Normalization And Grading
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Mathematical_Reasoning, String_Normalization |
| Last Updated | 2026-02-14 02:00 GMT |
Overview
A two-stage normalization and comparison pipeline that determines whether a model-generated mathematical answer is equivalent to the ground truth.
Description
Mathematical answers can be expressed in many equivalent forms (e.g., "0.5", "1/2", "\frac{1}{2}"). The grading system must recognize these equivalences while avoiding false positives. The ETS grading pipeline uses a two-stage normalization approach:
Stage 1 — MATH-Compatible Normalization (Hendrycks):
- Removes the LaTeX sizing commands \left and \right, and rewrites \tfrac and \dfrac as \frac
- Strips units, degree symbols, dollar signs, and percent signs
- Fixes malformed fractions (\frac1b → \frac{1}{b})
- Converts a/b to \frac{a}{b} for simple cases
- Converts the literal string 0.5 to \frac{1}{2}
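The Stage 1 rules listed above can be sketched in a few lines of Python. This is a simplified illustration, not the ETS or Hendrycks source: the function name `normalize_stage1` is hypothetical, unit stripping is omitted, and the fraction repair only handles single-character numerators and denominators.

```python
import re


def normalize_stage1(answer: str) -> str:
    """Hypothetical, simplified Stage 1 normalizer (illustration only)."""
    s = answer.strip()
    # Drop sizing commands and unify the fraction macros.
    s = s.replace("\\left", "").replace("\\right", "")
    s = s.replace("\\tfrac", "\\frac").replace("\\dfrac", "\\frac")
    # Strip degree symbols, dollar signs, and percent signs.
    s = s.replace("^{\\circ}", "").replace("^\\circ", "")
    s = s.replace("\\$", "").replace("$", "")
    s = s.replace("\\%", "").replace("%", "")
    # Repair brace-less fractions: \frac1b -> \frac{1}{b}.
    s = re.sub(r"\\frac(\w)(\w)", r"\\frac{\1}{\2}", s)
    # Rewrite a simple integer ratio a/b as \frac{a}{b}.
    m = re.fullmatch(r"(-?\d+)/(\d+)", s)
    if m:
        s = "\\frac{%s}{%s}" % (m.group(1), m.group(2))
    # Special-case the literal decimal 0.5.
    if s == "0.5":
        s = "\\frac{1}{2}"
    # Whitespace is not significant for the string comparison.
    return s.replace(" ", "")
```

With this sketch, `\dfrac{1}{2}`, `\frac12`, and `0.5` all normalize to the same string, while `0.50` does not, which shows how conservative this stage is.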
Stage 2 — Aggressive Normalization:
- Parses LaTeX to plain text via pylatexenc
- Removes unit words (degree, cm, meter, mile, etc.)
- Expands million/billion/trillion
- Handles implicit mixed numbers (7 3/4 → 7+3/4)
- Rewrites floats whose value is a whole number as integers (e.g., 3.0 → 3)
- Case-insensitive comparison
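A rough sketch of several Stage 2 rules follows. It assumes the LaTeX has already been converted to plain text (the pylatexenc step is left out), uses only the four unit words named above, and the name `normalize_stage2` and the tolerance `1e-7` are illustrative assumptions rather than values taken from the ETS code.

```python
import re

# Illustrative subsets; the real unit list is longer.
_SCALES = {"million": 10**6, "billion": 10**9, "trillion": 10**12}
_UNIT_PATTERN = re.compile(r"\s*\b(?:degree|cm|meter|mile)s?\b")
_SCALE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(million|billion|trillion)")


def normalize_stage2(text: str) -> str:
    """Hypothetical, simplified Stage 2 normalizer for plain-text answers."""
    # Lowercasing makes the later string comparison case-insensitive.
    s = text.strip().lower()
    # Remove unit words such as "cm" or "miles".
    s = _UNIT_PATTERN.sub("", s)
    # Expand "3 million" into its numeric value.
    s = _SCALE_PATTERN.sub(
        lambda m: str(float(m.group(1)) * _SCALES[m.group(2)]), s
    )
    # Make implicit mixed numbers explicit: "7 3/4" -> "7+3/4".
    s = re.sub(r"(\d)\s+(\d+/\d+)", r"\1+\2", s)
    # Rewrite whole-valued floats as integers: "4.0" -> "4".
    try:
        value = float(s)
        if abs(value - round(value)) <= 1e-7:
            s = str(int(round(value)))
    except (ValueError, OverflowError):
        pass  # Not a plain finite number; leave the string unchanged.
    return s
```

Because each rule discards information (units, case, number formatting), this stage accepts more answers than Stage 1, which is why it only runs after the conservative comparison fails.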
The grading logic first compares the Stage 1 normalized strings. If they differ, it compares the Stage 2 normalized strings. If those also differ, tuple and interval answers are split and compared element-wise. Fractions must match exactly: no simplification is applied, so an unreduced fraction does not match its reduced form. The sympy symbolic equality check is defined but currently disabled.
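The element-wise comparison can be illustrated as follows. This is a sketch under stated assumptions: the helper names are hypothetical, nested tuples are not handled, and the optional `symbolic_equal` callback is an invented parameter that stands in for the disabled sympy check (leaving it as `None` mirrors the current behaviour of exact matching).

```python
import re
from typing import Callable, List, Optional


def split_elements(expr: str) -> List[str]:
    """Split '(a, b)' or '[a, b)' into elements; a scalar yields one element."""
    if len(expr) > 2 and expr[0] in "([" and expr[-1] in ")]":
        return [part.strip() for part in expr[1:-1].split(",")]
    return [expr]


def is_fraction(expr: str) -> bool:
    """True for plain integer ratios such as '3/4' or '-1/2'."""
    return re.fullmatch(r"-?\d+/\d+", expr) is not None


def compare_structured(
    given: str,
    truth: str,
    symbolic_equal: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    if not given or not truth:
        return False
    given_elems, truth_elems = split_elements(given), split_elements(truth)
    # Brackets must agree, so the intervals (1,2) and [1,2] stay distinct.
    if len(truth_elems) > 1 and (given[0] != truth[0] or given[-1] != truth[-1]):
        return False
    if len(given_elems) != len(truth_elems):
        return False
    for g, t in zip(given_elems, truth_elems):
        if g == t:
            continue
        # Two fractions must match exactly; they are never simplified.
        if is_fraction(g) and is_fraction(t):
            return False
        # With the symbolic check disabled (None), any mismatch fails.
        if symbolic_equal is None or not symbolic_equal(g, t):
            return False
    return True
```

For example, `compare_structured("2/4", "1/2")` is rejected even if a permissive `symbolic_equal` is supplied, because the fraction rule is applied before any symbolic comparison.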
Usage
The grade_answer() function is the primary API for determining answer correctness. It is used both in best-of-n evaluation (to check the selected answer) and in majority voting (to group equivalent answers into equivalence classes).
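The majority-voting use can be sketched as grouping answers into equivalence classes with a pairwise equivalence predicate. The function `majority_vote` and its `is_equivalent` parameter are hypothetical; the sketch assumes a grader that takes two strings and returns a bool could be passed in that slot, and it compares each answer only against the first member of each class.

```python
from typing import Callable, List


def majority_vote(
    answers: List[str],
    is_equivalent: Callable[[str, str], bool],
) -> str:
    """Return a representative of the largest equivalence class of answers."""
    if not answers:
        raise ValueError("answers must not be empty")
    classes: List[List[str]] = []
    for answer in answers:
        for members in classes:
            # Compare against the class representative (its first member).
            if is_equivalent(answer, members[0]):
                members.append(answer)
                break
        else:
            # No existing class matched, so start a new one.
            classes.append([answer])
    # max() keeps the first largest class, so ties go to the earliest answer.
    return max(classes, key=len)[0]


# Usage with a stand-in predicate (whitespace- and case-insensitive match).
same = lambda a, b: a.strip().lower() == b.strip().lower()
winner = majority_vote(["1/2", " 1/2", "0.3", "1/2 "], same)
```

Comparing against a single representative keeps the number of grader calls proportional to the number of classes, at the cost of assuming the predicate behaves transitively.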
Theoretical Basis
Answer equivalence in mathematics is fundamentally a semantic equality problem. The two-stage approach provides a pragmatic solution:

    # Abstract grading pipeline (pseudocode: the three helpers are placeholders)
    def grade(given, truth):
        # Stage 1: conservative normalization (Hendrycks MATH compatible)
        if normalize_math(given) == normalize_math(truth):
            return True
        # Stage 2: aggressive normalization
        if normalize_aggressive(given) == normalize_aggressive(truth):
            return True
        # Fallback: structural comparison (tuples, intervals, fractions)
        return compare_structured(given, truth)

The multi-stage design balances precision (avoiding false positives) with recall (recognizing equivalent representations). Stage 1 is compatible with the established Hendrycks MATH evaluation protocol, while Stage 2 catches additional equivalences.