Workflow:SqueezeAILab ETS Answer Evaluation Pipeline

Knowledge Sources	SqueezeAILab ETS; ETS: Efficient Tree Search for Inference-Time Scaling
Domains	LLMs, Evaluation, Math_Reasoning
Last Updated	2025-02-14 00:00 GMT

Overview

End-to-end process for evaluating tree search outputs by extracting candidate answers, aggregating per-step scores, grading against ground truth, and reporting accuracy on math benchmarks.

Description

This workflow takes the raw JSON output from ETS tree search and computes accuracy metrics. It supports two selection strategies: best-of-n (pick the single trajectory with the highest aggregated score) and weighted majority voting (group equivalent answers, weight each group by the aggregated PRM scores of its trajectories, and pick the group with the highest total weight). The grading system uses multi-stage normalization to handle varied answer formats, including LaTeX, boxed expressions, fractions, and plain numbers.

Goal: Produce an accuracy score (fraction of correctly solved problems) from the tree search output.

Scope: Covers answer extraction from model-generated text, score aggregation across solution steps, answer grading against ground truth, and accuracy reporting.

Strategy: Runs a pipeline of regex-based answer extraction with multiple fallback patterns, two passes of string normalization (MATH-compatible, then aggressive), and optional symbolic equality checking. Per-step PRM scores are combined into a trajectory weight by a configurable aggregation function (last, mean, prod, or min).

Usage

Run this workflow after the ETS Experiment Pipeline has finished, to compute accuracy on the benchmark. The only input is the answers.json output file from tree search, which already has the ground truth answers embedded in it. Evaluation does not require a GPU.

Execution Steps

Step 1: Select aggregation strategy

Choose how to select the final answer from the set of candidate trajectories produced by tree search. The two main strategies are best-of-n (select the trajectory with the highest aggregated step score) and weighted majority voting (group equivalent answers and weight each vote by its aggregated score). Also choose the score aggregation function that combines per-step PRM scores into a single trajectory score.

Available aggregation functions:

last: Use only the final step's PRM score
mean: Average all step scores
prod: Multiply all step scores together
min: Use the minimum step score across the trajectory
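
The four aggregation functions above can be sketched as follows. This is a minimal illustration, not the repository's code: the names AGGREGATORS and aggregate are hypothetical, and the input is assumed to be a plain list of per-step PRM scores for one trajectory.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical lookup table from aggregation name to a reducer over step scores.
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "last": lambda scores: scores[-1],                # final step's score only
    "mean": lambda scores: sum(scores) / len(scores), # average over all steps
    "prod": math.prod,                                # product of all step scores
    "min": min,                                       # weakest step in the trajectory
}


def aggregate(step_scores: Sequence[float], method: str = "last") -> float:
    """Collapse the per-step PRM scores of one trajectory into a single score."""
    if not step_scores:
        raise ValueError("a trajectory needs at least one step score")
    if method not in AGGREGATORS:
        raise ValueError(f"unknown aggregation method: {method!r}")
    return AGGREGATORS[method](step_scores)
```

With step scores [0.9, 0.5, 0.8], last gives 0.8 and min gives 0.5, while prod (about 0.36) penalizes every weak step and therefore tends to favor shorter trajectories.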

Step 2: Extract answers from trajectories

Parse the model-generated text of each candidate trajectory to extract the final answer. The extraction system uses a cascade of pattern matchers: first checking for "final answer is $...$" patterns, then LaTeX boxed expressions, then "the answer is" phrases, then program output blocks, and finally falling back to the last numeric value in the text. Extracted answers are stripped and normalized.

Key patterns matched:

"final answer is $X$. I hope" (instruction-tuned Llama format)
\boxed{X} (LaTeX boxed answers)
"The answer is: X ки" (Shepherd/Llemma step format)
Last numeric value as ultimate fallback
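
A rough sketch of such a cascade is shown below. The regular expressions and the function names extract_answer and last_boxed_content are assumptions made for illustration, and the program-output pattern is left out because its format is not described here. Each matcher is tried in order, and the first one that succeeds wins.

```python
import re
from typing import Optional


def last_boxed_content(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, honoring nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces


def extract_answer(text: str) -> Optional[str]:
    """Try each pattern in order and return the first extracted answer."""
    # 1. "final answer is $X$. I hope" (instruction-tuned Llama format)
    m = re.search(r"final answer is \$(.*?)\$\. I hope", text, flags=re.DOTALL)
    if m:
        return m.group(1).strip()
    # 2. \boxed{X}
    boxed = last_boxed_content(text)
    if boxed is not None:
        return boxed.strip()
    # 3. "The answer is: X ки" (the step tag is optional in this sketch)
    m = re.search(r"[Tt]he answer is:?\s*(.+?)\s*(?:ки|$)", text, flags=re.DOTALL)
    if m:
        return m.group(1).strip().rstrip(".")
    # 4. Ultimate fallback: the last numeric value in the text
    numbers = re.findall(r"-?\d+(?:,\d{3})*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None
```

Scanning braces by hand, rather than using a regex, is what lets the boxed matcher return nested content such as \frac{3}{4} intact.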

Step 3: Normalize and grade answers

Compare each extracted answer against the ground truth using a two-stage normalization pipeline. The first stage applies MATH-benchmark-compatible normalization (Hendrycks): stripping LaTeX formatting, fixing fractions, removing units and dollar signs. If the normalized strings match, the answer is correct. Otherwise, a second aggressive normalization pass is applied that handles mixed numbers, implicit multiplication, case insensitivity, and tuple decomposition.

Normalization operations:

Strip LaTeX commands (\text{}, \frac, \sqrt, etc.)
Remove units, currency symbols, and percentage signs
Convert fractions and mixed numbers to canonical form
Lowercase all text for case-insensitive comparison
Parse tuples and intervals for element-wise comparison
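
The two-pass idea can be illustrated with a heavily simplified sketch. It covers only a subset of the operations listed above (no mixed numbers, implicit multiplication, or symbolic checking), and every function name in it is hypothetical rather than taken from the repository.

```python
import re
from fractions import Fraction
from typing import List, Optional, Tuple


def normalize_basic(ans: str) -> str:
    """First pass: light, MATH-style string cleanup (simplified)."""
    s = ans.strip()
    if "\\text{ " in s:  # drop trailing units such as "\text{ cm}"
        s = s.split("\\text{ ")[0]
    s = s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    s = s.replace("\\left", "").replace("\\right", "")
    for token in ("\\$", "$", "\\%", "%"):
        s = s.replace(token, "")
    return s.replace(" ", "")


def to_number(s: str) -> Optional[Fraction]:
    """Read s as an exact rational, or return None if it is not numeric."""
    s = s.replace(",", "")
    m = re.fullmatch(r"\\frac\{(-?\d+)\}\{(-?\d+)\}", s)
    if m:
        s = f"{m.group(1)}/{m.group(2)}"
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def split_tuple(s: str) -> Tuple[str, List[str]]:
    """Return (bracket pair, elements); a bare value has an empty bracket pair."""
    if len(s) >= 2 and s[0] in "([" and s[-1] in ")]":
        return s[0] + s[-1], s[1:-1].split(",")
    return "", [s]


def is_correct(pred: str, truth: str) -> bool:
    """Stage 1: exact match after basic cleanup. Stage 2: aggressive comparison."""
    a, b = normalize_basic(pred), normalize_basic(truth)
    if a == b:
        return True
    # Stage 2: case-insensitive, element-wise, numeric where possible.
    (a_br, a_parts), (b_br, b_parts) = split_tuple(a.lower()), split_tuple(b.lower())
    if a_br != b_br or len(a_parts) != len(b_parts):
        return False
    for x, y in zip(a_parts, b_parts):
        nx, ny = to_number(x), to_number(y)
        if nx is not None and ny is not None:
            if nx != ny:
                return False
        elif x != y:
            return False
    return True
```

Comparing exact Fraction values instead of floats is a design choice of this sketch: it lets 0.5 and \frac{1}{2} compare equal without a tolerance. Keeping the bracket pair means the intervals [1,2] and (1,2) are treated as different answers.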

Step 4: Aggregate and report accuracy

For best-of-n: select the trajectory with the highest aggregated score per problem, extract and grade its answer, and count correct answers across the dataset. For weighted majority voting: group trajectories by equivalent answers, sum their aggregated scores as weights, select the answer class with the highest total weight, and grade it. Report accuracy as the fraction of correctly answered problems, optionally writing results to a text file.
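
Both selection strategies can be sketched for a single problem as follows. The sketch assumes each candidate has already been reduced to an (extracted answer, aggregated score) pair; the Candidate alias, the function names, and the equivalent callback are illustrative assumptions, with plain string equality as the default equivalence test.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed shape: (extracted answer, aggregated trajectory score).
Candidate = Tuple[str, float]


def best_of_n(candidates: Sequence[Candidate]) -> str:
    """Return the answer of the single highest-scoring trajectory."""
    return max(candidates, key=lambda c: c[1])[0]


def weighted_majority_vote(
    candidates: Sequence[Candidate],
    equivalent: Callable[[str, str], bool] = lambda a, b: a == b,
) -> str:
    """Group equivalent answers, sum their scores, and return the heaviest group."""
    classes: List[List] = []  # each entry is [representative answer, total weight]
    for answer, score in candidates:
        for cls in classes:
            if equivalent(cls[0], answer):
                cls[1] += score
                break
        else:  # no existing class matched, so start a new one
            classes.append([answer, score])
    return max(classes, key=lambda c: c[1])[0]
```

For the candidates [("4", 0.9), ("5", 0.5), ("5", 0.6)], best-of-n returns "4" because 0.9 is the top individual score, while weighted majority voting returns "5" because its class accumulates 1.1. Both functions raise ValueError on an empty candidate list.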

Output:

Number of correct answers (printed to stdout)
Accuracy as a decimal fraction
Optional text file with aggregation settings and accuracy
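
A reporting step along these lines might look like the sketch below. The function name, its parameters, and the layout of the output file are assumptions for illustration; the actual script's output format may differ.

```python
from typing import Optional, Sequence


def report_accuracy(
    graded: Sequence[bool],
    agg_func: str,
    strategy: str,
    out_path: Optional[str] = None,
) -> float:
    """Print the correct count and accuracy, and optionally save them to a file."""
    num_correct = sum(graded)  # True counts as 1, False as 0
    accuracy = num_correct / len(graded) if graded else 0.0
    print(f"Correct: {num_correct}/{len(graded)}")
    print(f"Accuracy: {accuracy:.4f}")
    if out_path is not None:
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(f"strategy: {strategy}\n")
            f.write(f"aggregation: {agg_func}\n")
            f.write(f"accuracy: {accuracy:.4f}\n")
    return accuracy
```

Here graded holds one boolean per problem, namely the grading result for the answer chosen by the selection strategy.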

GitHub URL

Workflow Repository