
Implementation:SqueezeAILab ETS Evaluate And Majority Vote

From Leeroopedia
Knowledge Sources
Domains Evaluation, Accuracy_Measurement
Last Updated 2026-02-14 02:00 GMT

Overview

A concrete tool for computing accuracy over tree search results via best-of-n selection or weighted majority voting, provided by math_evaluate.py.

Description

Two evaluation functions compute accuracy over tree search results:

  • evaluate(): Best-of-n selection — for each question, picks the candidate with highest aggregated score, extracts its answer, and grades against ground truth
  • majority_vote(): Weighted voting — for each question, groups candidates into equivalence classes using grade_answer(), sums weights, and selects the highest-weighted class

Both return accuracy as a float (num_correct / total).
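The two selection strategies described above can be sketched as follows. This is a minimal illustration, not the repository's code: `best_of_n` and `weighted_majority` are hypothetical names, the `(answer, step_scores)` candidate shape is an assumption, and `grade_answer` stands in for the equivalence-checking helper the description mentions.

```python
def best_of_n(candidates, agg):
    """Sketch of evaluate(): pick the candidate whose aggregated score is highest.

    candidates: list of (answer, [step_scores]) pairs (assumed shape).
    agg: score aggregation function, e.g. min/mean/prod/last over step scores.
    """
    return max(candidates, key=lambda c: agg(c[1]))[0]

def weighted_majority(candidates, agg, grade_answer):
    """Sketch of majority_vote(): group equivalent answers, sum their weights,
    and return the representative of the heaviest equivalence class."""
    classes = []  # each entry: [representative_answer, total_weight]
    for ans, scores in candidates:
        for cls in classes:
            if grade_answer(ans, cls[0]):  # same equivalence class
                cls[1] += agg(scores)
                break
        else:
            classes.append([ans, agg(scores)])
    return max(classes, key=lambda c: c[1])[0]
```

With `agg` set to a last-step aggregator, best-of-n and weighted voting can disagree: a single high-scoring candidate may win best-of-n while a class of several lower-scoring duplicates wins the vote.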

Usage

Both functions are called from main() in math_evaluate.py, which dispatches on the --agg_func CLI argument. Majority voting additionally requires the --weighted and --weight_agg arguments.
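A sketch of how main() might declare these flags with argparse; the argument names come from the CLI examples on this page, but the defaults and the string-to-bool handling of --weighted are assumptions, not the repository's code:

```python
import argparse

def build_parser():
    # Flag names taken from the documented CLI; defaults are assumptions.
    p = argparse.ArgumentParser()
    p.add_argument("--path", required=True)
    p.add_argument("--agg_func", default="last",
                   choices=["min", "mean", "prod", "last", "majority_vote"])
    p.add_argument("--model_type", default="llemma")
    # "--weighted True" is passed as a string on the CLI, so parse it explicitly.
    p.add_argument("--weighted", type=lambda s: s == "True", default=False)
    p.add_argument("--weight_agg", default="last")
    p.add_argument("--output_path", default=None)
    return p
```

main() would then branch: --agg_func majority_vote routes to majority_vote(), while min/mean/prod/last route to evaluate() with the matching aggregator.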

Code Reference

Source Location

  • Repository: ETS
  • File: math_evaluate.py
  • Lines: 91-108 (evaluate), 110-143 (majority_vote), 147-191 (main with CLI)

Signature

def evaluate(path, aggfunc, extract_function):
    """
    Best-of-N evaluation: select highest-scoring candidate per question.

    Args:
        path (str): Path to answers.json file
        aggfunc (Callable): Score aggregation function (agg_min/mean/prod/last)
        extract_function (Callable): Answer extraction function

    Returns:
        float: Accuracy (num_correct / total)
    """

def majority_vote(path, weighted, weight_func, extract_function):
    """
    Weighted majority voting evaluation.

    Args:
        path (str): Path to answers.json file
        weighted (bool): Whether to weight votes by step scores
        weight_func (Callable): Weight aggregation function (e.g., agg_last)
        extract_function (Callable): Answer extraction function

    Returns:
        float: Accuracy (num_correct / total)
    """

Import

# Defined in math_evaluate.py
# CLI entry point:
# python3 math_evaluate.py --path answers.json --agg_func majority_vote \
#     --model_type llemma --weighted True --weight_agg last

I/O Contract

Inputs

  • path (str, required): Path to the answers.json file from tree search output
  • aggfunc (Callable, required by evaluate()): Score aggregation function
  • extract_function (Callable, required): Answer extraction function for the model type
  • weighted (bool, required by majority_vote()): Whether to weight votes by step scores
  • weight_func (Callable, required by majority_vote() when weighted): Weight aggregation function

Outputs

  • accuracy (float): Fraction of correctly answered questions (num_correct / total)
  • stdout (text): Prints the num_correct count
  • output_path file (optional): Accuracy is appended to a text file when --output_path is specified
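The output side of this contract can be sketched in a few lines. `report_accuracy` is a hypothetical name for illustration; the real script interleaves this with the evaluation itself:

```python
def report_accuracy(accuracy, output_path=None):
    # Mirrors the documented outputs: print to stdout, and append the
    # accuracy to a text file when an output path is given.
    print(f"accuracy: {accuracy:.4f}")
    if output_path is not None:
        with open(output_path, "a") as f:
            f.write(f"{accuracy}\n")
```

Appending (rather than overwriting) matches the sweep script below, which writes one result per width into the same results file layout.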

Usage Examples

Best-of-N Evaluation

# Evaluate using best-of-n with last step score
accuracy = evaluate(
    path="exp_results/ets_16_math500/answers.json",
    aggfunc=agg_last,
    extract_function=extract_shepherd_answer,
)
print(f"Accuracy: {accuracy:.4f}")

Weighted Majority Voting

# Evaluate using weighted majority voting
accuracy = majority_vote(
    path="exp_results/ets_16_math500/answers.json",
    weighted=True,
    weight_func=agg_last,
    extract_function=extract_shepherd_answer,
)
print(f"Accuracy: {accuracy:.4f}")

CLI Usage

# Best-of-n with min aggregation
python3 math_evaluate.py --path exp_results/ets_16_math500/answers.json \
    --agg_func min --model_type llemma

# Weighted majority voting (default in evaluate.sh)
python3 math_evaluate.py --path exp_results/ets_16_math500/answers.json \
    --agg_func majority_vote --model_type llemma \
    --weighted True --weight_agg last \
    --output_path exp_results/ets_16_math500/results_vote_last.txt

Sweep Evaluation Script

# From scripts/evaluate.sh - evaluates across multiple widths
for WIDTH in 16 64 256; do
    export path="exp_results/ets_${WIDTH}_math500/"
    python3 ./math_evaluate.py --path $path/answers.json \
        --agg_func majority_vote \
        --output_path $path/results_vote_last.txt \
        --model_type llemma \
        --weighted True \
        --weight_agg last
done

Related Pages

Implements Principle

Requires Environment
