Implementation:SqueezeAILab ETS Evaluate And Majority Vote
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Accuracy_Measurement |
| Last Updated | 2026-02-14 02:00 GMT |
Overview
Concrete tool for computing accuracy via best-of-n selection or weighted majority voting, provided by math_evaluate.py.
Description
Two evaluation functions compute accuracy over tree search results:
- evaluate(): Best-of-n selection. For each question, picks the candidate with the highest aggregated score, extracts its answer, and grades it against the ground truth.
- majority_vote(): Weighted voting. For each question, groups candidates into equivalence classes using grade_answer(), sums the weights within each class, and selects the highest-weighted class.
Both return accuracy as a float (num_correct / total).
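The two strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in math_evaluate.py: the candidate layout (a list of (answer, step_scores) pairs), the helper names best_of_n, weighted_vote, and accuracy, and the whitespace-insensitive answers_equivalent stand-in for grade_answer() are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical candidate layout: (extracted answer, per-step scores).
Candidate = Tuple[str, Sequence[float]]


def answers_equivalent(a: str, b: str) -> bool:
    """Stand-in for grade_answer(); here only a whitespace-insensitive match."""
    return a.strip() == b.strip()


def best_of_n(candidates: List[Candidate],
              aggfunc: Callable[[Sequence[float]], float]) -> str:
    """Return the answer of the candidate with the highest aggregated score."""
    answer, _ = max(candidates, key=lambda c: aggfunc(c[1]))
    return answer


def weighted_vote(candidates: List[Candidate],
                  weight_func: Callable[[Sequence[float]], float],
                  weighted: bool = True) -> str:
    """Group answers into equivalence classes and return the heaviest class."""
    classes = []  # each entry is [representative answer, summed weight]
    for answer, scores in candidates:
        weight = weight_func(scores) if weighted else 1.0
        for cls in classes:
            if answers_equivalent(cls[0], answer):
                cls[1] += weight
                break
        else:
            # No existing class matched, so this answer starts a new one.
            classes.append([answer, weight])
    return max(classes, key=lambda cls: cls[1])[0]


def accuracy(predictions: List[str], ground_truths: List[str]) -> float:
    """Fraction of predictions that match the ground truth."""
    num_correct = sum(
        answers_equivalent(p, g) for p, g in zip(predictions, ground_truths)
    )
    return num_correct / len(ground_truths)


def last_score(scores: Sequence[float]) -> float:
    """Use the final step score as the candidate's score or weight."""
    return scores[-1]


# Three candidates for one question; two of them agree on "42".
candidates = [("42", [0.9, 0.6]), ("41", [0.8, 0.7]), ("42 ", [0.7, 0.5])]
print(best_of_n(candidates, last_score))      # single best-scoring candidate
print(weighted_vote(candidates, last_score))  # class with the largest summed weight
```

In this toy input the two strategies disagree: best-of-n picks the single candidate with the highest last-step score, while voting lets the two agreeing candidates outweigh it.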
Usage
Called from main() in math_evaluate.py based on the --agg_func CLI argument. For majority voting, additionally requires --weighted and --weight_agg arguments.
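A rough sketch of how such a dispatch could be wired with argparse is shown below. Only the flag names come from this page; the defaults, the str2bool helper, and the select_mode function are assumptions made for illustration, and the real main() may parse these arguments differently.

```python
import argparse


def str2bool(value: str) -> bool:
    """Hypothetical helper: parse 'True'/'False' strings.

    Plain type=bool would treat any non-empty string, including 'False', as True.
    """
    return value.strip().lower() in {"true", "1", "yes"}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags named on this page (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Sketch of the evaluation CLI")
    parser.add_argument("--path", required=True)
    parser.add_argument("--agg_func", required=True)
    parser.add_argument("--model_type", default="llemma")
    parser.add_argument("--weighted", type=str2bool, default=False)
    parser.add_argument("--weight_agg", default="last")
    parser.add_argument("--output_path", default=None)
    return parser


def select_mode(args: argparse.Namespace) -> str:
    """Return the name of the evaluation routine the arguments select."""
    if args.agg_func == "majority_vote":
        return "majority_vote"
    # Any other value is treated as a score aggregation for best-of-n.
    return "evaluate"


args = build_parser().parse_args([
    "--path", "answers.json",
    "--agg_func", "majority_vote",
    "--model_type", "llemma",
    "--weighted", "True",
    "--weight_agg", "last",
])
print(select_mode(args), args.weighted, args.weight_agg)
```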
Code Reference
Source Location
- Repository: ETS
- File: math_evaluate.py
- Lines: 91-108 (evaluate), 110-143 (majority_vote), 147-191 (main with CLI)
Signature
def evaluate(path, aggfunc, extract_function):
    """
    Best-of-N evaluation: select highest-scoring candidate per question.

    Args:
        path (str): Path to answers.json file
        aggfunc (Callable): Score aggregation function (agg_min/mean/prod/last)
        extract_function (Callable): Answer extraction function

    Returns:
        float: Accuracy (num_correct / total)
    """

def majority_vote(path, weighted, weight_func, extract_function):
    """
    Weighted majority voting evaluation.

    Args:
        path (str): Path to answers.json file
        weighted (bool): Whether to weight votes by step scores
        weight_func (Callable): Weight aggregation function (e.g., agg_last)
        extract_function (Callable): Answer extraction function

    Returns:
        float: Accuracy (num_correct / total)
    """
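The docstrings name four aggregation functions (agg_min/mean/prod/last) without showing them. The sketch below gives one plausible reading, inferred only from the names: each function reduces a candidate's list of per-step scores to a single number. The bodies and the assumption that scores arrive as a flat sequence of floats are not taken from math_evaluate.py.

```python
import math
from typing import Sequence


def agg_min(scores: Sequence[float]) -> float:
    """Score a candidate by its weakest step."""
    return min(scores)


def agg_mean(scores: Sequence[float]) -> float:
    """Score a candidate by its average step score."""
    return sum(scores) / len(scores)


def agg_prod(scores: Sequence[float]) -> float:
    """Score a candidate by the product of its step scores."""
    return math.prod(scores)


def agg_last(scores: Sequence[float]) -> float:
    """Score a candidate by its final step score only."""
    return scores[-1]


step_scores = [0.9, 0.5, 0.8]
for func in (agg_min, agg_mean, agg_prod, agg_last):
    print(func.__name__, round(func(step_scores), 4))
```

Under this reading, agg_min and agg_prod penalize a single weak step anywhere in the solution, while agg_last looks only at the final step.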
Import
# Defined in math_evaluate.py
# CLI entry point:
#   python3 math_evaluate.py --path answers.json --agg_func majority_vote \
#       --model_type llemma --weighted True --weight_agg last
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to answers.json from tree search output |
| aggfunc | Callable | Yes (evaluate) | Score aggregation function |
| extract_function | Callable | Yes | Answer extraction function for the model type |
| weighted | bool | Yes (majority_vote) | Whether to weight votes |
| weight_func | Callable | Yes (majority_vote, if weighted) | Weight aggregation function |
Outputs
| Name | Type | Description |
|---|---|---|
| accuracy | float | Fraction of correctly answered questions (num_correct / total) |
| stdout | text | Prints num_correct count |
| output_path (optional) | File | Appends accuracy to text file if --output_path specified |
Usage Examples
Best-of-N Evaluation
# Evaluate using best-of-n with last step score
accuracy = evaluate(
    path="exp_results/ets_16_math500/answers.json",
    aggfunc=agg_last,
    extract_function=extract_shepherd_answer,
)
print(f"Accuracy: {accuracy:.4f}")
Weighted Majority Voting
# Evaluate using weighted majority voting
accuracy = majority_vote(
    path="exp_results/ets_16_math500/answers.json",
    weighted=True,
    weight_func=agg_last,
    extract_function=extract_shepherd_answer,
)
print(f"Accuracy: {accuracy:.4f}")
CLI Usage
# Best-of-n with min aggregation
python3 math_evaluate.py --path exp_results/ets_16_math500/answers.json \
    --agg_func min --model_type llemma

# Weighted majority voting (default in evaluate.sh)
python3 math_evaluate.py --path exp_results/ets_16_math500/answers.json \
    --agg_func majority_vote --model_type llemma \
    --weighted True --weight_agg last \
    --output_path exp_results/ets_16_math500/results_vote_last.txt
Sweep Evaluation Script
# From scripts/evaluate.sh - evaluates across multiple widths
for WIDTH in 16 64 256; do
    export path="exp_results/ets_${WIDTH}_math500/"
    python3 ./math_evaluate.py --path $path/answers.json \
        --agg_func majority_vote \
        --output_path $path/results_vote_last.txt \
        --model_type llemma \
        --weighted True \
        --weight_agg last
done
Related Pages
Implements Principle
Requires Environment