Principle:SqueezeAILab ETS Score Aggregation

Knowledge Sources	ETS
Domains	Evaluation, Statistical_Aggregation
Last Updated	2026-02-14 02:00 GMT

Overview

A family of aggregation functions that combine per-step PRM scores into a single trajectory-level score for candidate ranking and weighted voting.

Description

Each candidate answer produced by the ETS tree search has a list of per-step PRM scores (one score per reasoning step in the trajectory). To rank candidates or weight their votes, these step-level scores must be aggregated into a single value. The ETS evaluation pipeline provides four aggregation strategies:

Min (agg_min): Returns the minimum step score. This is the most conservative strategy, penalizing any weak reasoning step regardless of how strong the other steps are.
Mean (agg_mean): Returns the arithmetic mean of all step scores. Provides a balanced view of overall trajectory quality.
Product (agg_prod): Returns the product of all step scores. This corresponds to a joint-probability interpretation in which each step score is treated as the probability of an independent event.
Last (agg_last): Returns only the final step score. This is the default strategy in the evaluation scripts; it assumes that the last step's score captures the cumulative quality of the trajectory.
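
The four strategies above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the function names follow the ones listed above, but the bodies, the AGG_FUNCS lookup table, and its string keys are assumptions. Each function assumes a non-empty list of scores.

```python
import math
from typing import Callable, Dict, Sequence


def agg_min(scores: Sequence[float]) -> float:
    """The weakest step decides the trajectory score."""
    return min(scores)


def agg_mean(scores: Sequence[float]) -> float:
    """Arithmetic mean of all step scores."""
    return sum(scores) / len(scores)


def agg_prod(scores: Sequence[float]) -> float:
    """Product of all step scores."""
    return math.prod(scores)


def agg_last(scores: Sequence[float]) -> float:
    """Score of the final step only."""
    return scores[-1]


# Hypothetical lookup table; the keys are illustrative strategy names.
AGG_FUNCS: Dict[str, Callable[[Sequence[float]], float]] = {
    "min": agg_min,
    "mean": agg_mean,
    "prod": agg_prod,
    "last": agg_last,
}

step_scores = [0.9, 0.5, 0.8]  # made-up PRM scores for a 3-step trajectory
for name, fn in AGG_FUNCS.items():
    print(f"{name}: {fn(step_scores):.3f}")
```

Note that the product shrinks as trajectories get longer (for scores below 1), so it implicitly favors shorter trajectories, whereas min, mean, and last do not depend on length in that way.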

Usage

Select an aggregation strategy before running evaluation by passing the --agg_func CLI argument to math_evaluate.py. The provided evaluation scripts use the "last" strategy by default. When using majority voting, the aggregation function selected via --weight_agg supplies the weight of each candidate's vote.
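
To make the role of the weight function concrete, the following sketch shows one way weighted majority voting could work. It is an assumption-laden illustration, not the logic of math_evaluate.py: weighted_vote, the candidate format (answer string plus step scores), and the sample scores are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def agg_last(scores: Sequence[float]) -> float:
    """Score of the final step only (sketch of the 'last' strategy)."""
    return scores[-1]


def weighted_vote(
    candidates: List[Tuple[str, Sequence[float]]],
    weight_fn: Callable[[Sequence[float]], float],
) -> str:
    """Return the answer whose candidates accumulate the largest total weight.

    Each candidate is (final_answer, per_step_scores). Ties go to the
    answer that was seen first.
    """
    totals: Dict[str, float] = defaultdict(float)
    for answer, step_scores in candidates:
        totals[answer] += weight_fn(step_scores)
    return max(totals, key=totals.get)


# Made-up candidates: two low-scoring votes for "42", one high-scoring vote for "17".
candidates = [
    ("42", [0.9, 0.3]),
    ("42", [0.8, 0.2]),
    ("17", [0.7, 0.9]),
]

print(weighted_vote(candidates, agg_last))        # weights 0.5 vs 0.9
print(weighted_vote(candidates, lambda s: 1.0))   # unweighted majority
```

With these scores the weighted vote picks "17" while the unweighted majority picks "42", which shows how the aggregation function can overturn a plain head count.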

Theoretical Basis

Given a trajectory with step scores $s = (s_{1}, s_{2}, \dots, s_{n})$, the aggregation functions are:

$\mathrm{agg\_min}(s) = \min_{i} s_{i}$
$\mathrm{agg\_mean}(s) = \frac{1}{n} \sum_{i=1}^{n} s_{i}$
$\mathrm{agg\_prod}(s) = \prod_{i=1}^{n} s_{i}$
$\mathrm{agg\_last}(s) = s_{n}$
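
As a worked example, take a hypothetical three-step trajectory with scores $s_{1} = 0.9$, $s_{2} = 0.5$, $s_{3} = 0.8$ (illustrative values, not taken from any evaluation run). The four functions then give:

$\mathrm{agg\_min}(s) = \min(0.9, 0.5, 0.8) = 0.5$
$\mathrm{agg\_mean}(s) = \frac{0.9 + 0.5 + 0.8}{3} = \frac{2.2}{3} \approx 0.733$
$\mathrm{agg\_prod}(s) = 0.9 \times 0.5 \times 0.8 = 0.36$
$\mathrm{agg\_last}(s) = s_{3} = 0.8$

The weak middle step dominates the min and pulls the product down sharply, lowers the mean only moderately, and has no effect on the last-step score.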

The choice of aggregation function reflects different assumptions about how step quality relates to trajectory quality:

Min assumes the trajectory is only as strong as its weakest step (bottleneck model)
Mean assumes each step contributes equally to overall quality
Product models steps as independent success probabilities
Last assumes the final step score captures cumulative quality (empirically strong for PRM-scored trajectories)
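
These assumptions can lead to different rankings of the same candidates. The sketch below compares two hypothetical trajectories with made-up scores. The aggregators are written inline as assumptions about each strategy's behavior and are not the repository's implementations.

```python
import math

# Two hypothetical trajectories with made-up PRM step scores.
trajectories = {
    "A": [0.9, 0.9, 0.6],  # strong start, weaker final step
    "B": [0.5, 0.7, 0.8],  # weak start, stronger final step
}

# Inline stand-ins for the four aggregation strategies.
aggregators = {
    "min": min,
    "mean": lambda s: sum(s) / len(s),
    "prod": math.prod,
    "last": lambda s: s[-1],
}

# For each strategy, find the trajectory with the highest aggregated score.
preferred = {
    name: max(trajectories, key=lambda t: agg(trajectories[t]))
    for name, agg in aggregators.items()
}
print(preferred)
```

Here min, mean, and product all prefer trajectory A, while last prefers B, because only the last-step strategy ignores B's weak early steps.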

Related Pages

Implemented By

Implementation:SqueezeAILab_ETS_Score_Aggregation_Functions
