Principle:SqueezeAILab ETS Score Aggregation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Statistical_Aggregation |
| Last Updated | 2026-02-14 02:00 GMT |
Overview
A family of aggregation functions that combine per-step process reward model (PRM) scores into a single trajectory-level score for candidate ranking and weighted voting.
Description
Each candidate answer produced by the ETS tree search has a list of per-step PRM scores (one score per reasoning step in the trajectory). To rank candidates or weight their votes, these step-level scores must be aggregated into a single value. The ETS evaluation pipeline provides four aggregation strategies:
- Min (agg_min): Returns the minimum step score. This is the most conservative strategy, penalizing any weak reasoning step regardless of how strong the other steps are.
- Mean (agg_mean): Returns the arithmetic mean of all step scores. Provides a balanced view of overall trajectory quality.
- Product (agg_prod): Returns the product of all step scores. This corresponds to a joint-probability interpretation in which each step score is treated as the success probability of an independent event.
- Last (agg_last): Returns only the final step score. This is the default strategy used in the evaluation scripts, assuming the last step's score captures cumulative quality.
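The four strategies above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the ETS repository: the function names follow the ones listed above, but the bodies, the `AGG_FUNCS` lookup table, and the sample scores are assumptions made for the example.

```python
import math
from typing import Callable, Dict, Sequence

# All four functions assume a non-empty list of step scores.


def agg_min(scores: Sequence[float]) -> float:
    """Trajectory score is the weakest step score."""
    return min(scores)


def agg_mean(scores: Sequence[float]) -> float:
    """Trajectory score is the arithmetic mean of the step scores."""
    return sum(scores) / len(scores)


def agg_prod(scores: Sequence[float]) -> float:
    """Trajectory score is the product of the step scores."""
    return math.prod(scores)


def agg_last(scores: Sequence[float]) -> float:
    """Trajectory score is the score of the final step only."""
    return scores[-1]


# Hypothetical name-to-function table, e.g. for resolving a CLI string.
AGG_FUNCS: Dict[str, Callable[[Sequence[float]], float]] = {
    "min": agg_min,
    "mean": agg_mean,
    "prod": agg_prod,
    "last": agg_last,
}

step_scores = [0.9, 0.5, 0.8]  # made-up PRM scores for one trajectory
for name, fn in AGG_FUNCS.items():
    print(name, round(fn(step_scores), 3))
```

A lookup table like this keeps the strategy choice in one place, so a string taken from the command line can select the function without branching.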
Usage
Select an aggregation strategy before running evaluation. The strategy is specified via the --agg_func CLI argument of math_evaluate.py; the provided evaluation scripts use "last" by default. For majority voting, the aggregation function passed via --weight_agg serves as the weight function.
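To illustrate how an aggregation function can act as a vote weight, the sketch below sums aggregated scores per distinct answer and returns the answer with the largest total. It is an assumption about what weighted voting looks like in general, not a copy of math_evaluate.py; `weighted_majority_vote`, the candidate format, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical candidate format: (final answer, per-step PRM scores).
Candidate = Tuple[str, List[float]]


def weighted_majority_vote(
    candidates: Sequence[Candidate],
    weight_fn: Callable[[Sequence[float]], float],
) -> str:
    """Return the answer whose candidates have the largest total weight."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, step_scores in candidates:
        # Each candidate votes for its answer with its aggregated score.
        totals[answer] += weight_fn(step_scores)
    # On an exact tie, the answer seen first wins (dict insertion order).
    return max(totals, key=lambda answer: totals[answer])


candidates = [
    ("42", [0.9, 0.8, 0.3]),
    ("42", [0.7, 0.6, 0.4]),
    ("17", [0.5, 0.6, 0.9]),
]

by_last = weighted_majority_vote(candidates, lambda s: s[-1])
by_mean = weighted_majority_vote(candidates, lambda s: sum(s) / len(s))
print(by_last, by_mean)
```

With these made-up scores the two weight functions disagree: weighting by the last step favours "17", while weighting by the mean favours "42", which shows why the weight function is worth choosing deliberately.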
Theoretical Basis
Given a trajectory with step scores $s = (s_1, \dots, s_n)$, the aggregation functions are:
- $\operatorname{agg\_min}(s) = \min_{1 \le i \le n} s_i$
- $\operatorname{agg\_mean}(s) = \frac{1}{n} \sum_{i=1}^{n} s_i$
- $\operatorname{agg\_prod}(s) = \prod_{i=1}^{n} s_i$
- $\operatorname{agg\_last}(s) = s_n$
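As a worked example, take a hypothetical three-step trajectory with scores $s = (0.9, 0.5, 0.8)$; the numbers are chosen purely for illustration. Applying each definition gives:

```latex
\begin{align*}
\operatorname{agg\_min}(s)  &= \min(0.9,\ 0.5,\ 0.8) = 0.5 \\
\operatorname{agg\_mean}(s) &= \tfrac{1}{3}\,(0.9 + 0.5 + 0.8) = \tfrac{2.2}{3} \approx 0.733 \\
\operatorname{agg\_prod}(s) &= 0.9 \times 0.5 \times 0.8 = 0.36 \\
\operatorname{agg\_last}(s) &= s_3 = 0.8
\end{align*}
```

For scores in $(0, 1)$ the product is never larger than the minimum, so it is the harshest of the four on this trajectory.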
The choice of aggregation function reflects different assumptions about how step quality relates to trajectory quality:
- Min assumes the trajectory is only as strong as its weakest step (bottleneck model)
- Mean assumes each step contributes equally to overall quality
- Product models steps as independent success probabilities
- Last assumes the final step score captures cumulative quality (empirically strong for PRM-scored trajectories)
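These assumptions can lead to different rankings of the same candidates. The sketch below compares two hypothetical trajectories, one with strong early steps and a weak ending (A) and one that is uniformly moderate with a better ending (B). The scores and the helper `best_candidate` are invented for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

# Made-up per-step PRM scores for two candidate trajectories.
candidates: Dict[str, List[float]] = {
    "A": [0.9, 0.9, 0.4],  # strong start, weak final step
    "B": [0.6, 0.6, 0.7],  # moderate throughout, better final step
}

aggregators: Dict[str, Callable[[Sequence[float]], float]] = {
    "min": min,
    "mean": lambda s: sum(s) / len(s),
    "prod": math.prod,
    "last": lambda s: s[-1],
}


def best_candidate(agg: Callable[[Sequence[float]], float]) -> str:
    """Return the name of the candidate with the highest aggregated score."""
    return max(candidates, key=lambda name: agg(candidates[name]))


winners = {name: best_candidate(fn) for name, fn in aggregators.items()}
print(winners)
# → {'min': 'B', 'mean': 'A', 'prod': 'A', 'last': 'B'}
```

Min and last prefer B because A's final step is also its weakest, while mean and product prefer A because its two strong early steps outweigh the weak ending. Note also that for scores below 1 the product shrinks as steps are added, so it tends to favour shorter trajectories when candidates differ in length.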