Principle:Wandb Weave Result Analysis
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Statistics |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A statistical aggregation mechanism that summarizes per-example evaluation scores into actionable metrics and leaderboard comparisons.
Description
Result Analysis provides two levels of aggregation: (1) auto_summarize computes per-scorer statistics (mean, stderr, count, fraction) from individual example scores, and (2) Leaderboard enables cross-model comparison by organizing evaluation results into a structured ranking.
Usage
Apply this principle after running evaluations to interpret the results. Summarization runs automatically as part of evaluation; leaderboards are used when comparing multiple models or evaluation runs.
Theoretical Basis
The summarization algorithm processes each score column:
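- Numeric values: Compute the mean and standard error over the $n$ non-None values $x_1, \dots, x_n$: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\text{stderr} = \frac{s}{\sqrt{n}}$, where $s$ is the sample standard deviation.
- Boolean values: Compute true_count and true_fraction, where $n$ is the number of non-None values: $\text{fraction} = \frac{\text{true\_count}}{n}$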
- Nested dicts: Apply recursively to inner values.
- None values: Rows where the value is None are excluded from statistics.
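The rules above can be sketched in plain Python. This is a minimal illustration, not Weave's actual implementation: `summarize_column` and `standard_error` are hypothetical names, the type of a column is assumed to be decided by its first non-None value, and the standard error is assumed to use the sample standard deviation (reported as 0 when fewer than two values are present).

```python
import math
from numbers import Number
from typing import Optional


def standard_error(values: list) -> float:
    """Standard error of the mean, based on the sample standard deviation."""
    n = len(values)
    if n < 2:
        return 0.0  # assumption: report 0 when the sample variance is undefined
    mean = sum(values) / n
    sample_variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(sample_variance / n)


def summarize_column(values: list) -> Optional[dict]:
    """Summarize one score column (hypothetical stand-in for auto_summarize)."""
    # None values are excluded before any statistic is computed.
    present = [v for v in values if v is not None]
    if not present:
        return None
    first = present[0]
    # bool is checked before Number because bool is a subclass of int.
    if isinstance(first, bool):
        true_count = sum(1 for v in present if v)
        return {"true_count": true_count, "true_fraction": true_count / len(present)}
    if isinstance(first, Number):
        return {"mean": sum(present) / len(present), "stderr": standard_error(present)}
    if isinstance(first, dict):
        # Nested dicts: recurse on each inner key.
        summary = {}
        for key in first:
            inner = summarize_column([row.get(key) for row in present])
            if inner is not None:
                summary[key] = inner
        return summary
    return None  # other value types are ignored in this sketch


# Example: three rows of scores, one of which is missing entirely.
rows = [
    {"correct": True, "similarity": 1.0},
    None,
    {"correct": False, "similarity": 3.0},
]
summary = summarize_column(rows)
print(summary)
```

The resulting summary mirrors the structure of the input scores, with each leaf column replaced by its statistics.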
Leaderboards organize results by model reference and evaluation column, enabling ranked comparison.
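The ranking idea can be illustrated without the Weave API. The sketch below is an assumption-laden stand-in, not the Leaderboard class itself: `rank_models` is a hypothetical helper, the model references are made-up strings, and each model is assumed to carry a nested summary dict from which one metric is selected by a key path.

```python
def rank_models(summaries: dict, metric_path: tuple, higher_is_better: bool = True) -> list:
    """Return (model_ref, value) pairs for one metric column, best first."""
    rows = []
    for model_ref, summary in summaries.items():
        value = summary
        # Walk the key path into the nested summary; missing keys yield None.
        for key in metric_path:
            value = value.get(key) if isinstance(value, dict) else None
        if value is not None:
            rows.append((model_ref, value))
    rows.sort(key=lambda row: row[1], reverse=higher_is_better)
    return rows


# Hypothetical summaries keyed by model reference.
summaries = {
    "model-a": {"accuracy": {"true_fraction": 0.8}},
    "model-b": {"accuracy": {"true_fraction": 0.9}},
    "model-c": {"accuracy": None},  # no usable scores for this column
}
ranking = rank_models(summaries, ("accuracy", "true_fraction"))
print(ranking)
```

Models without a value for the selected column are left out of the ranking in this sketch; a real leaderboard may choose to display them as empty cells instead.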