Principle:Wandb Weave Result Analysis

Knowledge Sources	Weave Docs, Wandb Weave
Domains	Evaluation, Statistics
Last Updated	2026-02-14 00:00 GMT

Overview

A statistical aggregation mechanism that summarizes per-example evaluation scores into actionable metrics and leaderboard comparisons.

Description

Result Analysis provides two levels of aggregation: (1) auto_summarize computes per-scorer statistics from individual example scores (mean and stderr for numeric scores, true_count and true_fraction for boolean scores), and (2) Leaderboard enables cross-model comparison by organizing evaluation results into a structured ranking.

Usage

Use this principle after running evaluations to interpret results. Summarization runs automatically during evaluation; leaderboards are used when comparing multiple models or evaluation runs.

Theoretical Basis

The summarization algorithm processes each score column:

Numeric values: Compute mean and standard error: $\text{stderr} = \frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the standard deviation of the scores and $n$ is the number of scored rows.
Boolean values: Compute true_count and true_fraction: $\text{true\_fraction} = \frac{\text{true\_count}}{n}$
Nested dicts: Apply recursively to inner values.
None values: Rows where the value is None are excluded from statistics.
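
The four rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Weave implementation: the function name summarize_column is hypothetical, the column type is inferred from the first non-None value (so columns are assumed to hold one type), $\sigma$ is taken to be the population standard deviation, and $n$ counts only the rows that remain after None values are dropped.

```python
import math
import statistics
from typing import Any, Optional


def summarize_column(values: list[Any]) -> Optional[dict[str, Any]]:
    """Summarize one score column; return None when nothing can be summarized.

    Hypothetical helper for illustration only, not part of the Weave API.
    """
    # Rows whose value is None are excluded before any statistic is computed.
    present = [v for v in values if v is not None]
    if not present:
        return None

    # Assumption: every non-None value in a column has the same type.
    first = present[0]

    # bool must be tested before int/float because bool is a subclass of int.
    if isinstance(first, bool):
        true_count = sum(1 for v in present if v)
        return {
            "true_count": true_count,
            "true_fraction": true_count / len(present),
        }

    if isinstance(first, (int, float)):
        n = len(present)
        # Assumption: sigma is the population standard deviation.
        sigma = statistics.pstdev(present)
        return {
            "mean": statistics.fmean(present),
            "stderr": sigma / math.sqrt(n),
        }

    if isinstance(first, dict):
        # Nested dicts: apply the same rules to each inner key.
        summary: dict[str, Any] = {}
        keys = {key for row in present for key in row}
        for key in sorted(keys):
            inner = summarize_column([row.get(key) for row in present])
            if inner is not None:
                summary[key] = inner
        return summary or None

    # Other types (for example strings) are not summarized in this sketch.
    return None
```

Calling summarize_column([True, False, True, None]) drops the None row and reports two true values out of three scored rows. Checking bool before the numeric branch matters in Python, since True and False would otherwise be averaged as 1 and 0.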

Leaderboards organize results by model reference and evaluation column, enabling ranked comparison.
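
The ranking idea can be illustrated without the Weave Leaderboard API. The sketch below is hypothetical: the function rank_models, the model names, and the column name are placeholders, and it assumes each model already has a flat mapping from column name to a summarized metric value.

```python
from typing import Mapping


def rank_models(
    summaries: Mapping[str, Mapping[str, float]],
    column: str,
    higher_is_better: bool = True,
) -> list[tuple[str, float]]:
    """Rank models on one evaluation column.

    Hypothetical helper for illustration only, not part of the Weave API.
    """
    # Keep only the models that actually report the requested column.
    rows = [
        (model, metrics[column])
        for model, metrics in summaries.items()
        if column in metrics
    ]
    # Sort best-first; flip the direction for metrics where lower is better.
    return sorted(rows, key=lambda row: row[1], reverse=higher_is_better)


# Placeholder data: model references mapped to summarized columns.
example_summaries = {
    "model_a": {"accuracy.true_fraction": 0.80, "latency.mean": 1.2},
    "model_b": {"accuracy.true_fraction": 0.90, "latency.mean": 1.5},
    "model_c": {"latency.mean": 0.9},
}

accuracy_ranking = rank_models(example_summaries, "accuracy.true_fraction")
latency_ranking = rank_models(
    example_summaries, "latency.mean", higher_is_better=False
)
```

Models that lack a column are left out of that column's ranking rather than being scored as zero, which avoids penalizing a model for an evaluation it never ran.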

Related Pages

Implemented By

Implementation:Wandb_Weave_Auto_Summarize
