Overview
A concrete tool from the W&B Weave library for computing aggregate evaluation statistics and organizing leaderboard comparisons.
Description
auto_summarize() takes a list of score dictionaries (one per evaluated example) and computes aggregate statistics. For numeric values, it computes mean and standard error. For boolean values, it computes true_count and true_fraction. Nested dictionaries are processed recursively.
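To make the described behavior concrete, here is a minimal self-contained sketch of the same summarization logic (mean/stderr for numerics, true_count/true_fraction for booleans, recursion for nested dicts). This is an illustrative re-implementation, not Weave's actual code; the function name `summarize_sketch` is our own.

```python
import statistics

def summarize_sketch(rows: list[dict]) -> dict:
    """Illustrative re-implementation of auto_summarize-style behavior.

    Not Weave's actual implementation: computes mean/stderr for numeric
    values, true_count/true_fraction for booleans, and recurses into
    nested dicts.
    """
    out = {}
    for key in rows[0]:
        vals = [r[key] for r in rows if key in r]
        # Check bool before int/float: bool is a subclass of int in Python.
        if all(isinstance(v, bool) for v in vals):
            true_count = sum(vals)
            out[key] = {"true_count": true_count,
                        "true_fraction": true_count / len(vals)}
        elif all(isinstance(v, (int, float)) for v in vals):
            mean = statistics.fmean(vals)
            # Standard error = sample standard deviation / sqrt(n).
            stderr = (statistics.stdev(vals) / len(vals) ** 0.5
                      if len(vals) > 1 else 0.0)
            out[key] = {"mean": mean, "stderr": stderr}
        elif all(isinstance(v, dict) for v in vals):
            out[key] = summarize_sketch(vals)  # recurse into nested dicts
        # Other value types are ignored, matching the documented behavior.
    return out

scores = [
    {"correct": True, "metrics": {"latency": 0.5}},
    {"correct": False, "metrics": {"latency": 0.3}},
    {"correct": True, "metrics": {"latency": 0.4}},
]
print(summarize_sketch(scores))
```

Note how the nested `metrics` dict produces a nested summary, mirroring the input structure.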
Leaderboard and get_leaderboard_results() organize results from multiple evaluation runs into a structured comparison across models and scoring columns.
Usage
auto_summarize is called automatically by Scorer.summarize() during evaluation. Use Leaderboard when comparing evaluation results across multiple models.
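The shape of a leaderboard comparison can be sketched without a Weave backend: one row per model, one cell per (scorer, metric) column. The real `Leaderboard`/`get_leaderboard_results` pull these values from logged Weave evaluations; in this sketch the per-model summaries and the `build_rows` helper are hypothetical stand-ins for illustration only.

```python
# Hard-coded summaries standing in for results fetched from logged
# Weave evaluations (hypothetical data, for illustration only).
per_model_summaries = {
    "model-a": {"accuracy": {"true_fraction": 0.91}, "latency": {"mean": 0.42}},
    "model-b": {"accuracy": {"true_fraction": 0.87}, "latency": {"mean": 0.31}},
}

# Each column selects one metric from one scorer's summary.
columns = [("accuracy", "true_fraction"), ("latency", "mean")]

def build_rows(summaries: dict, columns: list[tuple[str, str]]) -> list[dict]:
    """Flatten per-model summaries into leaderboard-style rows."""
    rows = []
    for model, summary in summaries.items():
        row = {"model": model}
        for scorer, metric in columns:
            row[f"{scorer}.{metric}"] = summary[scorer][metric]
        rows.append(row)
    return rows

for row in build_rows(per_model_summaries, columns):
    print(row)
```

Each printed row corresponds to one model's scores across the configured columns, which is the comparison structure a leaderboard presents.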
Code Reference
Source Location
- Repository: wandb/weave
- File: weave/flow/scorer.py (auto_summarize)
- Lines: L134-186
- File: weave/flow/leaderboard.py (Leaderboard)
- Lines: L13-95
Signature
def auto_summarize(data: list) -> dict[str, Any] | None:
    """Automatically summarize a list of (potentially nested) dicts.

    Computes:
    - avg for numeric cols
    - count and fraction for boolean cols
    - other col types are ignored

    Returns:
        dict of summary stats, with structure matching input dict structure.
    """

def get_leaderboard_results(
    spec: Leaderboard, client: WeaveClient
) -> list[LeaderboardModelResult]:
    """Get leaderboard results for a Leaderboard spec and WeaveClient."""
Import
from weave.flow.scorer import auto_summarize
from weave.flow.leaderboard import Leaderboard, get_leaderboard_results
I/O Contract
Inputs (auto_summarize)
| Name | Type | Required | Description |
|------|------|----------|-------------|
| data | list | Yes | List of score dicts from all evaluated examples |
Outputs (auto_summarize)
| Name | Type | Description |
|------|------|-------------|
| return | dict[str, Any] \| None | Summary stats: mean/stderr for numerics, true_count/true_fraction for booleans |
Inputs (get_leaderboard_results)
| Name | Type | Required | Description |
|------|------|----------|-------------|
| spec | Leaderboard | Yes | Leaderboard configuration with columns |
| client | WeaveClient | Yes | Authenticated Weave client |
Outputs (get_leaderboard_results)
| Name | Type | Description |
|------|------|-------------|
| return | list[LeaderboardModelResult] | Per-model results with column scores |
Usage Examples
Auto Summarize
from weave.flow.scorer import auto_summarize
scores = [
{"accuracy": True, "latency": 0.5},
{"accuracy": False, "latency": 0.3},
{"accuracy": True, "latency": 0.4},
]
summary = auto_summarize(scores)
# {
# "accuracy": {"true_count": 2, "true_fraction": 0.667},
# "latency": {"mean": 0.4, "stderr": 0.058}
# }