
Implementation:Wandb Weave Auto Summarize

From Leeroopedia
Knowledge Sources
Domains Evaluation, Statistics
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool, provided by the Wandb Weave library, for computing aggregate evaluation statistics and organizing leaderboard comparisons.

Description

auto_summarize() takes a list of score dictionaries (one per evaluated example) and computes aggregate statistics. For numeric values, it computes mean and standard error. For boolean values, it computes true_count and true_fraction. Nested dictionaries are processed recursively.

Leaderboard and get_leaderboard_results() organize results from multiple evaluation runs into a structured comparison across models and scoring columns.

Usage

auto_summarize is called automatically by Scorer.summarize() during evaluation. Use Leaderboard when comparing evaluation results across multiple models.

Code Reference

Source Location

  • Repository: wandb/weave
  • File: weave/flow/scorer.py (auto_summarize)
  • Lines: L134-186
  • File: weave/flow/leaderboard.py (Leaderboard)
  • Lines: L13-95

Signature

def auto_summarize(data: list) -> dict[str, Any] | None:
    """Automatically summarize a list of (potentially nested) dicts.

    Computes:
        - avg for numeric cols
        - count and fraction for boolean cols
        - other col types are ignored

    Returns:
        dict of summary stats, with structure matching input dict structure.
    """

def get_leaderboard_results(
    spec: Leaderboard, client: WeaveClient
) -> list[LeaderboardModelResult]:
    """Get leaderboard results for a Leaderboard spec and WeaveClient."""

Import

from weave.flow.scorer import auto_summarize
from weave.flow.leaderboard import Leaderboard, get_leaderboard_results

I/O Contract

Inputs (auto_summarize)

Name Type Required Description
data list Yes List of score dicts from all evaluated examples

Outputs (auto_summarize)

Name Type Description
return dict[str, Any] | None Summary stats: mean/stderr for numeric fields, true_count/true_fraction for booleans; None if there is nothing to summarize

Inputs (get_leaderboard_results)

Name Type Required Description
spec Leaderboard Yes Leaderboard configuration with columns
client WeaveClient Yes Authenticated Weave client

Outputs (get_leaderboard_results)

Name Type Description
return list[LeaderboardModelResult] Per-model results with column scores

Usage Examples

Auto Summarize

from weave.flow.scorer import auto_summarize

scores = [
    {"accuracy": True, "latency": 0.5},
    {"accuracy": False, "latency": 0.3},
    {"accuracy": True, "latency": 0.4},
]

summary = auto_summarize(scores)
# {
#   "accuracy": {"true_count": 2, "true_fraction": 0.667},
#   "latency": {"mean": 0.4, "stderr": 0.058}
# }
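Leaderboard

For cross-model comparison, a Leaderboard spec can be assembled along the following lines. This is a configuration sketch: the project name, evaluation object ref, and scorer name are placeholders, and the exact LeaderboardColumn fields may differ by weave version:

```python
import weave
from weave.flow.leaderboard import (
    Leaderboard,
    LeaderboardColumn,
    get_leaderboard_results,
)

# Placeholder project; weave.init returns an authenticated WeaveClient.
client = weave.init("my-entity/my-project")

spec = Leaderboard(
    name="qa-leaderboard",
    columns=[
        LeaderboardColumn(
            # Placeholder ref to a published Evaluation object.
            evaluation_object_ref="weave:///my-entity/my-project/object/Evaluation:latest",
            scorer_name="accuracy_scorer",  # placeholder scorer name
            summary_metric_path="accuracy.true_fraction",
        ),
    ],
)

# One LeaderboardModelResult per evaluated model, carrying its column scores.
results = get_leaderboard_results(spec, client)
```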

Related Pages

Implements Principle

Requires Environment
