Overview
A concrete tool, provided by the Wandb Weave library, for defining evaluation scoring functions.
Description
The Scorer base class provides the standard interface for evaluation metrics. Subclasses implement a score() method (decorated with @weave.op) that receives model output and dataset fields. The optional column_map field remaps dataset column names to scorer parameters.
The default summarize() method delegates to auto_summarize(), which computes mean/stderr for numerics and true_count/true_fraction for booleans.
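The aggregation described above can be illustrated with a small pure-Python sketch. It is not Weave's implementation: `summarize_column` and `summarize_rows` are hypothetical names, nested dicts are not handled, and the standard error is assumed to be the sample standard deviation (ddof = 1) divided by the square root of the number of rows.

```python
import math
from typing import Any, Optional


def summarize_column(values: list) -> Optional[dict]:
    """Summarize one score column (hypothetical helper, not part of Weave)."""
    values = [v for v in values if v is not None]
    if not values:
        return None
    # bool is a subclass of int, so booleans must be checked first.
    if all(isinstance(v, bool) for v in values):
        true_count = sum(values)
        return {"true_count": true_count, "true_fraction": true_count / len(values)}
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        n = len(values)
        mean = sum(values) / n
        if n > 1:
            # Assumption: sample variance (ddof=1), then sqrt(variance / n).
            variance = sum((v - mean) ** 2 for v in values) / (n - 1)
            stderr = math.sqrt(variance / n)
        else:
            stderr = 0.0
        return {"mean": mean, "stderr": stderr}
    # Columns that are neither boolean nor numeric are skipped.
    return None


def summarize_rows(rows: list) -> dict:
    """Group score dicts by key and summarize each column."""
    columns: dict = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    summary: dict = {}
    for key, values in columns.items():
        column_summary = summarize_column(values)
        if column_summary is not None:
            summary[key] = column_summary
    return summary


rows = [{"match": True, "score": 1.0}, {"match": False, "score": 3.0}]
summary = summarize_rows(rows)
print(summary)
```

A scorer whose score() returns a dict with one boolean key and one numeric key would, under these assumptions, produce one summary entry per key.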
Usage
Subclass Scorer and implement score() to create custom evaluation metrics. Alternatively, decorate a plain function with @weave.op for simple scoring logic.
Code Reference
Source Location
- Repository: wandb/weave
- File: weave/flow/scorer.py
- Lines: L30-186 (Scorer class + auto_summarize)
Signature
class Scorer(Object):
    column_map: dict[str, str] | None = Field(
        default=None,
        description="A mapping from dataset column names to scorer parameter names",
    )

    @op
    def score(self, *, output: Any, **kwargs: Any) -> Any:
        """Score model output. Must be overridden by subclasses."""
        raise NotImplementedError

    @op
    def summarize(self, score_rows: list) -> dict | None:
        """Summarize scores. Defaults to auto_summarize."""
        return auto_summarize(score_rows)


def auto_summarize(data: list) -> dict[str, Any] | None:
    """Automatically summarize a list of (potentially nested) dicts.
    Computes avg for numeric cols, count/fraction for boolean cols.
    """
Import
import weave
# or
from weave import Scorer
I/O Contract
Inputs (score)
| Name | Type | Required | Description |
| output | Any | Yes | Model prediction output (keyword-only) |
| **kwargs | Any | Varies | Dataset columns mapped via column_map or matched by name |
Outputs (score)
| Name | Type | Description |
| return | Any | Score result (dict, bool, float, or WeaveScorerResult) |
Inputs (auto_summarize)
| Name | Type | Required | Description |
| data | list | Yes | List of score dicts from all examples |
Outputs (auto_summarize)
| Name | Type | Description |
| return | dict[str, Any] or None | mean/stderr for numerics, true_count/true_fraction for booleans |
Usage Examples
Class-Based Scorer
import weave

class ExactMatchScorer(weave.Scorer):
    @weave.op
    def score(self, *, output: dict, expected: str) -> dict:
        return {"match": output.get("answer") == expected}
Function-Based Scorer
import weave

@weave.op
def match_score(output: dict, expected: str) -> dict:
    return {"match": output.get("answer") == expected}
With Column Mapping
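import weave

class MyScorer(weave.Scorer):
    @weave.op
    def score(self, *, output: dict, expected: str) -> dict:
        return {"match": output.get("answer") == expected}

# The dataset has a "ground_truth" column; column_map feeds it to the "expected" parameter.
# column_map is passed at construction because Scorer is a Pydantic model, and
# overriding a field on the class without a type annotation raises an error.
scorer = MyScorer(column_map={"expected": "ground_truth"})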
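To make the mapping direction concrete, here is a minimal pure-Python sketch of how a column_map could be resolved into keyword arguments for score(). It does not import Weave and is not Weave's implementation; `build_score_kwargs` is a hypothetical helper, and the lookup rule (mapped name first, then a dataset column with the same name as the parameter) is an assumption based on the I/O contract above.

```python
import inspect
from typing import Any, Callable, Optional


def build_score_kwargs(
    score_fn: Callable[..., Any],
    row: dict,
    output: Any,
    column_map: Optional[dict] = None,
) -> dict:
    """Collect keyword arguments for score_fn from one dataset row (hypothetical helper)."""
    column_map = column_map or {}
    kwargs: dict = {}
    for name in inspect.signature(score_fn).parameters:
        if name == "output":
            # The model prediction is always passed as `output`.
            kwargs[name] = output
            continue
        # Keys are scorer parameter names, values are dataset column names.
        # Without a mapping entry, fall back to a column with the same name.
        column = column_map.get(name, name)
        if column in row:
            kwargs[name] = row[column]
    return kwargs


def exact_match(*, output: dict, expected: str) -> dict:
    """Plain-function stand-in for a scorer's score() method."""
    return {"match": output.get("answer") == expected}


row = {"question": "2 + 2?", "ground_truth": "4"}
kwargs = build_score_kwargs(
    exact_match, row, {"answer": "4"}, column_map={"expected": "ground_truth"}
)
result = exact_match(**kwargs)
print(kwargs, result)
```

Dataset columns that the scoring function does not declare (here, "question") are simply not passed along in this sketch.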
Related Pages
Implements Principle
Requires Environment