Implementation: Wandb Weave Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, LLM_Operations |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A concrete tool for orchestrating model evaluations, provided by the Wandb Weave library.
Description
The Evaluation class ties together a dataset, scorers, and optional preprocessing into a single evaluable unit. Calling evaluate(model) iterates over the dataset, applies the model and scorers to each example in parallel, and returns summary statistics. The entire evaluation is traced as a Weave call tree.
DatasetLike and ScorerLike type aliases allow auto-casting from plain lists, DataFrames, functions, and other compatible types.
Usage
Construct an Evaluation with a dataset and scorers, then call evaluate() with a model (or Op) to run the evaluation.
Code Reference
Source Location
- Repository: wandb/weave
- File: weave/evaluation/eval.py
- Lines: L61-411
Signature
@register_object
class Evaluation(Object):
    """Sets up an evaluation with scorers and a dataset.

    Calling evaluation.evaluate(model) passes dataset rows into the model,
    runs scorers, and saves results in Weave.
    """

    dataset: DatasetLike
    scorers: list[ScorerLike] | None = None
    preprocess_model_input: PreprocessModelInput | None = None
    trials: int = 1
    metadata: dict[str, Any] | None = None
    evaluation_name: str | CallDisplayNameFunc | None = None

    @op(eager_call_start=True)
    async def evaluate(self, model: Op | Model) -> dict:
        """Run the evaluation loop and return summary statistics."""

    @op
    async def predict_and_score(self, model: Op | Model, example: dict) -> dict:
        """Apply model and all scorers to a single example."""

    @op
    async def summarize(self, eval_table: EvaluationResults) -> dict:
        """Summarize all scores into aggregate statistics."""
Import
import weave
# or
from weave import Evaluation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | DatasetLike | Yes | Dataset, list[dict], or DataFrame (auto-cast) |
| scorers | list[ScorerLike] \| None | No | List of scorers, functions, or Ops (auto-cast) |
| preprocess_model_input | PreprocessModelInput \| None | No | Transform dataset rows before model input |
| trials | int | No | Number of times to evaluate each example (default 1) |
| metadata | dict[str, Any] \| None | No | Metadata attached to the evaluation |
| evaluation_name | str \| CallDisplayNameFunc \| None | No | Custom display name for the evaluation call |
Outputs (evaluate return value and result accessors)
| Name | Type | Description |
|---|---|---|
| return | dict | Per-scorer summary stats plus model_latency stats |
| get_evaluate_calls() | CallsIter | Iterator over all evaluation Call objects |
| get_score_calls() | dict[str, list[Call]] | Scorer calls grouped by trace ID |
| get_scores() | dict[str, dict[str, list[Any]]] | Score outputs organized by trace and scorer |
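The summary returned by `evaluate()` is a nested dict keyed by scorer name. A hedged sketch of reading one (the exact inner keys, such as `true_fraction`, are assumptions based on a typical boolean-scorer summary and may differ by scorer and Weave version):

```python
# Assumed shape of an evaluate() summary for a boolean scorer named
# "match_score"; actual keys may vary.
results = {
    "match_score": {"match": {"true_count": 1, "true_fraction": 0.5}},
    "model_latency": {"mean": 0.012},
}

accuracy = results["match_score"]["match"]["true_fraction"]
latency = results["model_latency"]["mean"]
print(f"accuracy={accuracy:.0%}, mean latency={latency * 1000:.1f} ms")
```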
Usage Examples
Basic Evaluation
import weave
import asyncio

weave.init("my-team/my-project")

@weave.op
def match_score(output: dict, expected: str) -> dict:
    return {"match": output.get("answer") == expected}

@weave.op
def my_model(question: str) -> dict:
    return {"answer": "Paris"}

evaluation = weave.Evaluation(
    dataset=[
        {"question": "Capital of France?", "expected": "Paris"},
        {"question": "Capital of Germany?", "expected": "Berlin"},
    ],
    scorers=[match_score],
)

results = asyncio.run(evaluation.evaluate(my_model))
print(results)
With Model Class
import weave
import asyncio

weave.init("my-team/my-project")

class MyModel(weave.Model):
    prompt_template: str

    @weave.op
    async def predict(self, question: str) -> dict:
        return {"answer": "Paris"}

model = MyModel(prompt_template="Answer: {question}")

evaluation = weave.Evaluation(
    dataset=[{"question": "Capital of France?", "expected": "Paris"}],
    scorers=[match_score],  # match_score as defined in the previous example
)

results = asyncio.run(evaluation.evaluate(model))