Implementation: Wandb Weave Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, LLM_Operations |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A concrete tool for orchestrating model evaluations, provided by the Wandb Weave library.
Description
The Evaluation class ties together a dataset, scorers, and optional preprocessing into a single evaluable unit. Calling evaluate(model) iterates over the dataset, applies the model and scorers to each example in parallel, and returns summary statistics. The entire evaluation is traced as a Weave call tree.
DatasetLike and ScorerLike type aliases allow auto-casting from plain lists, DataFrames, functions, and other compatible types.
Usage
Construct an Evaluation with a dataset and scorers, then call evaluate() with a model (or Op) to run the evaluation.
Code Reference
Source Location
- Repository: wandb/weave
- File: weave/evaluation/eval.py
- Lines: L61-411
Signature
@register_object
class Evaluation(Object):
    """Sets up an evaluation with scorers and a dataset.

    Calling evaluation.evaluate(model) passes dataset rows into the model,
    runs scorers, and saves results in Weave.
    """

    dataset: DatasetLike
    scorers: list[ScorerLike] | None = None
    preprocess_model_input: PreprocessModelInput | None = None
    trials: int = 1
    metadata: dict[str, Any] | None = None
    evaluation_name: str | CallDisplayNameFunc | None = None

    @op(eager_call_start=True)
    async def evaluate(self, model: Op | Model) -> dict:
        """Run the evaluation loop and return summary statistics."""

    @op
    async def predict_and_score(self, model: Op | Model, example: dict) -> dict:
        """Apply model and all scorers to a single example."""

    @op
    async def summarize(self, eval_table: EvaluationResults) -> dict:
        """Summarize all scores into aggregate statistics."""
Import
import weave
# or
from weave import Evaluation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | DatasetLike | Yes | Dataset, list[dict], or DataFrame (auto-cast) |
| scorers | list[ScorerLike] \| None | No | List of scorers, functions, or Ops (auto-cast) |
| preprocess_model_input | PreprocessModelInput \| None | No | Transform dataset rows before model input |
| trials | int | No | Number of times to evaluate each example (default 1) |
| metadata | dict[str, Any] \| None | No | Metadata attached to the evaluation |
| evaluation_name | str \| CallDisplayNameFunc \| None | No | Custom display name for the evaluation call |
Outputs (evaluate return value and result accessors)
| Name | Type | Description |
|---|---|---|
| return | dict | Per-scorer summary stats plus model_latency stats |
| get_evaluate_calls() | CallsIter | Iterator over all evaluation Call objects |
| get_score_calls() | dict[str, list[Call]] | Scorer calls grouped by trace ID |
| get_scores() | dict[str, dict[str, list[Any]]] | Score outputs organized by trace and scorer |
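The summary returned by `evaluate()` is a nested dict keyed by scorer name. A hedged sketch of reading one (the exact inner keys, such as `true_fraction`, are assumptions based on a typical boolean-scorer summary and may differ by scorer and Weave version):

```python
# Assumed shape of an evaluate() summary for a boolean scorer named
# "match_score"; actual keys may vary.
results = {
    "match_score": {"match": {"true_count": 1, "true_fraction": 0.5}},
    "model_latency": {"mean": 0.012},
}

accuracy = results["match_score"]["match"]["true_fraction"]
latency = results["model_latency"]["mean"]
print(f"accuracy={accuracy:.0%}, mean latency={latency * 1000:.1f} ms")
```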
Usage Examples
Basic Evaluation
import weave
import asyncio

weave.init("my-team/my-project")

@weave.op
def match_score(output: dict, expected: str) -> dict:
    return {"match": output.get("answer") == expected}

@weave.op
def my_model(question: str) -> dict:
    return {"answer": "Paris"}

evaluation = weave.Evaluation(
    dataset=[
        {"question": "Capital of France?", "expected": "Paris"},
        {"question": "Capital of Germany?", "expected": "Berlin"},
    ],
    scorers=[match_score],
)

results = asyncio.run(evaluation.evaluate(my_model))
print(results)
With Model Class
import weave
import asyncio

weave.init("my-team/my-project")

class MyModel(weave.Model):
    prompt_template: str

    @weave.op
    async def predict(self, question: str) -> dict:
        return {"answer": "Paris"}

model = MyModel(prompt_template="Answer: {question}")

evaluation = weave.Evaluation(
    dataset=[{"question": "Capital of France?", "expected": "Paris"}],
    scorers=[match_score],  # match_score as defined in the previous example
)

results = asyncio.run(evaluation.evaluate(model))