Implementation:Wandb Weave Evaluation

From Leeroopedia
Knowledge Sources
Domains Evaluation, LLM_Operations
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool for orchestrating model evaluations, provided by the Wandb Weave library.

Description

The Evaluation class ties together a dataset, scorers, and optional preprocessing into a single evaluable unit. Calling evaluate(model) iterates over the dataset, applies the model and scorers to each example in parallel, and returns summary statistics. The entire evaluation is traced as a Weave call tree.
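The loop described above can be sketched in plain Python. This is an illustrative model of the semantics only, not Weave's actual implementation: parallelism, tracing, and auto-casting are omitted, and all names are invented for the sketch.

```python
from statistics import mean

def evaluate(model, dataset, scorers):
    """Toy version of the evaluate() loop: apply the model and every
    scorer to each example, then aggregate scores per scorer."""
    per_scorer = {s.__name__: [] for s in scorers}
    for example in dataset:
        output = model(example["question"])
        for scorer in scorers:
            per_scorer[scorer.__name__].append(scorer(output, example["expected"]))
    # Summarize boolean scores as the fraction of true values.
    return {name: {"true_fraction": mean(map(bool, vals))}
            for name, vals in per_scorer.items()}

def match(output, expected):
    return output == expected

results = evaluate(
    lambda q: "Paris",
    [{"question": "Capital of France?", "expected": "Paris"},
     {"question": "Capital of Germany?", "expected": "Berlin"}],
    [match],
)
# One of two examples matches, so results["match"]["true_fraction"] is 0.5.
```

The real evaluate() additionally records every model and scorer call in the Weave call tree and runs examples concurrently.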

DatasetLike and ScorerLike type aliases allow auto-casting from plain lists, DataFrames, functions, and other compatible types.

Usage

Construct an Evaluation with a dataset and scorers, then call evaluate() with a model (or Op) to run the evaluation.

Code Reference

Source Location

  • Repository: wandb/weave
  • File: weave/evaluation/eval.py
  • Lines: L61-411

Signature

@register_object
class Evaluation(Object):
    """Sets up an evaluation with scorers and a dataset.

    Calling evaluation.evaluate(model) passes dataset rows into the model,
    runs scorers, and saves results in Weave.
    """
    dataset: DatasetLike
    scorers: list[ScorerLike] | None = None
    preprocess_model_input: PreprocessModelInput | None = None
    trials: int = 1
    metadata: dict[str, Any] | None = None
    evaluation_name: str | CallDisplayNameFunc | None = None

    @op(eager_call_start=True)
    async def evaluate(self, model: Op | Model) -> dict:
        """Run the evaluation loop and return summary statistics."""

    @op
    async def predict_and_score(self, model: Op | Model, example: dict) -> dict:
        """Apply model and all scorers to a single example."""

    @op
    async def summarize(self, eval_table: EvaluationResults) -> dict:
        """Summarize all scores into aggregate statistics."""

Import

import weave
# or
from weave import Evaluation

I/O Contract

Inputs

Name                    Type                              Required  Description
dataset                 DatasetLike                       Yes       Dataset, list[dict], or DataFrame (auto-cast)
scorers                 list[ScorerLike] | None           No        List of scorers, functions, or Ops (auto-cast)
preprocess_model_input  PreprocessModelInput | None       No        Transform dataset rows before model input
trials                  int                               No        Number of times to evaluate each example (default 1)
metadata                dict[str, Any] | None             No        Metadata attached to the evaluation
evaluation_name         str | CallDisplayNameFunc | None  No        Custom display name
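How trials and preprocess_model_input interact can be shown with a plain-Python sketch (illustrative only, not Weave's implementation): each row is transformed before it reaches the model, and the full dataset pass repeats trials times.

```python
def run_trials(model, dataset, preprocess, trials=1):
    """Illustrative: repeat the dataset pass `trials` times,
    preprocessing each row before it is unpacked into the model."""
    outputs = []
    for _ in range(trials):
        for row in dataset:
            model_input = preprocess(row) if preprocess else row
            outputs.append(model(**model_input))
    return outputs

# Rows carry extra fields (e.g. "expected"); the preprocess step keeps
# only the keys the model accepts.
dataset = [{"question": "Capital of France?", "expected": "Paris"}]
preprocess = lambda row: {"question": row["question"]}
outputs = run_trials(lambda question: "Paris", dataset, preprocess, trials=3)
# 1 row x 3 trials -> 3 model outputs
```

Multiple trials are useful for estimating variance of nondeterministic models, since each example contributes trials score samples.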

Outputs (evaluate)

Name                  Type                             Description
return                dict                             Per-scorer summary stats plus model_latency stats
get_evaluate_calls()  CallsIter                        Iterator over all evaluation Call objects
get_score_calls()     dict[str, list[Call]]            Scorer calls grouped by trace ID
get_scores()          dict[str, dict[str, list[Any]]]  Score outputs organized by trace and scorer
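The shape of the get_scores()-style structure can be illustrated with plain dictionaries. The trace IDs and score values below are invented for illustration; the real objects are Weave Call records.

```python
# Hypothetical get_scores()-shaped structure: score outputs grouped by
# trace ID, then by scorer name (all values invented for this sketch).
scores_by_trace = {
    "trace-1": {"match_score": [{"match": True}]},
    "trace-2": {"match_score": [{"match": False}]},
}

# Flatten into per-scorer lists, mirroring how summary stats are built
# across all traces of an evaluation.
per_scorer = {}
for trace_scores in scores_by_trace.values():
    for scorer_name, outputs in trace_scores.items():
        per_scorer.setdefault(scorer_name, []).extend(outputs)

matches = per_scorer["match_score"]
match_rate = sum(o["match"] for o in matches) / len(matches)
# 1 of 2 traces matched -> match_rate is 0.5
```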

Usage Examples

Basic Evaluation

import weave
import asyncio

weave.init("my-team/my-project")

@weave.op
def match_score(output: dict, expected: str) -> dict:
    return {"match": output.get("answer") == expected}

@weave.op
def my_model(question: str) -> dict:
    return {"answer": "Paris"}

evaluation = weave.Evaluation(
    dataset=[
        {"question": "Capital of France?", "expected": "Paris"},
        {"question": "Capital of Germany?", "expected": "Berlin"},
    ],
    scorers=[match_score],
)

results = asyncio.run(evaluation.evaluate(my_model))
print(results)

With Model Class

import weave
import asyncio

weave.init("my-team/my-project")

@weave.op
def match_score(output: dict, expected: str) -> dict:
    return {"match": output.get("answer") == expected}

class MyModel(weave.Model):
    prompt_template: str

    @weave.op
    async def predict(self, question: str) -> dict:
        return {"answer": "Paris"}

model = MyModel(prompt_template="Answer: {question}")
evaluation = weave.Evaluation(
    dataset=[{"question": "Capital of France?", "expected": "Paris"}],
    scorers=[match_score],
)

results = asyncio.run(evaluation.evaluate(model))

Related Pages

Implements Principle

Requires Environment
