Workflow:Wandb Weave Evaluation Pipeline

From Leeroopedia
Knowledge Sources
Domains LLM_Ops, Evaluation, Quality_Assurance
Last Updated 2026-02-14 11:00 GMT

Overview

End-to-end process for building rigorous, reproducible evaluations of LLM applications using Weave Datasets, Scorers, and the Evaluation framework.

Description

This workflow covers the complete evaluation lifecycle in Weave: constructing a versioned dataset of test examples, defining scorer functions that measure model quality, running an evaluation that applies the model to every example and scores each output, and analyzing the aggregated results. The framework supports both a declarative API (using the Evaluation class) and an imperative API (using EvaluationLogger) for fine-grained control. Scorers can be simple functions, Weave Scorer subclasses, or LLM-as-a-Judge configurations. Results are automatically aggregated with statistics for boolean and numeric scores.

Usage

Execute this workflow when you need to systematically measure the quality of an LLM application across a set of test cases. Common triggers include comparing model versions, validating prompt changes, benchmarking against a baseline, or setting up continuous evaluation as part of a deployment pipeline.

Execution Steps

Step 1: Prepare the Dataset

Create a Weave Dataset containing the test examples. Each row is a dictionary with input fields and optional expected output fields. Datasets can be constructed from Python lists, Pandas DataFrames, or HuggingFace datasets. Publishing the dataset to Weave creates a versioned, immutable snapshot.

Key considerations:

  • Each row should contain all fields needed by both the model and the scorers
  • Datasets are versioned automatically on publish; new rows can be appended with add_rows()
  • Factory methods from_pandas() and from_hf() simplify conversion from common formats
  • Column names should be consistent and match the model predict function's parameter names

Step 2: Define a Model

Create a Weave Model subclass with a predict method decorated with @weave.op. The model encapsulates the inference logic and any configuration parameters (prompt templates, model names, hyperparameters). The predict method receives dataset columns as keyword arguments.

Key considerations:

  • The model must implement one of: predict, infer, forward, or invoke
  • The method must be decorated with @weave.op to enable tracing
  • Model parameters (prompt, temperature, model_name) are tracked as part of the evaluation record
  • A preprocess_model_input function can remap dataset columns to model arguments

Step 3: Define Scorers

Create scoring functions or Scorer subclasses that evaluate model outputs against expected results. Each scorer receives the model output and relevant dataset columns, and returns a score (boolean, numeric, or dictionary). Scorers can also implement a summarize method for custom aggregation.

Key considerations:

  • Simple functions decorated with @weave.op work as scorers when they accept an output parameter
  • Scorer subclasses support column_map to remap dataset column names to scorer argument names
  • Built-in scorers include: Hallucination, Coherence, Fluency, Similarity, RAGAS, Bias, PII detection
  • Multiple scorers can be applied to the same evaluation for multi-dimensional quality assessment

Step 4: Run the Evaluation

Instantiate an Evaluation with the dataset and scorers, then call evaluate() with the model. The framework iterates over every dataset row, applies the model, runs all scorers on each output, and collects results. The evaluation runs asynchronously for performance.

Key considerations:

  • The evaluation applies each scorer independently to every prediction
  • Errors in individual predictions or scores are captured without halting the entire run
  • Model latency is recorded per prediction for performance analysis
  • The evaluation creates a traced call tree linking all predictions and scores

Step 5: Analyze Results

Examine the evaluation results, which include per-row scores and aggregated summaries. Boolean scores produce true/false counts and fractions. Numeric scores produce averages. Results can be compared across evaluation runs using the Weave UI or the Leaderboard feature.

Key considerations:

  • The auto_summarize function handles aggregation for common score types
  • Leaderboards allow comparing multiple evaluation runs with configurable columns
  • Results are persisted in Weave for historical comparison and regression detection
  • The imperative EvaluationLogger API provides fine-grained control for custom evaluation workflows
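For intuition about the aggregation, here is a pure-Python sketch of the statistics described above. This is an illustrative stand-in, not the library's `auto_summarize` implementation: boolean scores become counts and fractions, numeric scores become means.

```python
def summarize_scores(scores):
    """Illustrative stand-in for Weave's automatic score aggregation."""
    if all(isinstance(s, bool) for s in scores):
        true_count = sum(scores)  # True counts as 1, False as 0
        return {"true_count": true_count,
                "true_fraction": true_count / len(scores)}
    # Numeric scores are reduced to their average.
    return {"mean": sum(scores) / len(scores)}

summarize_scores([True, True, False])  # counts and fraction of True
summarize_scores([1.0, 0.5, 0.0])      # mean of numeric scores
```

Comparing these summaries across runs (manually or via Leaderboards) is what enables regression detection over time.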

Execution Diagram

GitHub URL

Workflow Repository