# Principle: FlowiseAI Flowise Evaluation Run Creation
| Property | Value |
|---|---|
| Principle Name | Evaluation_Run_Creation |
| Overview | Technique for configuring and executing systematic evaluation runs that test chatflows against datasets using defined evaluators |
| Domain | AI Evaluation, Pipeline Orchestration, Quality Assurance |
| Source | FlowiseAI/Flowise repository: packages/ui/src/api/evaluations.js |
| Last Updated | 2026-02-12 14:00 GMT |
## Description
An evaluation run combines a dataset, one or more chatflows, and selected evaluators into a single test execution. The system processes each dataset row through each chatflow, then applies all selected evaluators to the results. Multiple chatflows can be compared in a single run, enabling side-by-side quality assessment.
The evaluation run configuration binds together:
- Dataset: The input/expected-output pairs that serve as test cases.
- Chatflows: One or more chatflows (systems under test) that process each input.
- Simple evaluators: Text, JSON, and numeric evaluators applied to each response.
- LLM evaluators: AI-based grading evaluators that assess response quality using a separate model.
The run supports two evaluation types:
- `llm`: Uses LLM-based evaluators to grade responses.
- `benchmarking`: Uses simple evaluators (text, JSON, numeric) to check responses.
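The simple evaluator checks used in `benchmarking` runs can be illustrated with minimal predicates. This is a sketch only; the names and signatures below are hypothetical, not Flowise's actual evaluator implementations:

```javascript
// Hypothetical sketches of the three simple evaluator kinds.
// Each check takes (expected, actual) strings and returns pass/fail.
const simpleEvaluators = {
  // Text evaluator: does the response contain the expected text?
  textContains: (expected, actual) => actual.includes(expected),
  // JSON evaluator: is the response parseable as JSON?
  jsonValid: (_expected, actual) => {
    try {
      JSON.parse(actual)
      return true
    } catch {
      return false
    }
  },
  // Numeric evaluator: do the values compare equal as numbers?
  numericEquals: (expected, actual) => Number(expected) === Number(actual)
}
```

Each predicate is stateless, so the same evaluator can be applied to every dataset row independently.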
An optional `datasetAsOneConversation` flag controls whether all dataset rows are sent as one continuous conversation or as independent queries.
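Putting these pieces together, a run configuration might be validated and assembled as in the following sketch. The field and function names are illustrative assumptions, not the actual Flowise payload shape:

```javascript
// Hypothetical run-configuration builder; field names are illustrative.
function buildEvaluationRunConfig({
  datasetId,                        // dataset of input/expected-output pairs
  chatflowIds,                      // one or more systems under test
  evaluatorIds,                     // simple and/or LLM evaluators to apply
  evaluationType,                   // 'llm' or 'benchmarking'
  datasetAsOneConversation = false  // rows as one conversation vs. independent queries
}) {
  if (!['llm', 'benchmarking'].includes(evaluationType)) {
    throw new Error(`Unknown evaluation type: ${evaluationType}`)
  }
  if (!Array.isArray(chatflowIds) || chatflowIds.length === 0) {
    throw new Error('At least one chatflow is required')
  }
  return { datasetId, chatflowIds, evaluatorIds, evaluationType, datasetAsOneConversation }
}
```

Validating up front keeps a misconfigured run from failing midway through row processing.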
## Usage
Use evaluation run creation when executing a systematic quality assessment of one or more chatflows against a test dataset. This is the central orchestration step in the Evaluation Pipeline workflow, requiring:
- A pre-existing dataset (created via Dataset Creation)
- One or more configured evaluators (created via Evaluator Definition)
- One or more target chatflows to evaluate
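With those prerequisites in place, the UI submits the run configuration to the Flowise server. The endpoint path and response shape below are assumptions for illustration; consult the actual evaluations.js client for the real routes:

```javascript
// Hypothetical sketch: POST the run configuration to the server.
// The '/api/v1/evaluations' path is an assumption, not a confirmed route.
async function createEvaluationRun(baseUrl, config, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(`${baseUrl}/api/v1/evaluations`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(config)
  })
  if (!res.ok) throw new Error(`Evaluation run creation failed: ${res.status}`)
  return res.json() // e.g. the created run record with its id
}
```

Injecting `fetchImpl` keeps the sketch testable without a live server.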
## Theoretical Basis
This principle follows the evaluation pipeline orchestration pattern. The run configuration binds dataset (input/expected pairs) to chatflows (systems under test) and evaluators (scoring criteria). Execution processes each row independently, enabling parallel evaluation and per-row metrics.
Key design aspects:
- Cartesian evaluation: Each dataset row is evaluated against each selected chatflow, producing a matrix of results. This enables direct comparison of chatflow quality under identical test conditions.
- Evaluator composition: Multiple evaluators of different types can be combined in a single run, providing multi-dimensional quality assessment without requiring separate test executions.
- Independent row processing: Each dataset row is processed as an independent test case (unless `datasetAsOneConversation` is enabled), ensuring that one test failure does not affect others.
- Versioned execution: Each run creates a versioned snapshot of results, enabling longitudinal tracking of chatflow quality.
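The Cartesian evaluation described above can be sketched as a nested mapping over rows and chatflows. Here `predict` and `check` are hypothetical stand-ins for chatflow execution and evaluator scoring, not actual Flowise interfaces:

```javascript
// Hypothetical sketch of Cartesian evaluation: every dataset row is run
// through every chatflow, and every evaluator scores every result,
// producing a rows x chatflows matrix of scored results.
function runCartesianEvaluation(rows, chatflows, evaluators) {
  return rows.map((row) =>
    chatflows.map((chatflow) => {
      const actual = chatflow.predict(row.input) // stand-in for a chatflow call
      return {
        chatflow: chatflow.name,
        input: row.input,
        actual,
        scores: evaluators.map((ev) => ({
          evaluator: ev.name,
          pass: ev.check(row.expectedOutput, actual)
        }))
      }
    })
  )
}
```

Because each cell of the matrix depends only on its own row and chatflow, rows can be processed in parallel, and per-row metrics fall out directly from the structure.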