# Principle: FlowiseAI Flowise Evaluation Run Creation
| Property | Value |
|---|---|
| Principle Name | Evaluation_Run_Creation |
| Overview | Technique for configuring and executing systematic evaluation runs that test chatflows against datasets using defined evaluators |
| Domain | AI Evaluation, Pipeline Orchestration, Quality Assurance |
| Source | FlowiseAI/Flowise repository: packages/ui/src/api/evaluations.js |
| Last Updated | 2026-02-12 14:00 GMT |
## Description
An evaluation run combines a dataset, one or more chatflows, and selected evaluators into a single test execution. The system processes each dataset row through each chatflow, then applies all selected evaluators to the results. Multiple chatflows can be compared in a single run, enabling side-by-side quality assessment.
The evaluation run configuration binds together:
- Dataset: The input/expected-output pairs that serve as test cases.
- Chatflows: One or more chatflows (systems under test) that process each input.
- Simple evaluators: Text, JSON, and numeric evaluators applied to each response.
- LLM evaluators: AI-based grading evaluators that assess response quality using a separate model.
The run supports two evaluation types:
- `llm`: Uses LLM-based evaluators to grade responses.
- `benchmarking`: Uses simple evaluators (text, JSON, numeric) to check responses.
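The simple evaluator checks used in `benchmarking` runs can be illustrated with minimal predicates. This is a sketch only; the names and signatures below are hypothetical, not Flowise's actual evaluator implementations:

```javascript
// Hypothetical sketches of the three simple evaluator kinds.
// Each check takes (expected, actual) strings and returns pass/fail.
const simpleEvaluators = {
  // Text evaluator: does the response contain the expected text?
  textContains: (expected, actual) => actual.includes(expected),
  // JSON evaluator: is the response parseable as JSON?
  jsonValid: (_expected, actual) => {
    try {
      JSON.parse(actual)
      return true
    } catch {
      return false
    }
  },
  // Numeric evaluator: do the values compare equal as numbers?
  numericEquals: (expected, actual) => Number(expected) === Number(actual)
}
```

Each predicate is stateless, so the same evaluator can be applied to every dataset row independently.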
An optional `datasetAsOneConversation` flag controls whether all dataset rows are sent as one continuous conversation or as independent queries.
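Putting these pieces together, a run configuration might be validated and assembled as in the following sketch. The field and function names are illustrative assumptions, not the actual Flowise payload shape:

```javascript
// Hypothetical run-configuration builder; field names are illustrative.
function buildEvaluationRunConfig({
  datasetId,                        // dataset of input/expected-output pairs
  chatflowIds,                      // one or more systems under test
  evaluatorIds,                     // simple and/or LLM evaluators to apply
  evaluationType,                   // 'llm' or 'benchmarking'
  datasetAsOneConversation = false  // rows as one conversation vs. independent queries
}) {
  if (!['llm', 'benchmarking'].includes(evaluationType)) {
    throw new Error(`Unknown evaluation type: ${evaluationType}`)
  }
  if (!Array.isArray(chatflowIds) || chatflowIds.length === 0) {
    throw new Error('At least one chatflow is required')
  }
  return { datasetId, chatflowIds, evaluatorIds, evaluationType, datasetAsOneConversation }
}
```

Validating up front keeps a misconfigured run from failing midway through row processing.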
## Usage
Use evaluation run creation when executing a systematic quality assessment of one or more chatflows against a test dataset. This is the central orchestration step in the Evaluation Pipeline workflow, requiring:
- A pre-existing dataset (created via Dataset Creation)
- One or more configured evaluators (created via Evaluator Definition)
- One or more target chatflows to evaluate
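With those prerequisites in place, the UI submits the run configuration to the Flowise server. The endpoint path and response shape below are assumptions for illustration; consult the actual evaluations.js client for the real routes:

```javascript
// Hypothetical sketch: POST the run configuration to the server.
// The '/api/v1/evaluations' path is an assumption, not a confirmed route.
async function createEvaluationRun(baseUrl, config, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(`${baseUrl}/api/v1/evaluations`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(config)
  })
  if (!res.ok) throw new Error(`Evaluation run creation failed: ${res.status}`)
  return res.json() // e.g. the created run record with its id
}
```

Injecting `fetchImpl` keeps the sketch testable without a live server.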
## Theoretical Basis
This principle follows the evaluation pipeline orchestration pattern. The run configuration binds dataset (input/expected pairs) to chatflows (systems under test) and evaluators (scoring criteria). Execution processes each row independently, enabling parallel evaluation and per-row metrics.
Key design aspects:
- Cartesian evaluation: Each dataset row is evaluated against each selected chatflow, producing a matrix of results. This enables direct comparison of chatflow quality under identical test conditions.
- Evaluator composition: Multiple evaluators of different types can be combined in a single run, providing multi-dimensional quality assessment without requiring separate test executions.
- Independent row processing: Each dataset row is processed as an independent test case (unless `datasetAsOneConversation` is enabled), ensuring that one test failure does not affect others.
- Versioned execution: Each run creates a versioned snapshot of results, enabling longitudinal tracking of chatflow quality.
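The Cartesian evaluation described above can be sketched as a nested mapping over rows and chatflows. Here `predict` and `check` are hypothetical stand-ins for chatflow execution and evaluator scoring, not actual Flowise interfaces:

```javascript
// Hypothetical sketch of Cartesian evaluation: every dataset row is run
// through every chatflow, and every evaluator scores every result,
// producing a rows x chatflows matrix of scored results.
function runCartesianEvaluation(rows, chatflows, evaluators) {
  return rows.map((row) =>
    chatflows.map((chatflow) => {
      const actual = chatflow.predict(row.input) // stand-in for a chatflow call
      return {
        chatflow: chatflow.name,
        input: row.input,
        actual,
        scores: evaluators.map((ev) => ({
          evaluator: ev.name,
          pass: ev.check(row.expectedOutput, actual)
        }))
      }
    })
  )
}
```

Because each cell of the matrix depends only on its own row and chatflow, rows can be processed in parallel, and per-row metrics fall out directly from the structure.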