Principle:FlowiseAI Flowise Evaluation Run Creation

From Leeroopedia
Principle Name: Evaluation_Run_Creation
Overview: Technique for configuring and executing systematic evaluation runs that test chatflows against datasets using defined evaluators
Domain: AI Evaluation, Pipeline Orchestration, Quality Assurance
Source: FlowiseAI/Flowise repository, packages/ui/src/api/evaluations.js
Last Updated: 2026-02-12 14:00 GMT

Description

An evaluation run combines a dataset, one or more chatflows, and selected evaluators into a single test execution. The system processes each dataset row through each chatflow, then applies all selected evaluators to the results. Multiple chatflows can be compared in a single run, enabling side-by-side quality assessment.

The evaluation run configuration binds together:

  • Dataset: The input/expected-output pairs that serve as test cases.
  • Chatflows: One or more chatflows (systems under test) that process each input.
  • Simple evaluators: Text, JSON, and numeric evaluators applied to each response.
  • LLM evaluators: AI-based grading evaluators that assess response quality using a separate model.

The run supports two evaluation types:

  • llm: Uses LLM-based evaluators to grade responses.
  • benchmarking: Uses simple evaluators (text, JSON, numeric) to check responses.

An optional datasetAsOneConversation flag controls whether all dataset rows are sent as a continuous conversation or as independent queries.
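A minimal sketch of what such a run configuration might look like, assuming illustrative field names (only `datasetAsOneConversation` and the `llm`/`benchmarking` evaluation types are taken from the description above; the rest of the shape is an assumption, not the exact Flowise schema):

```javascript
// Build a hypothetical evaluation run configuration object, validating
// the bindings the description requires: a dataset, at least one
// chatflow, and a recognized evaluation type.
function buildEvaluationRun({
  datasetId,
  chatflowIds = [],
  evaluatorIds = [],
  evaluationType,
  datasetAsOneConversation = false,
}) {
  if (!datasetId) {
    throw new Error('An evaluation run requires a dataset');
  }
  if (chatflowIds.length === 0) {
    throw new Error('An evaluation run requires at least one chatflow');
  }
  if (!['llm', 'benchmarking'].includes(evaluationType)) {
    throw new Error("evaluationType must be 'llm' or 'benchmarking'");
  }
  return {
    datasetId,                  // input/expected-output test cases
    chatflowIds,                // one or more systems under test
    evaluatorIds,               // simple and/or LLM evaluators to apply
    evaluationType,             // 'llm' (AI grading) or 'benchmarking' (simple checks)
    datasetAsOneConversation,   // true: rows form one continuous conversation
  };
}
```

Validating the type and the non-empty chatflow list up front mirrors the binding role the run plays: a run without a dataset, chatflow, or evaluation type has nothing to orchestrate.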

Usage

Use evaluation run creation to execute a systematic quality assessment of one or more chatflows against a test dataset. This is the central orchestration step in the Evaluation Pipeline workflow, requiring:

  • A pre-existing dataset (created via Dataset Creation)
  • One or more configured evaluators (created via Evaluator Definition)
  • One or more target chatflows to evaluate
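Once those prerequisites exist, the UI client submits the run to the server. The sketch below shows how such a request might be assembled; the endpoint path and payload shape are assumptions for illustration, not the exact Flowise API surface:

```javascript
// Assemble a hypothetical HTTP request that a UI API client (such as
// evaluations.js) might send to create a new evaluation run.
function createEvaluationRequest(config) {
  return {
    method: 'POST',
    url: '/api/v1/evaluations',            // hypothetical endpoint
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(config),          // the run configuration payload
  };
}
```

Separating request assembly from transport keeps the configuration testable without a running server.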

Theoretical Basis

This principle follows the evaluation pipeline orchestration pattern. The run configuration binds dataset (input/expected pairs) to chatflows (systems under test) and evaluators (scoring criteria). Execution processes each row independently, enabling parallel evaluation and per-row metrics.

Key design aspects:

  • Cartesian evaluation: Each dataset row is evaluated against each selected chatflow, producing a matrix of results. This enables direct comparison of chatflow quality under identical test conditions.
  • Evaluator composition: Multiple evaluators of different types can be combined in a single run, providing multi-dimensional quality assessment without requiring separate test executions.
  • Independent row processing: Each dataset row is processed as an independent test case (unless datasetAsOneConversation is enabled), ensuring that one test failure does not affect others.
  • Versioned execution: Each run creates a versioned snapshot of results, enabling longitudinal tracking of chatflow quality.
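The Cartesian and composition aspects above can be sketched as a nested loop: every dataset row runs through every chatflow, and every evaluator scores each result. `runChatflow` and the evaluator objects here are stand-ins for the real components, not Flowise internals:

```javascript
// Cartesian execution pattern: rows × chatflows results, each scored
// by every evaluator. Each row is processed independently, so one
// failing case does not affect the others.
function runEvaluation(rows, chatflows, evaluators, runChatflow) {
  const results = [];
  for (const row of rows) {
    for (const chatflow of chatflows) {
      const actual = runChatflow(chatflow, row.input);
      const scores = {};
      for (const ev of evaluators) {
        scores[ev.name] = ev.evaluate(actual, row.expectedOutput);
      }
      results.push({ input: row.input, chatflow, actual, scores });
    }
  }
  return results; // one entry per (row, chatflow) pair
}
```

Because every chatflow sees identical inputs and evaluators, the resulting matrix supports direct side-by-side comparison of chatflow quality.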
