
Principle:FlowiseAI Flowise Evaluation Dataset Creation

From Leeroopedia
Principle Name: Evaluation_Dataset_Creation
Overview: Technique for creating structured input-output test datasets for systematic evaluation of AI chatflow quality
Domain: AI Evaluation, Test Suite Design, Quality Assurance
Source: FlowiseAI/Flowise repository: packages/ui/src/api/dataset.js
Last Updated: 2026-02-12 14:00 GMT

Description

Evaluation datasets contain pairs of input prompts and expected outputs that serve as ground truth for measuring chatflow performance. Datasets can be created manually (row by row) or via bulk CSV upload. Each row has an input (the question to send) and expectedOutput (the correct or desired answer). These datasets are the foundation for automated and LLM-graded evaluation.

The dataset creation process involves two stages:

  • Dataset creation: Define a named dataset container with an optional description and CSV upload configuration.
  • Row creation: Populate the dataset with individual input/output pairs, either one at a time or in bulk through CSV import.

Once a dataset is populated, it can be reused across multiple evaluation runs, enabling consistent testing conditions for different chatflows and evaluator configurations.
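The two-stage flow above can be sketched in plain JavaScript. This is an illustrative model only: the function names (createDataset, addRow, importCsv) and object shapes are assumptions for the sketch, not the actual Flowise API exposed by dataset.js.

```javascript
// Hypothetical sketch of the two-stage dataset creation flow.
// Stage 1: create a named dataset container.
function createDataset(name, description = '') {
  return { name, description, rows: [] };
}

// Stage 2a: populate it one row at a time.
function addRow(dataset, input, expectedOutput) {
  dataset.rows.push({ input, expectedOutput });
  return dataset;
}

// Stage 2b: populate it in bulk. Each CSV line is "input,expectedOutput"
// (no embedded commas, for simplicity of the sketch).
function importCsv(dataset, csvText) {
  for (const line of csvText.trim().split('\n')) {
    const [input, expectedOutput] = line.split(',');
    addRow(dataset, input.trim(), expectedOutput.trim());
  }
  return dataset;
}

const ds = createDataset('capitals', 'Geography Q&A test suite');
addRow(ds, 'What is the capital of France?', 'Paris');
importCsv(ds, 'Capital of Japan?,Tokyo\nCapital of Kenya?,Nairobi');
console.log(ds.rows.length); // 3
```

Because the dataset is just a container of input/expectedOutput rows, the same object can be handed to any number of evaluation runs afterward, which is what makes reuse across chatflows possible.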

Usage

Use evaluation dataset creation when building a test suite for evaluating chatflow response quality and accuracy. This is the first step in the Evaluation Pipeline workflow:

  • Create a dataset to define the scope of testing
  • Add rows representing distinct test cases with expected outputs
  • Reference the dataset when configuring evaluation runs
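The third step, referencing the dataset from an evaluation run, can be sketched as follows. The runEvaluation helper, the runChatflow callback, and the result shape are hypothetical names for illustration; the strict-equality pass check stands in for whatever evaluator is configured.

```javascript
// Hypothetical sketch: an evaluation run iterates a dataset's rows,
// sends each input to a chatflow, and records actual vs. expected output.
function runEvaluation(dataset, runChatflow) {
  return dataset.rows.map(({ input, expectedOutput }) => {
    const actualOutput = runChatflow(input);
    return {
      input,
      expectedOutput,
      actualOutput,
      pass: actualOutput === expectedOutput, // placeholder for a real evaluator
    };
  });
}

// Stub chatflow that always answers 'Paris'.
const results = runEvaluation(
  { rows: [{ input: 'Capital of France?', expectedOutput: 'Paris' }] },
  () => 'Paris'
);
console.log(results[0].pass); // true
```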

Theoretical Basis

This principle follows the test suite design pattern for AI systems. Unlike traditional unit tests with deterministic assertions, AI evaluation datasets capture intent through expected outputs that are compared using fuzzy matching, semantic similarity, or LLM-based grading.

Key characteristics of evaluation datasets:

  • Input-output pairing: Each test case binds a prompt to a reference answer, establishing the ground truth for comparison.
  • Non-deterministic evaluation: Because LLM outputs vary, expected outputs are treated as reference points rather than exact matches. Evaluators apply flexible comparison strategies (text matching, semantic similarity, LLM grading).
  • Reusability: A single dataset can be applied across multiple chatflows and evaluation runs, ensuring consistent test conditions.
  • Scalability: CSV bulk upload supports rapid construction of large test suites covering diverse scenarios.
  • Versioned testing: The same dataset used across evaluation re-runs ensures that improvements or regressions are measured against a fixed baseline.
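The flexible comparison strategies mentioned above can be illustrated with two simple graders. The token-overlap (Jaccard) metric here is a lightweight stand-in for semantic similarity; real evaluators may use embeddings or LLM-based grading instead, and both function names are assumptions for this sketch.

```javascript
// Strategy 1: case- and whitespace-insensitive exact match.
function exactMatch(expected, actual) {
  return expected.trim().toLowerCase() === actual.trim().toLowerCase();
}

// Strategy 2: token-overlap (Jaccard) similarity in [0, 1], a crude
// stand-in for semantic similarity between expected and actual outputs.
function tokenOverlap(expected, actual) {
  const a = new Set(expected.toLowerCase().split(/\s+/));
  const b = new Set(actual.toLowerCase().split(/\s+/));
  const shared = [...a].filter((t) => b.has(t)).length;
  return shared / new Set([...a, ...b]).size;
}

console.log(exactMatch('Paris', ' paris ')); // true
console.log(tokenOverlap('The capital is Paris', 'Paris is the capital')); // 1
```

Treating the expected output as a reference point rather than an exact string is what lets the same dataset grade non-deterministic LLM responses: a graded score above a threshold can count as a pass even when the wording differs.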
