Principle: FlowiseAI Flowise Evaluation Dataset Creation
| Property | Value |
|---|---|
| Principle Name | Evaluation_Dataset_Creation |
| Overview | Technique for creating structured input-output test datasets for systematic evaluation of AI chatflow quality |
| Domain | AI Evaluation, Test Suite Design, Quality Assurance |
| Source | FlowiseAI/Flowise repository: packages/ui/src/api/dataset.js |
| Last Updated | 2026-02-12 14:00 GMT |
Description
Evaluation datasets contain pairs of input prompts and expected outputs that serve as ground truth for measuring chatflow performance. Datasets can be created manually (row by row) or via bulk CSV upload. Each row pairs an input (the prompt to send to the chatflow) with an expectedOutput (the correct or desired answer). These datasets are the foundation for automated and LLM-graded evaluation.
The dataset creation process involves two stages:
- Dataset creation: Define a named dataset container with an optional description and CSV upload configuration.
- Row creation: Populate the dataset with individual input/output pairs, either one at a time or in bulk through CSV import.
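The two stages above can be sketched as an in-memory model. The object shapes here are illustrative assumptions; the actual client in packages/ui/src/api/dataset.js issues HTTP requests to the Flowise server rather than building objects locally.

```javascript
// Stage 1: create a named dataset container with an optional description.
function createDataset(name, description = '') {
  return { name, description, rows: [] };
}

// Stage 2: populate the dataset one input/output pair at a time.
function addRow(dataset, input, expectedOutput) {
  dataset.rows.push({ input, expectedOutput });
  return dataset;
}

const qa = createDataset('capital-cities', 'Basic geography checks');
addRow(qa, 'What is the capital of France?', 'Paris');
addRow(qa, 'What is the capital of Japan?', 'Tokyo');
console.log(qa.rows.length); // 2
```

Keeping rows as plain `{ input, expectedOutput }` objects mirrors the row structure the Description above defines, so the same shape works for manual entry and bulk import.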
Once a dataset is populated, it can be reused across multiple evaluation runs, enabling consistent testing conditions for different chatflows and evaluator configurations.
Usage
Use evaluation dataset creation when building a test suite for evaluating chatflow response quality and accuracy. This is the first step in the Evaluation Pipeline workflow:
- Create a dataset to define the scope of testing
- Add rows representing distinct test cases with expected outputs
- Reference the dataset when configuring evaluation runs
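For step 2, bulk CSV import can be sketched as below. This assumes a simple two-column file with an `input,expectedOutput` header and no quoted fields; a production importer would need a real CSV parser to handle embedded commas and quotes.

```javascript
// Parse a simple two-column CSV into dataset rows.
// Assumes header "input,expectedOutput" and no quoted/escaped fields.
function parseDatasetCsv(csvText) {
  const [header, ...dataLines] = csvText.trim().split('\n');
  if (header.trim() !== 'input,expectedOutput') {
    throw new Error('Expected header: input,expectedOutput');
  }
  return dataLines.map((line) => {
    const [input, expectedOutput] = line.split(',');
    return { input: input.trim(), expectedOutput: expectedOutput.trim() };
  });
}

const csv = `input,expectedOutput
What is 2+2?,4
Capital of Italy?,Rome`;

const rows = parseDatasetCsv(csv);
console.log(rows.length); // 2
```

Validating the header up front catches column-order mistakes before hundreds of malformed rows are silently imported.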
Theoretical Basis
This principle follows the test suite design pattern for AI systems. Unlike traditional unit tests with deterministic assertions, AI evaluation datasets capture intent through expected outputs that are compared using fuzzy matching, semantic similarity, or LLM-based grading.
Key characteristics of evaluation datasets:
- Input-output pairing: Each test case binds a prompt to a reference answer, establishing the ground truth for comparison.
- Non-deterministic evaluation: Because LLM outputs vary, expected outputs are treated as reference points rather than exact matches. Evaluators apply flexible comparison strategies (text matching, semantic similarity, LLM grading).
- Reusability: A single dataset can be applied across multiple chatflows and evaluation runs, ensuring consistent test conditions.
- Scalability: CSV bulk upload supports rapid construction of large test suites covering diverse scenarios.
- Versioned testing: Reusing the same dataset across evaluation re-runs ensures that improvements or regressions are measured against a fixed baseline.