Principle: FlowiseAI Flowise Evaluation Dataset Creation
| Property | Value |
|---|---|
| Principle Name | Evaluation_Dataset_Creation |
| Overview | Technique for creating structured input-output test datasets for systematic evaluation of AI chatflow quality |
| Domain | AI Evaluation, Test Suite Design, Quality Assurance |
| Source | FlowiseAI/Flowise repository: packages/ui/src/api/dataset.js |
| Last Updated | 2026-02-12 14:00 GMT |
Description
Evaluation datasets contain pairs of input prompts and expected outputs that serve as ground truth for measuring chatflow performance. Datasets can be created manually (row by row) or via bulk CSV upload. Each row pairs an input (the prompt to send to the chatflow) with an expectedOutput (the correct or desired answer). These datasets are the foundation for automated and LLM-graded evaluation.
The dataset creation process involves two stages:
- Dataset creation: Define a named dataset container with an optional description and CSV upload configuration.
- Row creation: Populate the dataset with individual input/output pairs, either one at a time or in bulk through CSV import.
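The two stages above can be sketched as an in-memory model. The object shapes here are illustrative assumptions; the actual client in packages/ui/src/api/dataset.js issues HTTP requests to the Flowise server rather than building objects locally.

```javascript
// Stage 1: create a named dataset container with an optional description.
function createDataset(name, description = '') {
  return { name, description, rows: [] };
}

// Stage 2: populate the dataset one input/output pair at a time.
function addRow(dataset, input, expectedOutput) {
  dataset.rows.push({ input, expectedOutput });
  return dataset;
}

const qa = createDataset('capital-cities', 'Basic geography checks');
addRow(qa, 'What is the capital of France?', 'Paris');
addRow(qa, 'What is the capital of Japan?', 'Tokyo');
console.log(qa.rows.length); // 2
```

Keeping rows as plain `{ input, expectedOutput }` objects mirrors the row structure the Description above defines, so the same shape works for manual entry and bulk import.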
Once a dataset is populated, it can be reused across multiple evaluation runs, enabling consistent testing conditions for different chatflows and evaluator configurations.
Usage
Use evaluation dataset creation when building a test suite for evaluating chatflow response quality and accuracy. This is the first step in the Evaluation Pipeline workflow:
- Create a dataset to define the scope of testing
- Add rows representing distinct test cases with expected outputs
- Reference the dataset when configuring evaluation runs
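For step 2, bulk CSV import can be sketched as below. This assumes a simple two-column file with an `input,expectedOutput` header and no quoted fields; a production importer would need a real CSV parser to handle embedded commas and quotes.

```javascript
// Parse a simple two-column CSV into dataset rows.
// Assumes header "input,expectedOutput" and no quoted/escaped fields.
function parseDatasetCsv(csvText) {
  const [header, ...dataLines] = csvText.trim().split('\n');
  if (header.trim() !== 'input,expectedOutput') {
    throw new Error('Expected header: input,expectedOutput');
  }
  return dataLines.map((line) => {
    const [input, expectedOutput] = line.split(',');
    return { input: input.trim(), expectedOutput: expectedOutput.trim() };
  });
}

const csv = `input,expectedOutput
What is 2+2?,4
Capital of Italy?,Rome`;

const rows = parseDatasetCsv(csv);
console.log(rows.length); // 2
```

Validating the header up front catches column-order mistakes before hundreds of malformed rows are silently imported.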
Theoretical Basis
This principle follows the test suite design pattern for AI systems. Unlike traditional unit tests with deterministic assertions, AI evaluation datasets capture intent through expected outputs that are compared using fuzzy matching, semantic similarity, or LLM-based grading.
Key characteristics of evaluation datasets:
- Input-output pairing: Each test case binds a prompt to a reference answer, establishing the ground truth for comparison.
- Non-deterministic evaluation: Because LLM outputs vary, expected outputs are treated as reference points rather than exact matches. Evaluators apply flexible comparison strategies (text matching, semantic similarity, LLM grading).
- Reusability: A single dataset can be applied across multiple chatflows and evaluation runs, ensuring consistent test conditions.
- Scalability: CSV bulk upload supports rapid construction of large test suites covering diverse scenarios.
- Versioned testing: Reusing the same dataset across evaluation re-runs ensures that improvements or regressions are measured against a fixed baseline.