Principle: FlowiseAI Flowise Evaluation Results Analysis
| Property | Value |
|---|---|
| Principle Name | Evaluation_Results_Analysis |
| Overview | Technique for analyzing evaluation results through metrics dashboards, per-row comparisons, and visual charts |
| Domain | AI Evaluation, Data Visualization, Quality Analysis |
| Source | FlowiseAI/Flowise repository: packages/ui/src/api/evaluations.js, packages/ui/src/views/evaluations/EvaluationResult.jsx |
| Last Updated | 2026-02-12 14:00 GMT |
Description
After an evaluation run completes, results are presented through a multi-faceted dashboard that enables comprehensive quality analysis. The results view provides three levels of insight:
Summary Metrics Cards
Aggregate metrics displayed as prominent cards at the top of the results view:
- Pass count: Number of dataset rows where all evaluators passed.
- Fail count: Number of dataset rows where one or more evaluators failed.
- Error count: Number of dataset rows where evaluation encountered an error.
- Average latency: Mean API response time across all rows.
- Average cost: Mean token usage cost across all rows.
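The aggregation behind these cards can be sketched as follows. This is an illustrative reduction over per-row results, not Flowise's actual implementation; field names such as `evaluatorResults`, `latencyMs`, and `totalCost` are assumptions, not the exact schema.

```javascript
// Sketch: fold per-row evaluation results into the summary-card metrics.
// A row counts as "pass" only if every evaluator passed (matching the
// pass/fail definitions above); errored rows are counted separately.
function summarizeResults(rows) {
    const summary = { pass: 0, fail: 0, error: 0, avgLatency: 0, avgCost: 0 }
    for (const row of rows) {
        if (row.error) summary.error += 1
        else if (row.evaluatorResults.every((r) => r.pass)) summary.pass += 1
        else summary.fail += 1
        summary.avgLatency += row.latencyMs
        summary.avgCost += row.totalCost
    }
    if (rows.length > 0) {
        summary.avgLatency /= rows.length
        summary.avgCost /= rows.length
    }
    return summary
}
```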
Visual Charts
Interactive charts provide visual patterns and trends:
- Pass/Fail pie chart (ChartPassPrnt): Proportional breakdown of pass, fail, and error results.
- Latency line chart (ChartLatency): Response time distribution across dataset rows.
- Token usage bar chart (ChartTokens): Prompt and completion token consumption per row.
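A chart component typically expects its data in a flat series shape rather than raw evaluation rows. The helpers below sketch that transformation under assumed field names (`pass`, `fail`, `error`, `promptTokens`, `completionTokens`); the actual data shape consumed by the Flowise chart components may differ.

```javascript
// Sketch: shape summary counts into a pie-chart series
// (one slice per outcome category).
function toPassFailSeries(summary) {
    return [
        { label: 'Pass', value: summary.pass },
        { label: 'Fail', value: summary.fail },
        { label: 'Error', value: summary.error }
    ]
}

// Sketch: shape per-row token usage into bar-chart entries,
// one bar group per dataset row.
function toTokenBars(rows) {
    return rows.map((row, i) => ({
        x: `Row ${i + 1}`,
        promptTokens: row.promptTokens,
        completionTokens: row.completionTokens
    }))
}
```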
Per-Row Detail Table
A comprehensive table showing individual results for each dataset row:
- Input: The original question or prompt from the dataset.
- Expected output: The ground truth answer from the dataset.
- Actual output: The chatflow's response (shown per chatflow when multiple chatflows are compared).
- Evaluator results: Individual pass/fail status for each evaluator applied.
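One detail-table row can be modeled as below, with one actual-output cell per compared chatflow. This is a hypothetical shape for illustration; the names `actualOutputs`, `evaluatorResults`, and `chatflowNames` are assumptions rather than the Flowise table's real props.

```javascript
// Sketch: flatten one evaluation row into the four detail-table columns,
// pairing each actual output with the chatflow that produced it.
function toTableRow(row, chatflowNames) {
    return {
        input: row.input,
        expectedOutput: row.expectedOutput,
        actualOutputs: chatflowNames.map((name, i) => ({
            chatflow: name,
            output: row.actualOutputs[i]
        })),
        evaluators: row.evaluatorResults.map((r) => ({
            name: r.name,
            passed: r.pass
        }))
    }
}
```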
Usage
Use evaluation results analysis to assess the quality of chatflow responses after an evaluation run. The multi-level view supports both high-level quality monitoring and detailed root cause analysis of individual failures.
Theoretical Basis
This principle follows a multi-dimensional metrics visualization approach. The design employs three complementary analysis strategies:
- Aggregate metrics provide high-level quality signals, enabling quick assessment of overall chatflow health. Pass/fail ratios and average metrics serve as leading indicators.
- Visual charts reveal patterns and distributions that aggregate numbers obscure. Latency spikes, token usage outliers, and pass/fail proportions become immediately apparent through visual representation.
- Per-row drill-down enables root cause analysis of failures. When aggregate metrics indicate a problem, the detail table allows identification of specific inputs that cause failures, guiding targeted chatflow improvements.
- Multi-chatflow comparison enables A/B testing by displaying results for multiple chatflows side-by-side under identical test conditions.
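The drill-down and A/B comparison strategies above can be sketched as two small helpers. Field names (`error`, `evaluatorResults`, `chatflowId`) are illustrative assumptions about the row shape, not the Flowise API.

```javascript
// Root cause analysis: isolate the rows where at least one
// evaluator failed (excluding rows that errored outright).
function failingRows(rows) {
    return rows.filter(
        (row) => !row.error && row.evaluatorResults.some((r) => !r.pass)
    )
}

// A/B comparison: pass rate per chatflow over the same dataset,
// so two chatflows can be ranked under identical test conditions.
function passRateByChatflow(rows) {
    const stats = {}
    for (const row of rows) {
        const s = (stats[row.chatflowId] ??= { pass: 0, total: 0 })
        s.total += 1
        if (!row.error && row.evaluatorResults.every((r) => r.pass)) s.pass += 1
    }
    return Object.fromEntries(
        Object.entries(stats).map(([id, s]) => [id, s.pass / s.total])
    )
}
```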
The combination of these three levels follows the Overview-Zoom-Detail visualization pattern, where users start with a high-level summary and progressively drill into specifics as needed.