Principle: FlowiseAI Flowise Evaluation Results Analysis
| Property | Value |
|---|---|
| Principle Name | Evaluation_Results_Analysis |
| Overview | Technique for analyzing evaluation results through metrics dashboards, per-row comparisons, and visual charts |
| Domain | AI Evaluation, Data Visualization, Quality Analysis |
| Source | FlowiseAI/Flowise repository: packages/ui/src/api/evaluations.js, packages/ui/src/views/evaluations/EvaluationResult.jsx |
| Last Updated | 2026-02-12 14:00 GMT |
Description
After an evaluation run completes, results are presented through a multi-faceted dashboard that enables comprehensive quality analysis. The results view provides three levels of insight:
Summary Metrics Cards
Aggregate metrics displayed as prominent cards at the top of the results view:
- Pass count: Number of dataset rows where all evaluators passed.
- Fail count: Number of dataset rows where one or more evaluators failed.
- Error count: Number of dataset rows where evaluation encountered an error.
- Average latency: Mean API response time across all rows.
- Average cost: Mean token usage cost across all rows.
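The aggregation behind these cards can be sketched as follows. This is an illustrative reduction over per-row results, not Flowise's actual implementation; field names such as `evaluatorResults`, `latencyMs`, and `totalCost` are assumptions, not the exact schema.

```javascript
// Sketch: fold per-row evaluation results into the summary-card metrics.
// A row counts as "pass" only if every evaluator passed (matching the
// pass/fail definitions above); errored rows are counted separately.
function summarizeResults(rows) {
    const summary = { pass: 0, fail: 0, error: 0, avgLatency: 0, avgCost: 0 }
    for (const row of rows) {
        if (row.error) summary.error += 1
        else if (row.evaluatorResults.every((r) => r.pass)) summary.pass += 1
        else summary.fail += 1
        summary.avgLatency += row.latencyMs
        summary.avgCost += row.totalCost
    }
    if (rows.length > 0) {
        summary.avgLatency /= rows.length
        summary.avgCost /= rows.length
    }
    return summary
}
```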
Visual Charts
Interactive charts provide visual patterns and trends:
- Pass/Fail pie chart (ChartPassPrnt): Proportional breakdown of pass, fail, and error results.
- Latency line chart (ChartLatency): Response time distribution across dataset rows.
- Token usage bar chart (ChartTokens): Prompt and completion token consumption per row.
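A chart component typically expects its data in a flat series shape rather than raw evaluation rows. The helpers below sketch that transformation under assumed field names (`pass`, `fail`, `error`, `promptTokens`, `completionTokens`); the actual data shape consumed by the Flowise chart components may differ.

```javascript
// Sketch: shape summary counts into a pie-chart series
// (one slice per outcome category).
function toPassFailSeries(summary) {
    return [
        { label: 'Pass', value: summary.pass },
        { label: 'Fail', value: summary.fail },
        { label: 'Error', value: summary.error }
    ]
}

// Sketch: shape per-row token usage into bar-chart entries,
// one bar group per dataset row.
function toTokenBars(rows) {
    return rows.map((row, i) => ({
        x: `Row ${i + 1}`,
        promptTokens: row.promptTokens,
        completionTokens: row.completionTokens
    }))
}
```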
Per-Row Detail Table
A comprehensive table showing individual results for each dataset row:
- Input: The original question or prompt from the dataset.
- Expected output: The ground truth answer from the dataset.
- Actual output: The chatflow's response (shown per chatflow when multiple chatflows are compared).
- Evaluator results: Individual pass/fail status for each evaluator applied.
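One detail-table row can be modeled as below, with one actual-output cell per compared chatflow. This is a hypothetical shape for illustration; the names `actualOutputs`, `evaluatorResults`, and `chatflowNames` are assumptions rather than the Flowise table's real props.

```javascript
// Sketch: flatten one evaluation row into the four detail-table columns,
// pairing each actual output with the chatflow that produced it.
function toTableRow(row, chatflowNames) {
    return {
        input: row.input,
        expectedOutput: row.expectedOutput,
        actualOutputs: chatflowNames.map((name, i) => ({
            chatflow: name,
            output: row.actualOutputs[i]
        })),
        evaluators: row.evaluatorResults.map((r) => ({
            name: r.name,
            passed: r.pass
        }))
    }
}
```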
Usage
Use evaluation results analysis to assess the quality of chatflow responses after an evaluation run. The multi-level view supports both high-level quality monitoring and detailed root cause analysis of individual failures.
Theoretical Basis
This principle follows a multi-dimensional metrics visualization approach. The design employs three complementary analysis strategies:
- Aggregate metrics provide high-level quality signals, enabling quick assessment of overall chatflow health. Pass/fail ratios and average metrics serve as leading indicators.
- Visual charts reveal patterns and distributions that aggregate numbers obscure. Latency spikes, token usage outliers, and pass/fail proportions become immediately apparent through visual representation.
- Per-row drill-down enables root cause analysis of failures. When aggregate metrics indicate a problem, the detail table allows identification of specific inputs that cause failures, guiding targeted chatflow improvements.
- Multi-chatflow comparison enables A/B testing by displaying results for multiple chatflows side-by-side under identical test conditions.
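The drill-down and A/B comparison strategies above can be sketched as two small helpers. Field names (`error`, `evaluatorResults`, `chatflowId`) are illustrative assumptions about the row shape, not the Flowise API.

```javascript
// Root cause analysis: isolate the rows where at least one
// evaluator failed (excluding rows that errored outright).
function failingRows(rows) {
    return rows.filter(
        (row) => !row.error && row.evaluatorResults.some((r) => !r.pass)
    )
}

// A/B comparison: pass rate per chatflow over the same dataset,
// so two chatflows can be ranked under identical test conditions.
function passRateByChatflow(rows) {
    const stats = {}
    for (const row of rows) {
        const s = (stats[row.chatflowId] ??= { pass: 0, total: 0 })
        s.total += 1
        if (!row.error && row.evaluatorResults.every((r) => r.pass)) s.pass += 1
    }
    return Object.fromEntries(
        Object.entries(stats).map(([id, s]) => [id, s.pass / s.total])
    )
}
```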
The combination of these three levels follows the Overview-Zoom-Detail visualization pattern, where users start with a high-level summary and progressively drill into specifics as needed.