
Principle:FlowiseAI Flowise Evaluation Results Analysis

From Leeroopedia
Principle Name: Evaluation_Results_Analysis
Overview: Technique for analyzing evaluation results through metrics dashboards, per-row comparisons, and visual charts
Domain: AI Evaluation, Data Visualization, Quality Analysis
Source: FlowiseAI/Flowise repository: packages/ui/src/api/evaluations.js, packages/ui/src/views/evaluations/EvaluationResult.jsx
Last Updated: 2026-02-12 14:00 GMT

Description

After an evaluation run completes, results are presented through a multi-faceted dashboard that enables comprehensive quality analysis. The results view provides three levels of insight:

Summary Metrics Cards

Aggregate metrics displayed as prominent cards at the top of the results view:

  • Pass count: Number of dataset rows where all evaluators passed.
  • Fail count: Number of dataset rows where one or more evaluators failed.
  • Error count: Number of dataset rows where evaluation encountered an error.
  • Average latency: Mean API response time across all rows.
  • Average cost: Mean token usage cost across all rows.
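The aggregation behind these cards can be sketched as a single pass over the result rows. This is a minimal illustration, not Flowise's actual implementation; the row field names (`status`, `latencyMs`, `totalCost`) are assumptions made for the example.

```javascript
// Sketch: computing summary-card metrics from evaluation result rows.
// Row shape ({ status, latencyMs, totalCost }) is illustrative, not
// Flowise's real schema.
function summarizeResults(rows) {
  const count = (status) => rows.filter((r) => r.status === status).length;
  const mean = (xs) =>
    xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;
  return {
    passCount: count("pass"),     // rows where every evaluator passed
    failCount: count("fail"),     // rows where at least one evaluator failed
    errorCount: count("error"),   // rows where evaluation itself errored
    avgLatencyMs: mean(rows.map((r) => r.latencyMs)),
    avgCost: mean(rows.map((r) => r.totalCost)),
  };
}
```

A design note: deriving all five figures from the same row array keeps the cards consistent with the detail table, since both views read one data source.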

Visual Charts

Interactive charts provide visual patterns and trends:

  • Pass/Fail pie chart (ChartPassPrnt): Proportional breakdown of pass, fail, and error results.
  • Latency line chart (ChartLatency): Response time distribution across dataset rows.
  • Token usage bar chart (ChartTokens): Prompt and completion token consumption per row.
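Each chart consumes a different projection of the same result rows. The shaping step might look like the sketch below; the field names are assumptions for illustration and do not mirror the actual ChartPassPrnt, ChartLatency, or ChartTokens props.

```javascript
// Sketch: projecting result rows into per-chart series.
// Field names are illustrative assumptions.

// Pie chart: proportional pass/fail/error breakdown.
function toPieData(rows) {
  const counts = { pass: 0, fail: 0, error: 0 };
  for (const r of rows) counts[r.status] += 1;
  return Object.entries(counts).map(([label, value]) => ({ label, value }));
}

// Line chart: latency per dataset row, in row order.
function toLatencySeries(rows) {
  return rows.map((r, i) => ({ x: i + 1, y: r.latencyMs }));
}

// Bar chart: prompt vs. completion tokens per row.
function toTokenBars(rows) {
  return rows.map((r, i) => ({
    row: i + 1,
    promptTokens: r.promptTokens,
    completionTokens: r.completionTokens,
  }));
}
```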

Per-Row Detail Table

A comprehensive table showing individual results for each dataset row:

  • Input: The original question or prompt from the dataset.
  • Expected output: The ground truth answer from the dataset.
  • Actual output: The chatflow's response (shown per chatflow when multiple chatflows are compared).
  • Evaluator results: Individual pass/fail status for each evaluator applied.
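The key rule the table encodes is that a row passes only when every evaluator applied to it passes. A hedged sketch of that derivation, with illustrative object shapes rather than Flowise's real data model:

```javascript
// Sketch: building one detail-table row. A row's overall status is "pass"
// only if every evaluator passed; an execution error overrides both.
// Object shapes are illustrative assumptions.
function buildTableRow(row) {
  const allPassed = row.evaluators.every((e) => e.passed);
  return {
    input: row.input,                   // original dataset question/prompt
    expectedOutput: row.expectedOutput, // ground truth answer
    actualOutput: row.actualOutput,     // chatflow response
    evaluators: row.evaluators.map((e) => ({
      name: e.name,
      result: e.passed ? "pass" : "fail",
    })),
    status: row.error ? "error" : allPassed ? "pass" : "fail",
  };
}
```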

Usage

Use evaluation results analysis to assess the quality of chatflow responses after an evaluation run completes. The multi-level view supports both high-level quality monitoring and detailed root cause analysis of individual failures.

Theoretical Basis

This principle follows a multi-dimensional metrics visualization approach. The design employs three complementary analysis strategies:

  • Aggregate metrics provide high-level quality signals, enabling quick assessment of overall chatflow health. Pass/fail ratios and average metrics serve as leading indicators.
  • Visual charts reveal patterns and distributions that aggregate numbers obscure. Latency spikes, token usage outliers, and pass/fail proportions become immediately apparent through visual representation.
  • Per-row drill-down enables root cause analysis of failures. When aggregate metrics indicate a problem, the detail table allows identification of specific inputs that cause failures, guiding targeted chatflow improvements.
  • Multi-chatflow comparison enables A/B testing by displaying results for multiple chatflows side-by-side under identical test conditions.
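The side-by-side comparison point above amounts to joining two result sets by row position, since both chatflows ran against the same dataset. A minimal sketch, assuming the per-row shapes used earlier (not Flowise's actual API):

```javascript
// Sketch: pairing results from two chatflows, row by row, for A/B comparison.
// Assumes both arrays come from the same dataset in the same order; names
// are illustrative.
function compareChatflows(resultsA, resultsB) {
  return resultsA.map((a, i) => {
    const b = resultsB[i];
    return {
      input: a.input,
      expectedOutput: a.expectedOutput,
      outputs: { chatflowA: a.actualOutput, chatflowB: b.actualOutput },
      passed: {
        chatflowA: a.status === "pass",
        chatflowB: b.status === "pass",
      },
    };
  });
}
```

Because the inputs and expected outputs are identical across the two columns, any difference in pass/fail status is attributable to the chatflows themselves.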

The combination of these three levels follows the Overview-Zoom-Detail visualization pattern, where users start with a high-level summary and progressively drill into specifics as needed.
