Principle:CrewAIInc CrewAI Performance Testing
Overview
An evaluation framework that runs the crew multiple times and uses a separate LLM to score each task's output, producing a quantitative performance report.
Description
Performance Testing provides objective measurement of crew quality. The crew executes n times with consistent inputs, and an evaluation LLM (separate from the agents' LLMs) scores each task's output on a 1-10 scale. Results are aggregated into a table showing per-task scores, averages, execution times, and agent assignments across all runs.
This enables data-driven comparison of different configurations, prompts, or agent setups. Rather than relying on subjective human assessment of whether a crew is "good enough," Performance Testing produces quantitative metrics that can be tracked over time and compared across experiments.
The evaluation framework addresses several critical needs in multi-agent system development:
- Reproducibility — Running the crew multiple times with the same inputs reveals variance in output quality. A crew that produces excellent results in one run but poor results in another needs further refinement.
- Objectivity — Using a separate evaluation LLM reduces the inconsistency and bias that human reviewers introduce when assessing outputs across runs.
- Granularity — Scoring happens at the task level, not just at the crew level. This reveals which specific tasks or agents are underperforming, enabling targeted improvements.
- Efficiency — Automated scoring via an evaluation LLM is faster and more cost-effective than manual human evaluation for routine testing.
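In CrewAI this loop is exposed through `Crew.test()` (and the `crewai test` CLI). A minimal sketch of the call, assuming a configured crew and LLM credentials; the `n_iterations` and `eval_llm` parameters follow the documented signature in recent CrewAI versions (older releases used `openai_model_name`), and the `inputs` keys shown are hypothetical:

```python
def run_performance_test(crew) -> None:
    """Sketch: invoke CrewAI's built-in performance test.

    Runs the crew n times with the same inputs, scores each task's
    output with the (separate) evaluation LLM, and prints the
    aggregated report table. Requires the crewai package and API keys.
    """
    crew.test(
        n_iterations=3,                 # number of repeated runs
        eval_llm="gpt-4o-mini",         # evaluation model, separate from agent LLMs
        inputs={"topic": "AI agents"},  # hypothetical input keys for this crew
    )
```

The same test can be launched from a project directory with the CLI, e.g. `crewai test -n 3`, which wraps this method.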
Theoretical Basis
This principle is grounded in the LLM-as-Judge evaluation methodology, a technique where a separate language model evaluates the quality of another model's outputs. This approach has been validated in research (e.g., MT-Bench, Chatbot Arena) as providing evaluations that correlate well with human judgments while being significantly more scalable.
Key properties of LLM-as-Judge evaluation:
- Scalability — Unlike human evaluation, LLM-based scoring can be applied to thousands of outputs without fatigue or inconsistency
- Consistency — The same evaluation LLM applies the same criteria across all outputs, reducing inter-rater variability
- Cost-effectiveness — A single evaluation LLM call is cheaper than human reviewer time
- Customizability — The evaluation criteria can be adjusted by changing the evaluation prompt
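The judge side of this methodology can be sketched as two pure functions: a prompt builder that embeds the task description, expected output, and actual output, and a parser that extracts the 1-10 score from the judge model's reply. The prompt wording and reply format below are illustrative assumptions, not CrewAI's internal evaluation prompt:

```python
import re

def build_judge_prompt(description: str, expected: str, actual: str) -> str:
    """Assemble an LLM-as-Judge prompt requesting a 1-10 quality score."""
    return (
        "You are evaluating the output of an AI agent.\n"
        f"Task description: {description}\n"
        f"Expected output: {expected}\n"
        f"Actual output: {actual}\n"
        "Rate the actual output from 1 (poor) to 10 (excellent). "
        "Reply with 'Score: <number>'."
    )

def parse_score(reply: str) -> int:
    """Extract the integer score from the judge's reply, clamped to 1-10."""
    match = re.search(r"(\d+)", reply)
    if not match:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return max(1, min(10, int(match.group(1))))
```

Customizing the evaluation criteria then amounts to editing `build_judge_prompt`, without touching the rest of the pipeline.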
| Metric | Description | Use Case |
|---|---|---|
| Per-task score (1-10) | Quality rating for individual task output | Identify weak tasks/agents |
| Average score | Mean score across all iterations | Overall crew quality metric |
| Execution time | Wall-clock time per iteration | Performance optimization |
| Agent assignment | Which agent executed which task | Verify task routing |
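The aggregation behind the per-task and average-score rows above can be sketched as follows; the column layout loosely mirrors the report CrewAI prints, but the function and field names here are illustrative:

```python
from statistics import mean

def aggregate_scores(runs: list[dict[str, int]]) -> dict[str, float]:
    """Average each task's 1-10 scores across all iterations.

    `runs` holds one dict per iteration, mapping task name -> score.
    """
    tasks = runs[0].keys()
    return {task: mean(run[task] for run in runs) for task in tasks}

def format_report(averages: dict[str, float]) -> str:
    """Render per-task averages plus a crew-level mean as a small table."""
    lines = [f"{task:<20} {avg:>5.1f}" for task, avg in averages.items()]
    lines.append(f"{'crew average':<20} {mean(averages.values()):>5.1f}")
    return "\n".join(lines)
```

Per-task rows make weak tasks visible; the crew-average row is the single number tracked across experiments.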
Evaluation Process
The evaluation follows a structured process:
- Execute — Run the crew with provided inputs
- Collect — Gather each task's output along with its description and expected output
- Score — Send each task output to the evaluation LLM with scoring criteria
- Aggregate — Combine scores across iterations into a summary table
- Report — Display results in a formatted table for developer review
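The five steps above can be sketched as a single loop; the crew runner and judge are stubbed callables standing in for real crew execution and an evaluation-LLM call, so the names and signatures here are assumptions for illustration:

```python
from statistics import mean
from typing import Callable

def evaluate_crew(
    run_crew: Callable[[dict], dict[str, str]],  # Execute: inputs -> {task: output}
    judge: Callable[[str, str], int],            # Score: (task, output) -> 1-10
    inputs: dict,
    n_iterations: int,
) -> dict[str, float]:
    """Execute -> Collect -> Score -> Aggregate; returns per-task averages."""
    collected: dict[str, list[int]] = {}
    for _ in range(n_iterations):
        outputs = run_crew(inputs)                # Execute, Collect
        for task, output in outputs.items():
            score = judge(task, output)           # Score via evaluation LLM
            collected.setdefault(task, []).append(score)
    return {task: mean(scores) for task, scores in collected.items()}  # Aggregate
```

The Report step is then a matter of formatting the returned averages for developer review.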
Relationship to Workflow
Performance Testing serves as the validation step in the Crew Training and Testing workflow. After Training Execution produces improved agent behaviors, Performance Testing measures whether those improvements actually raise output scores across repeated runs. It also works with Baseline Crew Configuration to measure initial crew quality before training begins.
Implementation
Implementation:CrewAIInc_CrewAI_Crew_Test_Method