Principle:CrewAIInc CrewAI Performance Testing
Overview
An evaluation framework that runs the crew multiple times and uses a separate LLM to score each task's output, producing a quantitative performance report.
Description
Performance Testing provides objective measurement of crew quality. The crew executes n times with consistent inputs, and an evaluation LLM (separate from the agents' LLMs) scores each task's output on a 1-10 scale. Results are aggregated into a table showing per-task scores, averages, execution times, and agent assignments across all runs.
This enables data-driven comparison of different configurations, prompts, or agent setups. Rather than relying on subjective human assessment of whether a crew is "good enough," Performance Testing produces quantitative metrics that can be tracked over time and compared across experiments.
The evaluation framework addresses several critical needs in multi-agent system development:
- Reproducibility — Running the crew multiple times with the same inputs reveals variance in output quality. A crew that produces excellent results in one run but poor results in another needs further refinement.
- Objectivity — Using a separate evaluation LLM reduces the inconsistency and bias that human reviewers introduce when assessing outputs across runs.
- Granularity — Scoring happens at the task level, not just at the crew level. This reveals which specific tasks or agents are underperforming, enabling targeted improvements.
- Efficiency — Automated scoring via an evaluation LLM is faster and more cost-effective than manual human evaluation for routine testing.
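In CrewAI this loop is exposed through `Crew.test()` (and the `crewai test` CLI). A minimal sketch of the call, assuming a configured crew and LLM credentials; the `n_iterations` and `eval_llm` parameters follow the documented signature in recent CrewAI versions (older releases used `openai_model_name`), and the `inputs` keys shown are hypothetical:

```python
def run_performance_test(crew) -> None:
    """Sketch: invoke CrewAI's built-in performance test.

    Runs the crew n times with the same inputs, scores each task's
    output with the (separate) evaluation LLM, and prints the
    aggregated report table. Requires the crewai package and API keys.
    """
    crew.test(
        n_iterations=3,                 # number of repeated runs
        eval_llm="gpt-4o-mini",         # evaluation model, separate from agent LLMs
        inputs={"topic": "AI agents"},  # hypothetical input keys for this crew
    )
```

The same test can be launched from a project directory with the CLI, e.g. `crewai test -n 3`, which wraps this method.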
Theoretical Basis
This principle is grounded in the LLM-as-Judge evaluation methodology, a technique where a separate language model evaluates the quality of another model's outputs. This approach has been validated in research (e.g., MT-Bench, Chatbot Arena) as providing evaluations that correlate well with human judgments while being significantly more scalable.
Key properties of LLM-as-Judge evaluation:
- Scalability — Unlike human evaluation, LLM-based scoring can be applied to thousands of outputs without fatigue or inconsistency
- Consistency — The same evaluation LLM applies the same criteria across all outputs, reducing inter-rater variability
- Cost-effectiveness — A single evaluation LLM call is cheaper than human reviewer time
- Customizability — The evaluation criteria can be adjusted by changing the evaluation prompt
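The judge side of this methodology can be sketched as two pure functions: a prompt builder that embeds the task description, expected output, and actual output, and a parser that extracts the 1-10 score from the judge model's reply. The prompt wording and reply format below are illustrative assumptions, not CrewAI's internal evaluation prompt:

```python
import re

def build_judge_prompt(description: str, expected: str, actual: str) -> str:
    """Assemble an LLM-as-Judge prompt requesting a 1-10 quality score."""
    return (
        "You are evaluating the output of an AI agent.\n"
        f"Task description: {description}\n"
        f"Expected output: {expected}\n"
        f"Actual output: {actual}\n"
        "Rate the actual output from 1 (poor) to 10 (excellent). "
        "Reply with 'Score: <number>'."
    )

def parse_score(reply: str) -> int:
    """Extract the integer score from the judge's reply, clamped to 1-10."""
    match = re.search(r"(\d+)", reply)
    if not match:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return max(1, min(10, int(match.group(1))))
```

Customizing the evaluation criteria then amounts to editing `build_judge_prompt`, without touching the rest of the pipeline.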
| Metric | Description | Use Case |
|---|---|---|
| Per-task score (1-10) | Quality rating for individual task output | Identify weak tasks/agents |
| Average score | Mean score across all iterations | Overall crew quality metric |
| Execution time | Wall-clock time per iteration | Performance optimization |
| Agent assignment | Which agent executed which task | Verify task routing |
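The aggregation behind the per-task and average-score rows above can be sketched as follows; the column layout loosely mirrors the report CrewAI prints, but the function and field names here are illustrative:

```python
from statistics import mean

def aggregate_scores(runs: list[dict[str, int]]) -> dict[str, float]:
    """Average each task's 1-10 scores across all iterations.

    `runs` holds one dict per iteration, mapping task name -> score.
    """
    tasks = runs[0].keys()
    return {task: mean(run[task] for run in runs) for task in tasks}

def format_report(averages: dict[str, float]) -> str:
    """Render per-task averages plus a crew-level mean as a small table."""
    lines = [f"{task:<20} {avg:>5.1f}" for task, avg in averages.items()]
    lines.append(f"{'crew average':<20} {mean(averages.values()):>5.1f}")
    return "\n".join(lines)
```

Per-task rows make weak tasks visible; the crew-average row is the single number tracked across experiments.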
Evaluation Process
The evaluation follows a structured process:
- Execute — Run the crew with provided inputs
- Collect — Gather each task's output along with its description and expected output
- Score — Send each task output to the evaluation LLM with scoring criteria
- Aggregate — Combine scores across iterations into a summary table
- Report — Display results in a formatted table for developer review
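The five steps above can be sketched as a single loop; the crew runner and judge are stubbed callables standing in for real crew execution and an evaluation-LLM call, so the names and signatures here are assumptions for illustration:

```python
from statistics import mean
from typing import Callable

def evaluate_crew(
    run_crew: Callable[[dict], dict[str, str]],  # Execute: inputs -> {task: output}
    judge: Callable[[str, str], int],            # Score: (task, output) -> 1-10
    inputs: dict,
    n_iterations: int,
) -> dict[str, float]:
    """Execute -> Collect -> Score -> Aggregate; returns per-task averages."""
    collected: dict[str, list[int]] = {}
    for _ in range(n_iterations):
        outputs = run_crew(inputs)                # Execute, Collect
        for task, output in outputs.items():
            score = judge(task, output)           # Score via evaluation LLM
            collected.setdefault(task, []).append(score)
    return {task: mean(scores) for task, scores in collected.items()}  # Aggregate
```

The Report step is then a matter of formatting the returned averages for developer review.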
Relationship to Workflow
Performance Testing serves as the validation step in the Crew Training and Testing workflow. After Training Execution produces improved agent behaviors, Performance Testing measures whether those improvements actually raise output scores across repeated runs. It also works with Baseline Crew Configuration to measure initial crew quality before training begins.
Implementation
Implementation:CrewAIInc_CrewAI_Crew_Test_Method