Workflow: CrewAIInc CrewAI Crew Training and Testing
| Knowledge Sources | Details |
|---|---|
| Domains | Multi_Agent_Systems, Quality_Assurance, Agent_Training |
| Last Updated | 2026-02-11 18:00 GMT |
Overview
End-to-end process for training CrewAI agents through iterative execution, testing crew performance with evaluation metrics, and replaying specific tasks for debugging and improvement.
Description
This workflow covers the training and quality assurance capabilities built into CrewAI. The train() method runs the crew multiple times, capturing successful execution patterns that agents can learn from in future runs. The test() method evaluates crew performance across multiple iterations using a configurable evaluation LLM that scores output quality. The replay() method resumes a previous run from a specific task, enabling targeted debugging and improvement of individual steps. Together, these capabilities form a feedback loop for iteratively improving agent performance on specific task types.
Usage
Execute this workflow when you need to improve agent output quality, benchmark crew performance, or debug specific task failures. Typical triggers include: agents produce inconsistent results that need standardization, you want to measure quality improvement over configuration changes, or a specific task in a crew run produced poor output that needs investigation.
Execution Steps
Step 1: Baseline Crew Configuration
Set up the crew with agents, tasks, and configuration as per the standard sequential or hierarchical workflow. Ensure the crew is functional with a known set of test inputs. Enable verbose logging to capture execution details. This baseline configuration serves as the starting point for training and testing.
Key considerations:
- Start with a working crew configuration before training
- Define consistent test inputs for reproducible results
- Enable memory to accumulate learning across training iterations
- Set verbose=True for detailed execution logs
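The baseline setup can be sketched as follows. This is a minimal illustration, assuming the crewai package is installed; the agent role, task text, and input topic are placeholders, not part of the original workflow.

```python
# Baseline crew sketch. Agent roles, task text, and inputs are illustrative.
from crewai import Agent, Task, Crew, Process

researcher = Agent(
    role="Research Analyst",
    goal="Summarize recent developments on a given topic",
    backstory="A meticulous analyst who always cites sources.",
    verbose=True,                 # detailed execution logs
)

summary_task = Task(
    description="Research and summarize the topic: {topic}",
    expected_output="A concise bullet-point summary",
    agent=researcher,
)

crew = Crew(
    agents=[researcher],
    tasks=[summary_task],
    process=Process.sequential,
    memory=True,                  # accumulate learning across runs
    verbose=True,
)

# Consistent test inputs for reproducible baseline runs
baseline_inputs = {"topic": "multi-agent evaluation"}
# result = crew.kickoff(inputs=baseline_inputs)
```

Keeping baseline_inputs fixed across the later training and testing steps is what makes before/after score comparisons meaningful.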
Step 2: Training Execution
Call crew.train(n_iterations=N, filename="training_data.pkl", inputs={...}) to run the crew multiple times and capture execution patterns. Each iteration runs the full task sequence, and CrewAI prompts for human feedback on agent outputs along the way; the collected execution data and feedback are saved to the specified pickle file. The training handler stores context about what approaches worked, enabling agents to reference successful patterns in future runs.
Key considerations:
- Higher iteration counts capture more diverse execution patterns
- Training data is persisted to a pickle file for reuse across sessions
- Each iteration uses the same inputs for consistency
- Training data includes task descriptions, outputs, and agent reasoning
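A minimal training call, assuming the crew object from the baseline step and a CrewAI version whose train() accepts these parameters (iteration count and filename here are illustrative choices):

```python
# Training sketch: runs the crew repeatedly and persists execution data.
# Note: CrewAI prompts for human feedback during training iterations.
crew.train(
    n_iterations=3,                    # more iterations capture more patterns
    filename="training_data.pkl",      # training data persisted for reuse
    inputs={"topic": "multi-agent evaluation"},
)
```

Because the pickle file persists across sessions, a trained crew can benefit from earlier training runs without repeating them.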
Step 3: Performance Testing
Call crew.test(n_iterations=N, eval_llm="gpt-4", inputs={...}) to evaluate crew performance. Each iteration runs the full crew and then uses the evaluation LLM to score the output quality. The test method aggregates scores across iterations, providing metrics on consistency, accuracy, and output quality. Results help identify which tasks or agents underperform.
Key considerations:
- Use a capable LLM (e.g., GPT-4) as the evaluator for accurate scoring
- Multiple iterations reveal consistency of agent performance
- Test results include per-task quality scores
- Compare test results before and after training to measure improvement
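A testing sketch follows. Parameter names vary by CrewAI version (some releases use openai_model_name rather than eval_llm), and the evaluator model name is illustrative:

```python
# Performance-testing sketch: runs the crew N times and scores each run
# with the evaluation LLM, then reports aggregate quality metrics.
crew.test(
    n_iterations=3,
    eval_llm="gpt-4o",   # use a capable model as the evaluator
    inputs={"topic": "multi-agent evaluation"},
)
```

Running test() once before training and once after, with identical inputs, gives a direct measure of whether training improved output quality.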
Step 4: Task Replay
Use crew.replay(task_id="...", inputs={...}) to re-execute a previous run starting from a specific task. The task ID identifies where in the task sequence to resume; context from tasks that completed before it is reused from the stored execution, so earlier tasks are not re-run. This enables targeted debugging: a problematic task can be re-run with modified inputs or agent configuration without starting the entire crew from scratch.
Key considerations:
- Task IDs are available from previous crew execution outputs; the crewai log-tasks-outputs CLI command lists them
- Replay uses the same agent and task configuration as the original run
- Modified inputs can be provided to test different scenarios
- Replay output can be compared against the original execution
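A replay sketch; the task ID below is a placeholder that would come from a prior run's output or from the crewai log-tasks-outputs command:

```python
# Replay sketch: resumes a previous crew run from the named task,
# optionally with modified inputs to test a different scenario.
crew.replay(
    task_id="abc123-task-id",   # placeholder; use a real ID from a prior run
    inputs={"topic": "multi-agent evaluation (revised)"},
)
```

Comparing the replay output against the original execution isolates whether the change in inputs or configuration actually fixed the problematic task.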
Step 5: Iterative Improvement
Analyze test results and replay outputs to identify improvement opportunities. Adjust agent backstories, task descriptions, tool assignments, or LLM models based on the findings. Re-run training and testing to validate improvements. This iterative cycle of train, test, analyze, and refine converges toward optimal crew performance for the target use case.
Key considerations:
- Focus improvements on the lowest-scoring tasks first
- Small changes to agent backstory can significantly affect output quality
- Tool additions often improve factual accuracy more than prompt changes
- Document each iteration's configuration and scores for comparison
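One lightweight way to honor the last consideration is a plain record of each iteration's configuration and test score, so regressions and wins are easy to spot. A framework-agnostic sketch; the class and field names are illustrative, not part of CrewAI:

```python
# Minimal tracker for the train/test/analyze/refine loop.
# Records each iteration's label, configuration, and aggregate test score.
from dataclasses import dataclass, field


@dataclass
class IterationRecord:
    label: str      # e.g. "baseline", "richer-backstory"
    config: dict    # crew settings used for this iteration
    score: float    # aggregate quality score from crew.test()


@dataclass
class ImprovementLog:
    records: list = field(default_factory=list)

    def add(self, label, config, score):
        self.records.append(IterationRecord(label, config, score))

    def best(self):
        # Highest-scoring configuration so far
        return max(self.records, key=lambda r: r.score)

    def lowest_first(self):
        # Work on the weakest configurations first
        return sorted(self.records, key=lambda r: r.score)


log = ImprovementLog()
log.add("baseline", {"memory": True}, 6.2)
log.add("richer-backstory", {"memory": True}, 7.1)
log.add("added-search-tool", {"memory": True}, 8.4)
print(log.best().label)  # -> added-search-tool
```

Note that in this hypothetical history the tool addition outscored the backstory tweak, matching the observation above that tool additions often improve accuracy more than prompt changes.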