Workflow: CrewAIInc CrewAI Crew Training and Testing
| Knowledge Sources | Details |
|---|---|
| Domains | Multi_Agent_Systems, Quality_Assurance, Agent_Training |
| Last Updated | 2026-02-11 18:00 GMT |
Overview
End-to-end process for training CrewAI agents through iterative execution, testing crew performance with evaluation metrics, and replaying specific tasks for debugging and improvement.
Description
This workflow covers the training and quality assurance capabilities built into CrewAI. The train() method runs the crew multiple times, capturing successful execution patterns that agents can learn from in future runs. The test() method evaluates crew performance across multiple iterations using a configurable evaluation LLM that scores output quality. The replay() method resumes a previous run from a specific task, enabling targeted debugging and improvement of individual steps. Together, these capabilities form a feedback loop for iteratively improving agent performance on specific task types.
Usage
Execute this workflow when you need to improve agent output quality, benchmark crew performance, or debug specific task failures. Typical triggers include: agents produce inconsistent results that need standardization, you want to measure quality improvement over configuration changes, or a specific task in a crew run produced poor output that needs investigation.
Execution Steps
Step 1: Baseline Crew Configuration
Set up the crew with agents, tasks, and configuration as per the standard sequential or hierarchical workflow. Ensure the crew is functional with a known set of test inputs. Enable verbose logging to capture execution details. This baseline configuration serves as the starting point for training and testing.
Key considerations:
- Start with a working crew configuration before training
- Define consistent test inputs for reproducible results
- Enable memory to accumulate learning across training iterations
- Set verbose=True for detailed execution logs
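The baseline setup can be sketched as follows. This is a minimal illustration, assuming the crewai package is installed; the agent role, task text, and input topic are placeholders, not part of the original workflow.

```python
# Baseline crew sketch. Agent roles, task text, and inputs are illustrative.
from crewai import Agent, Task, Crew, Process

researcher = Agent(
    role="Research Analyst",
    goal="Summarize recent developments on a given topic",
    backstory="A meticulous analyst who always cites sources.",
    verbose=True,                 # detailed execution logs
)

summary_task = Task(
    description="Research and summarize the topic: {topic}",
    expected_output="A concise bullet-point summary",
    agent=researcher,
)

crew = Crew(
    agents=[researcher],
    tasks=[summary_task],
    process=Process.sequential,
    memory=True,                  # accumulate learning across runs
    verbose=True,
)

# Consistent test inputs for reproducible baseline runs
baseline_inputs = {"topic": "multi-agent evaluation"}
# result = crew.kickoff(inputs=baseline_inputs)
```

Keeping baseline_inputs fixed across the later training and testing steps is what makes before/after score comparisons meaningful.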
Step 2: Training Execution
Call crew.train(n_iterations=N, filename="training_data.pkl", inputs={...}) to run the crew multiple times and capture execution patterns. Each iteration runs the full task sequence, and CrewAI prompts for human feedback on agent outputs along the way; the collected execution data and feedback are saved to the specified pickle file. The training handler stores context about what approaches worked, enabling agents to reference successful patterns in future runs.
Key considerations:
- Higher iteration counts capture more diverse execution patterns
- Training data is persisted to a pickle file for reuse across sessions
- Each iteration uses the same inputs for consistency
- Training data includes task descriptions, outputs, and agent reasoning
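A minimal training call, assuming the crew object from the baseline step and a CrewAI version whose train() accepts these parameters (iteration count and filename here are illustrative choices):

```python
# Training sketch: runs the crew repeatedly and persists execution data.
# Note: CrewAI prompts for human feedback during training iterations.
crew.train(
    n_iterations=3,                    # more iterations capture more patterns
    filename="training_data.pkl",      # training data persisted for reuse
    inputs={"topic": "multi-agent evaluation"},
)
```

Because the pickle file persists across sessions, a trained crew can benefit from earlier training runs without repeating them.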
Step 3: Performance Testing
Call crew.test(n_iterations=N, eval_llm="gpt-4", inputs={...}) to evaluate crew performance. Each iteration runs the full crew and then uses the evaluation LLM to score the output quality. The test method aggregates scores across iterations, providing metrics on consistency, accuracy, and output quality. Results help identify which tasks or agents underperform.
Key considerations:
- Use a capable LLM (e.g., GPT-4) as the evaluator for accurate scoring
- Multiple iterations reveal consistency of agent performance
- Test results include per-task quality scores
- Compare test results before and after training to measure improvement
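A testing sketch follows. Parameter names vary by CrewAI version (some releases use openai_model_name rather than eval_llm), and the evaluator model name is illustrative:

```python
# Performance-testing sketch: runs the crew N times and scores each run
# with the evaluation LLM, then reports aggregate quality metrics.
crew.test(
    n_iterations=3,
    eval_llm="gpt-4o",   # use a capable model as the evaluator
    inputs={"topic": "multi-agent evaluation"},
)
```

Running test() once before training and once after, with identical inputs, gives a direct measure of whether training improved output quality.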
Step 4: Task Replay
Use crew.replay(task_id="...", inputs={...}) to re-execute a previous run starting from a specific task. The task ID identifies where in the task sequence to resume; context from tasks that completed before it is reused from the stored execution, so earlier tasks are not re-run. This enables targeted debugging: a problematic task can be re-run with modified inputs or agent configuration without starting the entire crew from scratch.
Key considerations:
- Task IDs are available from previous crew execution outputs; the crewai log-tasks-outputs CLI command lists them
- Replay uses the same agent and task configuration as the original run
- Modified inputs can be provided to test different scenarios
- Replay output can be compared against the original execution
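A replay sketch; the task ID below is a placeholder that would come from a prior run's output or from the crewai log-tasks-outputs command:

```python
# Replay sketch: resumes a previous crew run from the named task,
# optionally with modified inputs to test a different scenario.
crew.replay(
    task_id="abc123-task-id",   # placeholder; use a real ID from a prior run
    inputs={"topic": "multi-agent evaluation (revised)"},
)
```

Comparing the replay output against the original execution isolates whether the change in inputs or configuration actually fixed the problematic task.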
Step 5: Iterative Improvement
Analyze test results and replay outputs to identify improvement opportunities. Adjust agent backstories, task descriptions, tool assignments, or LLM models based on the findings. Re-run training and testing to validate improvements. This iterative cycle of train, test, analyze, and refine converges toward optimal crew performance for the target use case.
Key considerations:
- Focus improvements on the lowest-scoring tasks first
- Small changes to agent backstory can significantly affect output quality
- Tool additions often improve factual accuracy more than prompt changes
- Document each iteration's configuration and scores for comparison
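One lightweight way to honor the last consideration is a plain record of each iteration's configuration and test score, so regressions and wins are easy to spot. A framework-agnostic sketch; the class and field names are illustrative, not part of CrewAI:

```python
# Minimal tracker for the train/test/analyze/refine loop.
# Records each iteration's label, configuration, and aggregate test score.
from dataclasses import dataclass, field


@dataclass
class IterationRecord:
    label: str      # e.g. "baseline", "richer-backstory"
    config: dict    # crew settings used for this iteration
    score: float    # aggregate quality score from crew.test()


@dataclass
class ImprovementLog:
    records: list = field(default_factory=list)

    def add(self, label, config, score):
        self.records.append(IterationRecord(label, config, score))

    def best(self):
        # Highest-scoring configuration so far
        return max(self.records, key=lambda r: r.score)

    def lowest_first(self):
        # Work on the weakest configurations first
        return sorted(self.records, key=lambda r: r.score)


log = ImprovementLog()
log.add("baseline", {"memory": True}, 6.2)
log.add("richer-backstory", {"memory": True}, 7.1)
log.add("added-search-tool", {"memory": True}, 8.4)
print(log.best().label)  # -> added-search-tool
```

Note that in this hypothetical history the tool addition outscored the backstory tweak, matching the observation above that tool additions often improve accuracy more than prompt changes.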