
Workflow:CrewAIInc CrewAI Crew Training And Testing

From Leeroopedia
Knowledge Sources
Domains: Multi_Agent_Systems, Quality_Assurance, Agent_Training
Last Updated: 2026-02-11 18:00 GMT

Overview

End-to-end process for training CrewAI agents through iterative execution, testing crew performance with evaluation metrics, and replaying specific tasks for debugging and improvement.

Description

This workflow covers the training and quality assurance capabilities built into CrewAI. The train() method runs the crew multiple times, capturing successful execution patterns that agents can learn from in future runs. The test() method evaluates crew performance across multiple iterations using a configurable evaluation LLM that scores output quality. The replay() method re-executes a specific task from a previous run, enabling targeted debugging and improvement of individual steps. Together, these capabilities form a feedback loop for iteratively improving agent performance on specific task types.

Usage

Execute this workflow when you need to improve agent output quality, benchmark crew performance, or debug specific task failures. Typical triggers: agents produce inconsistent results that need standardizing, you want to measure how configuration changes affect output quality, or a specific task in a crew run produced poor output that needs investigation.

Execution Steps

Step 1: Baseline Crew Configuration

Set up the crew with agents, tasks, and configuration as per the standard sequential or hierarchical workflow. Ensure the crew is functional with a known set of test inputs. Enable verbose logging to capture execution details. This baseline configuration serves as the starting point for training and testing.

Key considerations:

  • Start with a working crew configuration before training
  • Define consistent test inputs for reproducible results
  • Enable memory to accumulate learning across training iterations
  • Set verbose=True for detailed execution logs
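The baseline can be captured as a plain record of crew settings plus the fixed test inputs, so every later train/test run starts from the same point. This is a stdlib-only sketch; the field names are illustrative, not CrewAI's API:

```python
# Illustrative baseline record; field names are assumptions, not CrewAI's API.
baseline = {
    "process": "sequential",    # or "hierarchical"
    "verbose": True,            # detailed execution logs
    "memory": True,             # accumulate learning across iterations
    "inputs": {"topic": "LLM agent evaluation"},  # fixed, reproducible test inputs
}

def check_baseline(cfg: dict) -> bool:
    """Verify the baseline has everything needed for reproducible runs."""
    required = {"process", "verbose", "memory", "inputs"}
    return required <= cfg.keys() and bool(cfg["inputs"])
```

Running `check_baseline` before each training session catches a drifting configuration early, which keeps before/after score comparisons meaningful.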

Step 2: Training Execution

Call crew.train(n_iterations=N, filename="training_data.pkl", inputs={...}) to run the crew multiple times and capture execution patterns. Each iteration runs the full task sequence, and successful completions are saved to the specified pickle file. The training handler stores context about what approaches worked, enabling agents to reference successful patterns in future runs.

Key considerations:

  • Higher iteration counts capture more diverse execution patterns
  • Training data is persisted to a pickle file for reuse across sessions
  • Each iteration uses the same inputs for consistency
  • Training data includes task descriptions, outputs, and agent reasoning
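The persistence step can be sketched with a pickle round-trip. The record layout below is a guess at the kind of data described above (task descriptions, outputs, agent reasoning); the real pickle format is CrewAI-internal:

```python
import pickle
from pathlib import Path

# Hypothetical training records; the real pickle layout is CrewAI-internal.
records = [
    {
        "iteration": i,
        "task": "Summarize the research findings",
        "output": f"summary v{i}",
        "agent_reasoning": "used bullet-point structure",
    }
    for i in range(3)  # corresponds to n_iterations=3
]

# Persist once, reload in a later session.
path = Path("training_data.pkl")
path.write_bytes(pickle.dumps(records))
restored = pickle.loads(path.read_bytes())
```

Because the file survives the process, a crew trained in one session can reference the captured patterns in the next, which is what makes training cumulative rather than per-run.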

Step 3: Performance Testing

Call crew.test(n_iterations=N, eval_llm="gpt-4", inputs={...}) to evaluate crew performance. Each iteration runs the full crew and then uses the evaluation LLM to score the output quality. The test method aggregates scores across iterations, providing metrics on consistency, accuracy, and output quality. Results help identify which tasks or agents underperform.

Key considerations:

  • Use a capable LLM (e.g., GPT-4) as the evaluator for accurate scoring
  • Multiple iterations reveal consistency of agent performance
  • Test results include per-task quality scores
  • Compare test results before and after training to measure improvement
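The aggregation step can be modeled in a few lines: given per-task scores from the evaluator across iterations, compute a mean (quality) and spread (consistency) per task. The score values and output shape here are hypothetical; CrewAI's actual test() report format may differ:

```python
from statistics import mean, pstdev

# Hypothetical per-iteration scores (1-10) from the evaluation LLM.
scores = {
    "research_task": [7, 8, 7],
    "writing_task":  [4, 6, 3],
}

def summarize(task_scores: dict) -> dict:
    """Mean quality and spread per task; high spread means inconsistent output."""
    return {
        task: {"mean": mean(s), "spread": pstdev(s)}
        for task, s in task_scores.items()
    }

report = summarize(scores)
worst = min(report, key=lambda t: report[t]["mean"])  # lowest-scoring task
```

Sorting tasks by mean score surfaces the underperformers that Step 5 targets first; the spread flags tasks whose quality varies run to run even when the mean looks acceptable.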

Step 4: Task Replay

Use crew.replay(task_id="...", inputs={...}) to re-execute a specific task from a previous run. The task ID identifies the specific task execution to replay. This enables targeted debugging where a single problematic task can be re-run in isolation with modified inputs or agent configuration without re-executing the entire crew.

Key considerations:

  • Task IDs are available from previous crew execution outputs
  • Replay uses the same agent and task configuration as the original run
  • Modified inputs can be provided to test different scenarios
  • Replay output can be compared against the original execution
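The replay mechanic, stripped to its essentials, is a lookup by task ID plus an input override. This sketch only models the bookkeeping, not the agent re-execution; the task IDs and record shape are hypothetical:

```python
# Hypothetical store of a previous run's task outputs, keyed by task ID.
previous_run = {
    "a1b2": {"description": "Research the topic", "output": "notes..."},
    "c3d4": {"description": "Write the report", "output": "draft..."},
}

def replay(task_id, run, inputs=None):
    """Re-run a single task in isolation, optionally with modified inputs."""
    if task_id not in run:
        raise KeyError(f"unknown task_id: {task_id}")
    task = run[task_id]
    # A real replay would re-execute the agent; here we return the original
    # record plus the overridden inputs so old vs. new output can be compared.
    return {**task, "inputs": inputs or {}}

result = replay("c3d4", previous_run, inputs={"tone": "formal"})
```

The unknown-ID check mirrors why the task ID must come from a previous execution's output: replay is addressed at one recorded task, not at the crew definition.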

Step 5: Iterative Improvement

Analyze test results and replay outputs to identify improvement opportunities. Adjust agent backstories, task descriptions, tool assignments, or LLM models based on the findings. Re-run training and testing to validate improvements. This iterative cycle of train, test, analyze, and refine converges toward optimal crew performance for the target use case.

Key considerations:

  • Focus improvements on the lowest-scoring tasks first
  • Small changes to agent backstory can significantly affect output quality
  • Tool additions often improve factual accuracy more than prompt changes
  • Document each iteration's configuration and scores for comparison
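The compare-across-iterations bookkeeping can be sketched as a per-task score delta between two test runs. The scores below are invented for illustration:

```python
# Hypothetical per-task mean scores from test runs before and after a change.
before = {"research_task": 7.3, "writing_task": 4.3}
after  = {"research_task": 7.5, "writing_task": 6.8}

def improvement(before: dict, after: dict) -> dict:
    """Score delta per task; positive means the configuration change helped."""
    return {t: round(after[t] - before[t], 2) for t in before}

deltas = improvement(before, after)
regressions = [t for t, d in deltas.items() if d < 0]  # guard against regressions
```

Keeping one such record per iteration documents the train/test/analyze/refine loop and makes it obvious when a change that lifts one task quietly degrades another.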

Execution Diagram

GitHub URL

Workflow Repository