Principle:Explodinggradients Ragas Tool Call F1 Evaluation
Tool Call F1 Evaluation
Tool Call F1 Evaluation is the principle of measuring the precision and recall of tool calls made by an AI agent relative to a set of expected (reference) tool calls. Unlike order-dependent accuracy metrics, F1 evaluation treats tool calls as sets and assesses whether the agent made the correct calls regardless of their sequence.
Theoretical Foundation
Precision, Recall, and F1 for Tool Calls
When evaluating an AI agent's tool-calling behavior, two complementary questions arise:
- Precision: Of all the tool calls the agent made, how many were correct? This penalizes the agent for making unnecessary or incorrect tool calls.
- Recall: Of all the tool calls that should have been made, how many did the agent actually make? This penalizes the agent for missing required tool calls.
The F1 score is the harmonic mean of precision and recall:
F1 = 2 * (precision * recall) / (precision + recall)
This provides a single balanced metric that rewards agents for both making all necessary tool calls and avoiding unnecessary ones.
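The formula above can be sketched as a small function. The name `f1_score` is hypothetical, and returning 0.0 when precision and recall are both zero is an assumed convention to avoid division by zero, not something stated in the text.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero (assumed convention,
    chosen here only to avoid dividing by zero).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


# An agent whose calls were all correct (precision 1.0)
# but which made only half of the required calls (recall 0.5).
print(f1_score(1.0, 0.5))
```

For precision 1.0 and recall 0.5 the result is about 0.667, lower than the arithmetic mean of 0.75, because the harmonic mean is pulled toward the weaker of the two values.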
Set-Based Comparison
For F1 computation, each tool call is represented as a unique tuple of its name and arguments. Both predicted and reference tool calls are collected into sets, so a call repeated with an identical name and arguments counts only once. The evaluation then computes:
- True positives (TP): Tool calls present in both the predicted and reference sets
- False positives (FP): Tool calls present in the predicted set but not in the reference set
- False negatives (FN): Tool calls present in the reference set but not in the predicted set
From these counts, precision is TP / (TP + FP) and recall is TP / (TP + FN). This set-based approach is order-independent by design -- it does not matter in which sequence the agent made its tool calls, only whether the right set of calls was made.
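The set operations described here can be sketched in a few lines. This is a minimal illustration, not the Ragas implementation: the function name `tool_call_f1`, the tool names, and the zero-denominator fallback to 0.0 are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, Set, Tuple

# A tool call is modelled as (name, hashable arguments).
ToolCallKey = Tuple[str, FrozenSet]


def tool_call_f1(predicted: Set[ToolCallKey], reference: Set[ToolCallKey]) -> Dict[str, float]:
    """Compute precision, recall, and F1 from two sets of tool calls."""
    tp = len(predicted & reference)   # calls in both sets
    fp = len(predicted - reference)   # extra calls made by the agent
    fn = len(reference - predicted)   # required calls the agent missed

    # Falling back to 0.0 on an empty denominator is an assumed convention.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical tool calls: one shared, one extra, one missed.
predicted = {
    ("get_weather", frozenset({("city", "Paris")})),
    ("search_hotels", frozenset({("city", "Paris")})),
}
reference = {
    ("get_weather", frozenset({("city", "Paris")})),
    ("book_flight", frozenset({("destination", "Paris")})),
}

scores = tool_call_f1(predicted, reference)
print(scores)
```

With one true positive, one false positive, and one false negative, precision, recall, and F1 all equal 0.5.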
Hashable Representation
To enable set operations, tool call arguments (which may contain nested dictionaries or lists) must be converted to a hashable representation. Dictionary arguments become frozensets of key-value pairs, lists become tuples, and sets become frozensets, with the conversion applied recursively to nested values. This ensures that structurally identical tool calls are recognized as equal regardless of the order in which keys appear within dictionaries. Because lists become tuples, the order of list elements remains significant.
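A recursive conversion following the rules described above might look like the sketch below. The helper name `to_hashable` is hypothetical, and the example arguments are invented for illustration; values that are already hashable (strings, numbers, `None`) are assumed to pass through unchanged.

```python
from typing import Any


def to_hashable(value: Any) -> Any:
    """Recursively convert nested dicts, lists, and sets to hashable types."""
    if isinstance(value, dict):
        # Key order is discarded: a frozenset of (key, value) pairs is unordered.
        return frozenset((key, to_hashable(val)) for key, val in value.items())
    if isinstance(value, list):
        # Element order is preserved: tuples are ordered.
        return tuple(to_hashable(item) for item in value)
    if isinstance(value, set):
        return frozenset(to_hashable(item) for item in value)
    # Assumed to be hashable already (str, int, float, bool, None, ...).
    return value


# The same arguments written with different key order.
args_a = {"city": "Paris", "filters": {"tags": ["budget", "central"], "stars": 3}}
args_b = {"filters": {"stars": 3, "tags": ["budget", "central"]}, "city": "Paris"}

call_a = ("search_hotels", to_hashable(args_a))
call_b = ("search_hotels", to_hashable(args_b))

# Both tuples are hashable and compare equal, so a set keeps only one of them.
print(call_a == call_b, len({call_a, call_b}))
```

Two calls that differ only in dictionary key order collapse to a single set element, while reordering the elements of a list argument would produce a different tuple and therefore a different tool call.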
Relationship to Other Concepts
Tool Call F1 is complementary to Tool Call Accuracy Evaluation. While accuracy evaluates whether the correct tools were called in the correct order with the correct arguments (a stricter criterion), F1 provides a more lenient set-based view that is better suited for scenarios where:
- Tool calls can be executed in parallel
- The order of tool calls does not affect the outcome
- The evaluator wants to separately reason about missed vs. extra tool calls
Both metrics operate on multi-turn conversation samples containing reference tool calls.
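The difference between an order-dependent check and the set-based view can be illustrated with a toy comparison. The strict sequence equality used here is a deliberately simplified stand-in for an order-sensitive criterion, not the Tool Call Accuracy metric itself, and the tool names are hypothetical.

```python
# Reference calls in the expected order.
reference_calls = [
    ("get_weather", frozenset({("city", "Paris")})),
    ("search_flights", frozenset({("destination", "Paris")})),
]

# The agent made the same calls, but in the opposite order
# (for example, because they were issued in parallel).
predicted_calls = list(reversed(reference_calls))

# Order-dependent view: the sequences differ.
same_sequence = predicted_calls == reference_calls

# Set-based view: the same calls were made, so nothing is missed or extra.
same_set = set(predicted_calls) == set(reference_calls)

print(same_sequence, same_set)
```

Under the set-based view there are no false positives or false negatives in this case, so the reordered calls would score a perfect F1 even though a strict sequence comparison fails.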
Implemented By
- ToolCallF1 Metric -- the Ragas metric class that implements this evaluation principle
See Also
- Implementation:Explodinggradients_Ragas_ToolCallF1_Metric
- Tool Call Accuracy Evaluation -- order-dependent accuracy evaluation of tool calls
- Agent Goal Accuracy Evaluation -- evaluating whether the agent achieved its intended goal
- Multi-Turn Evaluation Schema -- the data schema for multi-turn samples