Principle:Explodinggradients Ragas Tool Call F1 Evaluation

Tool Call F1 Evaluation

Tool Call F1 Evaluation is the principle of measuring the precision and recall of tool calls made by an AI agent relative to a set of expected (reference) tool calls. Unlike order-dependent accuracy metrics, F1 evaluation treats tool calls as sets and assesses whether the agent made the correct calls regardless of their sequence.

Theoretical Foundation

Precision, Recall, and F1 for Tool Calls

When evaluating an AI agent's tool-calling behavior, two complementary questions arise:

Precision: Of all the tool calls the agent made, how many were correct? This penalizes the agent for making unnecessary or incorrect tool calls.
Recall: Of all the tool calls that should have been made, how many did the agent actually make? This penalizes the agent for missing required tool calls.

The F1 score is the harmonic mean of precision and recall:

F1 = 2 * (precision * recall) / (precision + recall)

This provides a single balanced metric that rewards agents for both making all necessary tool calls and avoiding unnecessary ones.
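As a worked illustration of the formula above, suppose an agent makes 4 tool calls, 3 of which are correct, while the reference lists 5 calls. The sketch below is illustrative only; the function name f1_score is hypothetical, and returning 0.0 when precision and recall are both zero is a common convention assumed here, not something the text above specifies.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Assumption: returns 0.0 when both inputs are zero, since the
    formula is otherwise undefined (division by zero).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


# 3 of the agent's 4 calls were correct; 3 of the 5 expected calls were made.
precision = 3 / 4  # 0.75
recall = 3 / 5     # 0.60

print(round(f1_score(precision, recall), 4))  # 0.6667
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the result (about 0.667) sits below the arithmetic mean of 0.675.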

Set-Based Comparison

For F1 computation, each tool call is represented as a tuple of its name and its arguments. Both predicted and reference tool calls are collected into sets, so repeated identical calls count only once. The evaluation then computes:

True positives (TP): Tool calls present in both the predicted and reference sets
False positives (FP): Tool calls present in the predicted set but not in the reference set
False negatives (FN): Tool calls present in the reference set but not in the predicted set

This set-based approach is order-independent by design -- it does not matter in which sequence the agent made its tool calls, only whether the right set of calls was made.
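The set-based counts described above can be sketched with plain Python set operations. This is a minimal illustration, not the Ragas implementation: the function name tool_call_f1, the example tool names, and the zero-denominator handling are all assumptions, and tool calls are assumed to be already encoded as hashable (name, arguments) tuples.

```python
def tool_call_f1(predicted: set, reference: set) -> dict:
    """Compute set-based precision, recall, and F1 for tool calls.

    Assumption: each element is a hashable (name, arguments) tuple,
    and any ratio with a zero denominator is reported as 0.0.
    """
    tp = len(predicted & reference)  # calls in both sets
    fp = len(predicted - reference)  # extra calls the agent made
    fn = len(reference - predicted)  # expected calls the agent missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical tool calls; arguments are pre-encoded as tuples of pairs.
reference = {
    ("search", (("query", "weather"),)),
    ("get_time", ()),
}
predicted = {
    ("get_time", ()),                      # correct, made in a different order
    ("search", (("query", "weather"),)),   # correct
    ("send_email", ()),                    # unnecessary extra call
}

result = tool_call_f1(predicted, reference)
print(result["tp"], result["fp"], result["fn"])  # 2 1 0
```

Here the agent made every required call (recall of 1.0) plus one extra, so precision drops to 2/3 and F1 comes out at roughly 0.8.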

Hashable Representation

To enable set operations, tool call arguments (which may contain nested dictionaries or lists) must be converted to a hashable representation. Dictionary arguments become frozensets of key-value pairs, lists become tuples, and sets become frozensets, applied recursively. This ensures that structurally identical tool calls are recognized as equal regardless of argument ordering within dictionaries.
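The recursive conversion described above might look like the following sketch. The helper names make_hashable and call_key are hypothetical, and handling tuples the same way as lists is an added assumption; this illustrates the idea and is not the Ragas source.

```python
def make_hashable(value):
    """Recursively convert a value into a hashable equivalent.

    dict -> frozenset of (key, converted value) pairs
    list -> tuple (tuples are treated the same way; an assumption)
    set  -> frozenset
    Anything else is returned unchanged and assumed hashable.
    """
    if isinstance(value, dict):
        return frozenset((k, make_hashable(v)) for k, v in value.items())
    if isinstance(value, (list, tuple)):
        return tuple(make_hashable(v) for v in value)
    if isinstance(value, (set, frozenset)):
        return frozenset(make_hashable(v) for v in value)
    return value


def call_key(name: str, args: dict) -> tuple:
    """Represent a tool call as a hashable (name, arguments) tuple."""
    return (name, make_hashable(args))


# The same call written with different dictionary key order.
a = call_key("search", {"query": "weather", "filters": {"days": [1, 2]}})
b = call_key("search", {"filters": {"days": [1, 2]}, "query": "weather"})

print(a == b)       # True
print(len({a, b}))  # 1
```

Key order inside dictionaries is ignored, but element order inside lists still matters, because lists map to tuples rather than to frozensets.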

Relationship to Other Concepts

Tool Call F1 is complementary to Tool Call Accuracy Evaluation. While accuracy evaluates whether the correct tools were called in the correct order with the correct arguments (a stricter criterion), F1 provides a more lenient set-based view that is better suited for scenarios where:

Tool calls can be executed in parallel
The order of tool calls does not affect the outcome
The evaluator wants to separately reason about missed vs. extra tool calls

Both metrics operate on multi-turn conversation samples containing reference tool calls.
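To make the contrast concrete, the sketch below compares a strict ordered match with a set-based match on the same pair of call sequences. The ordered check is a deliberately simplified stand-in for an order-dependent criterion, not the scoring used by Tool Call Accuracy, and the tool names are hypothetical.

```python
# Hypothetical tool calls, each encoded as a hashable (name, arguments) tuple.
reference = [
    ("lookup_user", (("user_id", 42),)),
    ("fetch_orders", (("user_id", 42),)),
]
# The agent made the same two calls in the opposite order.
predicted = list(reversed(reference))

# Order-dependent view: the sequences must match position by position.
ordered_match = predicted == reference

# Set-based view: only membership matters.
set_match = set(predicted) == set(reference)

print(ordered_match)  # False
print(set_match)      # True
```

If the two calls are independent of each other, for example because they can run in parallel, the set-based view avoids penalizing a reordering that does not change the outcome.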

Implemented By

ToolCallF1 Metric -- the Ragas metric class that implements this evaluation principle
