Implementation:Explodinggradients Ragas ToolCallF1 Metric
ToolCallF1 Metric
ToolCallF1 is a multi-turn evaluation metric in the Ragas library that computes the F1 score of an AI agent's tool calls by comparing predicted tool calls (extracted from conversation messages) against a reference set of expected tool calls. The comparison is set-based and order-independent.
Source Location
- File: src/ragas/metrics/_tool_call_f1.py (lines 25-71)
- Repository: explodinggradients/ragas
Import
from ragas.metrics import ToolCallF1
Class Definition
@dataclass
class ToolCallF1(MultiTurnMetric):
    name: str = "tool_call_f1"
    batch_size: int = 1
    is_multi_turn: bool = True
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "reference_tool_calls",
                "user_input",
            }
        }
    )
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | "tool_call_f1" | The name identifier for this metric. |
| batch_size | int | 1 | Batch size for evaluation processing. |
Required Columns
The metric requires a MultiTurnSample with the following fields:
- user_input -- list of conversation messages (predicted tool calls are extracted from the AIMessage objects in this list)
- reference_tool_calls -- list of expected ToolCall objects
Key Methods
_multi_turn_ascore
async def _multi_turn_ascore(
    self, sample: MultiTurnSample, callbacks: t.Optional[Callbacks] = None
) -> float
The primary scoring method. It:
- Builds the expected set from sample.reference_tool_calls by converting each tool call into a hashable tuple of (name, frozenset_of_args)
- Builds the actual set by iterating over sample.user_input, extracting tool calls from every AIMessage that has tool_calls, and converting them to the same hashable representation
- Computes set intersections and differences to determine true positives (TP), false positives (FP), and false negatives (FN)
- Calculates precision, recall, and F1:
tp = len(actual & expected)
fp = len(actual - expected)
fn = len(expected - actual)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
The result is rounded to 4 decimal places.
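The computation above can be sketched without Ragas at all. The following is a minimal standalone illustration, not the library's source: the function name `tool_call_f1`, the `Call` alias, and the sample tool names are hypothetical, and each tool call is assumed to be already reduced to a hashable (name, args) pair.

```python
from typing import Hashable, Iterable, Tuple

# Hypothetical alias: (tool name, hashable form of the arguments).
Call = Tuple[str, Hashable]


def tool_call_f1(predicted: Iterable[Call], reference: Iterable[Call]) -> float:
    """Set-based F1 over hashable (name, args) pairs, rounded to 4 places."""
    actual, expected = set(predicted), set(reference)
    tp = len(actual & expected)   # calls present in both sets
    fp = len(actual - expected)   # predicted but not expected
    fn = len(expected - actual)   # expected but never predicted
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return round(f1, 4)


# Illustrative calls; flat argument dicts are frozen via their items.
search = ("restaurant_search", frozenset({"cuisine": "Chinese"}.items()))
book = ("restaurant_book", frozenset({"name": "Golden Dragon", "time": "8pm"}.items()))
weather = ("get_weather", frozenset({"city": "Paris"}.items()))

perfect = tool_call_f1([book, search], [search, book])            # order is ignored
extra = tool_call_f1([search, book, weather], [search, book])     # one unexpected call
duplicate = tool_call_f1([search, search, book], [search, book])  # duplicates collapse
```

Because both sides are sets, repeating a call or reordering calls leaves the score unchanged; only the presence or absence of each distinct (name, args) pair matters.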
Helper Function: _make_hashable
def _make_hashable(obj: t.Any) -> t.Any
A module-level utility function (lines 14-22) that recursively converts arbitrary Python objects into hashable representations suitable for set operations:
- dict becomes frozenset of key-value pairs (recursively)
- list and tuple become tuple (recursively)
- set becomes frozenset (recursively)
- All other types are returned as-is
This enables tool calls with nested dictionary or list arguments to be compared using Python set operations.
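A helper with the behaviour listed above might look roughly like the sketch below. It is reconstructed from the description, not copied from the Ragas source, so the name `make_hashable` (without the leading underscore) and the sample arguments are illustrative assumptions.

```python
from typing import Any


def make_hashable(obj: Any) -> Any:
    """Recursively convert obj into a hashable equivalent."""
    if isinstance(obj, dict):
        # Key order is irrelevant once the pairs live in a frozenset.
        return frozenset((key, make_hashable(value)) for key, value in obj.items())
    if isinstance(obj, (list, tuple)):
        # Sequence order is preserved, so it still affects equality.
        return tuple(make_hashable(item) for item in obj)
    if isinstance(obj, set):
        return frozenset(make_hashable(item) for item in obj)
    return obj  # assumed to be hashable already (str, int, None, ...)


# Hypothetical nested arguments, written with different key orders.
nested_args = {"filters": {"cuisine": "Chinese", "tags": ["vegan", "late"]}, "limit": 5}
same_args_reordered = {"limit": 5, "filters": {"tags": ["vegan", "late"], "cuisine": "Chinese"}}

key_a = make_hashable(nested_args)
key_b = make_hashable(same_args_reordered)

# Both calls collapse to one set element because their hashable forms are equal.
calls = {("restaurant_search", key_a), ("restaurant_search", key_b)}
```

Note that under this scheme dictionary key order does not matter, but the order of items inside a list argument does, since lists are converted to tuples.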
Usage Example
from ragas.metrics import ToolCallF1
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall, ToolMessage

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Find Chinese restaurants and book one"),
        AIMessage(
            content="Searching...",
            tool_calls=[
                ToolCall(name="restaurant_search", args={"cuisine": "Chinese"}),
                ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"}),
            ],
        ),
        ToolMessage(content="Found and booked."),
        AIMessage(content="Done!"),
    ],
    reference_tool_calls=[
        ToolCall(name="restaurant_search", args={"cuisine": "Chinese"}),
        ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"}),
    ],
)

metric = ToolCallF1()
# score = await metric._multi_turn_ascore(sample)
# Expected: 1.0 (perfect precision and recall)
Score Interpretation
| Score | Meaning |
|---|---|
| 1.0 | All reference tool calls were made, and no extra tool calls were made |
| 0.0 | No overlap between predicted and reference tool calls |
| 0.0 < score < 1.0 | Some tool calls matched; there may be missed calls (lower recall) or extra calls (lower precision) |
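As a worked illustration of an intermediate score (the counts are chosen for illustration only): suppose the reference lists three tool calls and the agent makes two, of which one matches. Then TP = 1, FP = 1, FN = 2, and the formulas from _multi_turn_ascore give:

```latex
\begin{aligned}
\text{precision} &= \frac{TP}{TP + FP} = \frac{1}{1 + 1} = \frac{1}{2} \\
\text{recall}    &= \frac{TP}{TP + FN} = \frac{1}{1 + 2} = \frac{1}{3} \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
     = \frac{2 \cdot \tfrac{1}{2} \cdot \tfrac{1}{3}}{\tfrac{1}{2} + \tfrac{1}{3}}
     = \frac{1/3}{5/6} = 0.4
\end{aligned}
```

Here recall is the weaker component, which indicates that missed calls, more than extra calls, are pulling the score down.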
Internal Dependencies
- ragas.metrics.base.MultiTurnMetric -- base class providing the multi-turn metric interface
- ragas.dataset_schema.MultiTurnSample -- input sample schema
- ragas.messages.AIMessage -- message type from which tool calls are extracted
See Also
- ToolCallAccuracy Metric -- order-dependent tool call accuracy evaluation
- MultiTurnSample Class -- the data schema for multi-turn evaluation samples