Implementation:Explodinggradients Ragas ToolCallF1 Metric

ToolCallF1 Metric

ToolCallF1 is a multi-turn evaluation metric in the Ragas library that computes the F1 score of an AI agent's tool calls by comparing predicted tool calls (extracted from conversation messages) against a reference set of expected tool calls. The comparison is set-based and order-independent: the sequence in which calls are made does not affect the score, and repeated identical calls (same name and arguments) count once.

Source Location

File: src/ragas/metrics/_tool_call_f1.py (lines 25-71)
Repository: explodinggradients/ragas

Import

from ragas.metrics import ToolCallF1

Class Definition

@dataclass
class ToolCallF1(MultiTurnMetric):
    name: str = "tool_call_f1"
    batch_size: int = 1
    is_multi_turn: bool = True
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "reference_tool_calls",
                "user_input",
            }
        }
    )

Constructor Parameters

Parameter	Type	Default	Description
`name`	`str`	`"tool_call_f1"`	The name identifier for this metric.
`batch_size`	`int`	`1`	Batch size for evaluation processing.

Required Columns

The metric requires a MultiTurnSample with the following fields:

user_input -- list of conversation messages; predicted tool calls are extracted from the AIMessage objects in this list
reference_tool_calls -- list of expected ToolCall objects

Key Methods

_multi_turn_ascore

async def _multi_turn_ascore(
    self, sample: MultiTurnSample, callbacks: t.Optional[Callbacks] = None
) -> float

The primary scoring method. It:

Builds the expected set from sample.reference_tool_calls by converting each tool call into a hashable tuple of (name, frozenset_of_args)
Builds the actual set by iterating over sample.user_input, extracting tool calls from every AIMessage that has tool_calls, and converting them to the same hashable representation
Computes set intersections and differences to determine true positives (TP), false positives (FP), and false negatives (FN)
Calculates precision, recall, and F1:

tp = len(actual & expected)
fp = len(actual - expected)
fn = len(expected - actual)

precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

The result is rounded to 4 decimal places.
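
The steps above can be sketched without Ragas installed. This is a minimal illustration, not the library's source: the `Call` alias and the `to_key` and `f1_from_calls` helpers are hypothetical names, tool calls are modelled as plain `(name, args)` pairs, and argument values are assumed to be flat and hashable (nested arguments need the recursive conversion described under `_make_hashable`).

```python
from typing import Any, Dict, List, Tuple

# Hypothetical stand-in for a tool call: (name, args). Not a Ragas type.
Call = Tuple[str, Dict[str, Any]]


def to_key(call: Call) -> tuple:
    """Turn a (name, args) pair into a hashable key (assumes flat, hashable values)."""
    name, args = call
    return (name, frozenset(args.items()))


def f1_from_calls(predicted: List[Call], reference: List[Call]) -> float:
    """Set-based F1 over tool calls, following the formulas shown above."""
    actual = {to_key(c) for c in predicted}
    expected = {to_key(c) for c in reference}

    tp = len(actual & expected)  # calls made and expected
    fp = len(actual - expected)  # calls made but not expected
    fn = len(expected - actual)  # calls expected but not made

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return round(f1, 4)


reference = [
    ("restaurant_search", {"cuisine": "Chinese"}),
    ("restaurant_book", {"name": "Golden Dragon", "time": "8pm"}),
]
predicted = [
    # Same call as the reference booking; only the key order differs.
    ("restaurant_book", {"time": "8pm", "name": "Golden Dragon"}),
    # An extra call that the reference does not contain.
    ("weather", {"city": "Paris"}),
]

score = f1_from_calls(predicted, reference)
print(score)  # 0.5: one match, one extra call, one missed call
```

Here tp = 1, fp = 1, and fn = 1, so precision and recall are both 0.5 and the F1 score is 0.5.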

Helper Function: _make_hashable

def _make_hashable(obj: t.Any) -> t.Any

A module-level utility function (lines 14-22) that recursively converts arbitrary Python objects into hashable representations suitable for set operations:

dict becomes frozenset of key-value pairs (recursively)
list and tuple become tuple (recursively)
set becomes frozenset (recursively)
All other types are returned as-is

This enables tool calls with nested dictionary or list arguments to be compared using Python set operations.
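
The conversion rules listed above can be sketched as follows. This is a reconstruction from the description on this page, not a copy of the library's code; the name `make_hashable` (without the leading underscore) is used to make that explicit.

```python
from typing import Any


def make_hashable(obj: Any) -> Any:
    """Recursively convert containers into hashable equivalents."""
    if isinstance(obj, dict):
        # Key order is irrelevant once the pairs live in a frozenset.
        return frozenset((k, make_hashable(v)) for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        # Element order is preserved, so it still matters for equality.
        return tuple(make_hashable(v) for v in obj)
    if isinstance(obj, set):
        return frozenset(make_hashable(v) for v in obj)
    return obj  # assumed to be hashable already (str, int, None, ...)


a = make_hashable({"filters": {"cuisine": "Chinese", "tags": ["spicy", "vegan"]}})
b = make_hashable({"filters": {"tags": ["spicy", "vegan"], "cuisine": "Chinese"}})
c = make_hashable({"filters": {"cuisine": "Chinese", "tags": ["vegan", "spicy"]}})

print(a == b)          # True: dict key order does not matter
print(a == c)          # False: list order is preserved in the tuple
print(len({a, b, c}))  # 2: the values can be stored in a set
```

One consequence of these rules, under this sketch, is that two calls whose list arguments hold the same items in a different order are treated as different calls.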

Usage Example

from ragas.metrics import ToolCallF1
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall, ToolMessage

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Find Chinese restaurants and book one"),
        AIMessage(
            content="Searching...",
            tool_calls=[
                ToolCall(name="restaurant_search", args={"cuisine": "Chinese"}),
                ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"})
            ]
        ),
        ToolMessage(content="Found and booked."),
        AIMessage(content="Done!")
    ],
    reference_tool_calls=[
        ToolCall(name="restaurant_search", args={"cuisine": "Chinese"}),
        ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"})
    ]
)

metric = ToolCallF1()
# Run inside an async function (or a notebook that supports top-level await):
# score = await metric._multi_turn_ascore(sample)
# Expected: 1.0 (perfect precision and recall)

Score Interpretation

Score	Meaning
1.0	All reference tool calls were made, and no extra tool calls were made
0.0	No overlap between predicted and reference tool calls (by the formulas above, also the result when both sets are empty)
0.0 < score < 1.0	Some tool calls matched; there may be missed calls (lower recall) or extra calls (lower precision)
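
To make the table concrete, the short sketch below maps a few (TP, FP, FN) counts to scores using the formulas from this page. The helper name `f1_from_counts` is hypothetical, and the scenarios assume two reference tool calls unless noted.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from match counts, rounded to 4 decimal places."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return round(2 * precision * recall / (precision + recall), 4)


perfect = f1_from_counts(tp=2, fp=0, fn=0)     # both calls made, nothing extra
one_extra = f1_from_counts(tp=2, fp=1, fn=0)   # both calls made, plus one extra
one_missed = f1_from_counts(tp=1, fp=0, fn=1)  # one of the two calls missing
no_overlap = f1_from_counts(tp=0, fp=2, fn=2)  # two wrong calls, both missed
both_empty = f1_from_counts(tp=0, fp=0, fn=0)  # no predicted and no reference calls

print(perfect, one_extra, one_missed, no_overlap, both_empty)
# 1.0 0.8 0.6667 0.0 0.0
```

An extra call lowers precision only (2/3 here), while a missed call lowers recall only (1/2 here), which is why the two partial cases score differently.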

Internal Dependencies

ragas.metrics.base.MultiTurnMetric -- base class providing the multi-turn metric interface
ragas.dataset_schema.MultiTurnSample -- input sample schema
ragas.messages.AIMessage -- message type from which tool calls are extracted

Implements

Principle:Explodinggradients_Ragas_Tool_Call_F1_Evaluation
