Implementation:Explodinggradients Ragas ToolCallF1 Metric

From Leeroopedia


ToolCallF1 Metric

ToolCallF1 is a multi-turn evaluation metric in the Ragas library that computes the F1 score of an AI agent's tool calls by comparing predicted tool calls (extracted from conversation messages) against a reference set of expected tool calls. The comparison is set-based and order-independent.

Source Location

Import

from ragas.metrics import ToolCallF1

Class Definition

@dataclass
class ToolCallF1(MultiTurnMetric):
    name: str = "tool_call_f1"
    batch_size: int = 1
    is_multi_turn: bool = True
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "reference_tool_calls",
                "user_input",
            }
        }
    )

Constructor Parameters

Parameter | Type | Default | Description
name | str | "tool_call_f1" | The name identifier for this metric.
batch_size | int | 1 | Batch size for evaluation processing.

Required Columns

The metric requires a MultiTurnSample with the following fields:

  • user_input -- list of conversation messages; the predicted tool calls are extracted from the AIMessage objects in this list
  • reference_tool_calls -- list of expected ToolCall objects

Key Methods

_multi_turn_ascore

async def _multi_turn_ascore(
    self, sample: MultiTurnSample, callbacks: t.Optional[Callbacks] = None
) -> float

The primary scoring method. It:

  1. Builds the expected set from sample.reference_tool_calls by converting each tool call into a hashable tuple of (name, frozenset_of_args)
  2. Builds the actual set by iterating over sample.user_input, extracting tool calls from every AIMessage that has tool_calls, and converting them to the same hashable representation
  3. Computes set intersections and differences to determine true positives (TP), false positives (FP), and false negatives (FN)
  4. Calculates precision, recall, and F1:
tp = len(actual & expected)
fp = len(actual - expected)
fn = len(expected - actual)

precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

The result is rounded to 4 decimal places.
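
The steps above can be sketched with plain Python data. This is a minimal illustration under stated assumptions, not the Ragas source: tool calls are modeled as (name, args) pairs with flat, hashable argument values, and the names Call, to_key, and tool_call_f1 are hypothetical.

```python
from typing import Any, Dict, Iterable, Set, Tuple

# Hypothetical stand-in for a tool call: (name, args) with flat, hashable arg values.
Call = Tuple[str, Dict[str, Any]]


def to_key(call: Call) -> Tuple[str, frozenset]:
    """Convert a call into a hashable (name, frozenset_of_args) tuple."""
    name, args = call
    return (name, frozenset(args.items()))


def tool_call_f1(predicted: Iterable[Call], reference: Iterable[Call]) -> float:
    """Set-based F1 over tool calls, following the formulas above."""
    actual: Set = {to_key(c) for c in predicted}
    expected: Set = {to_key(c) for c in reference}

    tp = len(actual & expected)
    fp = len(actual - expected)
    fn = len(expected - actual)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return round(f1, 4)


reference = [
    ("restaurant_search", {"cuisine": "Chinese"}),
    ("restaurant_book", {"name": "Golden Dragon", "time": "8pm"}),
]
# Same calls in reverse order: order does not affect the score.
reordered = list(reversed(reference))
# One correct call plus one call with a different argument value.
partial = [
    ("restaurant_search", {"cuisine": "Chinese"}),
    ("restaurant_book", {"name": "Golden Dragon", "time": "7pm"}),
]

print(tool_call_f1(reordered, reference))  # 1.0
print(tool_call_f1(partial, reference))    # 0.5
```

In this sketch the comparison is on whole (name, args) tuples, so a call with the right name but a different argument value counts as both a false positive and a false negative.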

Helper Function: _make_hashable

def _make_hashable(obj: t.Any) -> t.Any

A module-level utility function (lines 14-22) that recursively converts arbitrary Python objects into hashable representations suitable for set operations:

  • dict becomes frozenset of key-value pairs (recursively)
  • list and tuple become tuple (recursively)
  • set becomes frozenset (recursively)
  • All other types are returned as-is

This enables tool calls with nested dictionary or list arguments to be compared using Python set operations.
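
The conversion rules listed above can be sketched as follows. This is a reconstruction from the description only, not the library's code; the function is named make_hashable here (without the leading underscore) to mark it as an illustrative stand-in, and the sample arguments are invented.

```python
from typing import Any


def make_hashable(obj: Any) -> Any:
    """Recursively convert containers into hashable equivalents."""
    if isinstance(obj, dict):
        # dict -> frozenset of (key, hashable value) pairs
        return frozenset((k, make_hashable(v)) for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        # list / tuple -> tuple, preserving element order
        return tuple(make_hashable(v) for v in obj)
    if isinstance(obj, set):
        # set -> frozenset
        return frozenset(make_hashable(v) for v in obj)
    # Everything else is returned unchanged.
    return obj


args_a = {"filters": {"cuisine": "Chinese", "tags": ["spicy", "vegan"]}, "limit": 5}
args_b = {"limit": 5, "filters": {"tags": ["spicy", "vegan"], "cuisine": "Chinese"}}
args_c = {"filters": {"cuisine": "Chinese", "tags": ["vegan", "spicy"]}, "limit": 5}

# Dict key order does not matter, so both forms compare equal.
assert make_hashable(args_a) == make_hashable(args_b)
# List order is preserved by the tuple conversion, so reordering changes the result.
assert make_hashable(args_a) != make_hashable(args_c)
# The converted value can be placed in a set alongside the tool name.
calls = {("restaurant_search", make_hashable(args_a))}
```

One consequence of these rules, as sketched, is that list-valued arguments are order-sensitive while dictionary-valued arguments are not.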

Usage Example

from ragas.metrics import ToolCallF1
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall, ToolMessage

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Find Chinese restaurants and book one"),
        AIMessage(
            content="Searching...",
            tool_calls=[
                ToolCall(name="restaurant_search", args={"cuisine": "Chinese"}),
                ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"})
            ]
        ),
        ToolMessage(content="Found and booked."),
        AIMessage(content="Done!")
    ],
    reference_tool_calls=[
        ToolCall(name="restaurant_search", args={"cuisine": "Chinese"}),
        ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"})
    ]
)

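metric = ToolCallF1()

# _multi_turn_ascore is a coroutine: await it inside an async function or a
# notebook cell, or drive it from synchronous code with asyncio.run(...).
# score = await metric._multi_turn_ascore(sample)
# Expected: 1.0 (perfect precision and recall)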

Score Interpretation

Score | Meaning
1.0 | All reference tool calls were made, and no extra tool calls were made
0.0 | No overlap between predicted and reference tool calls
0.0 < score < 1.0 | Some tool calls matched; there may be missed calls (lower recall) or extra calls (lower precision)
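
As an illustration of an intermediate score, consider a hypothetical case that is not taken from the Ragas documentation. The reference contains two distinct calls, A and B, and the agent makes three distinct calls, A, C, and D. Applying the precision, recall, and F1 formulas from the scoring method gives:

```latex
\begin{align*}
\text{expected} &= \{A, B\}, \qquad \text{actual} = \{A, C, D\} \\
TP &= 1, \quad FP = 2, \quad FN = 1 \\
\text{precision} &= \frac{TP}{TP + FP} = \frac{1}{3} \\
\text{recall} &= \frac{TP}{TP + FN} = \frac{1}{2} \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
     = \frac{2 \cdot \tfrac{1}{3} \cdot \tfrac{1}{2}}{\tfrac{1}{3} + \tfrac{1}{2}}
     = \frac{1/3}{5/6} = 0.4
\end{align*}
```

The two extra calls pull precision below recall, and the harmonic mean lands closer to the lower of the two values.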

Internal Dependencies

  • ragas.metrics.base.MultiTurnMetric -- base class providing the multi-turn metric interface
  • ragas.dataset_schema.MultiTurnSample -- input sample schema
  • ragas.messages.AIMessage -- message type from which tool calls are extracted
