Implementation:Explodinggradients Ragas ToolCallAccuracy Metric

ToolCallAccuracy Metric

ToolCallAccuracy is a multi-turn evaluation metric in the Ragas library that measures how accurately an LLM agent's tool calls match a set of reference tool calls. It evaluates both tool selection (name sequence alignment) and argument correctness, producing a score between 0.0 and 1.0.

Source Location

File: src/ragas/metrics/_tool_call_accuracy.py (lines 16-181)
Repository: explodinggradients/ragas

Import

from ragas.metrics import ToolCallAccuracy

Class Definition

@dataclass
class ToolCallAccuracy(MultiTurnMetric):
    name: str = "tool_call_accuracy"
    strict_order: bool = True
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "user_input",
                "reference_tool_calls",
            }
        }
    )
    arg_comparison_metric: SingleTurnMetric = field(
        default_factory=lambda: ExactMatch()
    )

Constructor Parameters

Parameter	Type	Default	Description
`strict_order`	`bool`	`True`	If True, predicted tool calls must appear in the same order as the reference calls. If False, both lists are sorted before comparison, so call order is ignored (useful when tools are invoked in parallel).
`arg_comparison_metric`	`SingleTurnMetric`	`ExactMatch()`	The metric used to compare individual tool call arguments. Any single-turn metric that accepts `response` and `reference` fields can be used.

Required Columns

The metric requires a MultiTurnSample with the following fields:

user_input -- list of conversation messages (from which predicted tool calls in AIMessage objects are extracted)
reference_tool_calls -- list of expected ToolCall objects
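
To make the extraction of predicted tool calls concrete, here is a minimal sketch. It uses hypothetical stand-in dataclasses rather than the real ragas.messages types, so it runs without Ragas installed; the helper name extract_predicted_tool_calls is illustrative and is not part of the library.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


# Stand-ins for ragas.messages types (assumption: only the fields used here).
@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]


@dataclass
class HumanMessage:
    content: str


@dataclass
class AIMessage:
    content: str
    tool_calls: Optional[List[ToolCall]] = None


def extract_predicted_tool_calls(user_input: List[Any]) -> List[ToolCall]:
    """Collect tool calls from every AIMessage, in conversation order."""
    predicted: List[ToolCall] = []
    for message in user_input:
        # Only AI messages can carry tool calls; skip messages without any.
        if isinstance(message, AIMessage) and message.tool_calls:
            predicted.extend(message.tool_calls)
    return predicted
```

Messages of other types, and AI messages without tool calls, contribute nothing to the predicted list.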

Key Methods

_multi_turn_ascore

async def _multi_turn_ascore(
    self, sample: MultiTurnSample, callbacks: Callbacks
) -> float

The primary scoring method. It:

1. Extracts predicted tool calls from all AIMessage objects in sample.user_input
2. Retrieves reference tool calls from sample.reference_tool_calls
3. If strict_order is False, sorts both lists using a deterministic key based on tool name and sorted arguments
4. Checks sequence alignment by comparing tool name lists
5. For each aligned pair of tool calls with matching names, computes argument accuracy via _get_arg_score
6. Applies coverage penalty if predicted and reference lists differ in length
7. Returns average_arg_score * sequence_alignment_factor
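
The steps above can be sketched in plain Python. This is a simplified, synchronous illustration and not the library's code: tool calls are modeled as (name, args) tuples, arguments are compared by exact string match, and the coverage penalty is assumed to be the fraction of reference calls that had a predicted counterpart. The names sort_key, arg_score, and sketch_score are hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple

Call = Tuple[str, Dict[str, Any]]  # (tool name, arguments); stand-in for ToolCall


def sort_key(call: Call) -> Tuple[str, str]:
    """Deterministic key: tool name, then arguments serialized with sorted keys."""
    name, args = call
    return name, json.dumps(args, sort_keys=True, default=str)


def arg_score(pred: Dict[str, Any], ref: Dict[str, Any]) -> float:
    """Exact-match stand-in for arg_comparison_metric, averaged over reference keys."""
    if not ref:
        return 1.0  # assumption: no reference arguments means nothing to get wrong
    hits = sum(1.0 for k, v in ref.items() if k in pred and str(pred[k]) == str(v))
    return hits / len(ref)


def sketch_score(pred: List[Call], ref: List[Call], strict_order: bool = True) -> float:
    if not pred or not ref:
        return 0.0  # assumption for this sketch: nothing to compare scores zero
    if not strict_order:
        pred = sorted(pred, key=sort_key)
        ref = sorted(ref, key=sort_key)
    # Alignment factor: 1.0 when the name sequences are identical, else 0.0.
    aligned = [name for name, _ in pred] == [name for name, _ in ref]
    total = 0.0
    for (pred_name, pred_args), (ref_name, ref_args) in zip(pred, ref):
        if pred_name == ref_name:
            total += arg_score(pred_args, ref_args)
    average_arg_score = total / len(ref)
    # Assumed coverage penalty: share of reference calls that were paired.
    coverage = min(len(pred), len(ref)) / len(ref)
    return average_arg_score * coverage * (1.0 if aligned else 0.0)
```

Under these assumptions, two calls in the right order where one of the second call's two arguments is wrong score (1.0 + 0.5) / 2 = 0.75, and swapping the order of two otherwise correct calls scores 0.0 in strict mode but 1.0 with strict_order=False.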

_get_arg_score

async def _get_arg_score(
    self, preds: t.Dict[str, t.Any], refs: t.Dict[str, t.Any], callbacks: Callbacks
) -> float

Computes the argument accuracy for a single tool call pair. For each argument key in the reference, it checks whether the same key exists in the prediction and uses arg_comparison_metric to score the match. The result is the average score across all reference argument keys.
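
A rough sketch of this per-key averaging, with the comparison step made pluggable, might look as follows. The comparator functions stand in for arg_comparison_metric and are hypothetical; the real method passes values to a SingleTurnMetric rather than to a plain coroutine, and the handling of an empty reference dict here is an assumption.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Comparator = Callable[[str, str], Awaitable[float]]


async def exact_match(response: str, reference: str) -> float:
    """Stand-in for the default ExactMatch metric: 1.0 on equality, else 0.0."""
    return float(response == reference)


async def case_insensitive(response: str, reference: str) -> float:
    """Hypothetical alternative comparator that ignores letter case."""
    return float(response.lower() == reference.lower())


async def get_arg_score_sketch(
    preds: Dict[str, Any],
    refs: Dict[str, Any],
    compare: Comparator = exact_match,
) -> float:
    if not refs:
        return 1.0  # assumption: nothing to check, avoid dividing by zero
    score = 0.0
    for key, ref_value in refs.items():
        if key in preds:
            score += await compare(str(preds[key]), str(ref_value))
        # A reference key missing from the prediction contributes 0.0.
    return score / len(refs)
```

Because the loop iterates over reference keys only, extra arguments in the prediction do not lower the score in this sketch, while each missing or mismatched reference argument does.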

is_sequence_aligned

def is_sequence_aligned(
    self, pred_sequence: t.List[str], ref_sequence: t.List[str]
) -> bool

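Compares the predicted and reference tool name sequences. With strict_order=True, the two lists must be equal element by element. With strict_order=False, both sequences are sorted before comparison, so the same names must appear the same number of times but in any order.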
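
The comparison described above is small enough to sketch as a standalone function. This version takes strict_order as a parameter instead of reading it from the metric instance, which is a simplification for illustration.

```python
from typing import List


def is_sequence_aligned_sketch(
    pred_sequence: List[str], ref_sequence: List[str], strict_order: bool = True
) -> bool:
    if strict_order:
        # Names must match position by position.
        return pred_sequence == ref_sequence
    # Order is ignored, but duplicates still count (multiset equality).
    return sorted(pred_sequence) == sorted(ref_sequence)
```

Note that sorting preserves duplicates, so a prediction that calls a tool twice does not align with a reference that calls it once, even when order is ignored.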

Usage Example

from ragas.metrics import ToolCallAccuracy
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall, ToolMessage

# Create sample with conversation and reference tool calls
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Book a table at Golden Dragon for 8pm"),
        AIMessage(
            content="Let me search for that restaurant.",
            tool_calls=[ToolCall(name="restaurant_search", args={"query": "Golden Dragon"})]
        ),
        ToolMessage(content="Found: Golden Dragon, 123 Main St"),
        AIMessage(
            content="I'll book a table now.",
            tool_calls=[ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"})]
        ),
        ToolMessage(content="Booking confirmed."),
        AIMessage(content="Your table is booked at Golden Dragon for 8pm.")
    ],
    reference_tool_calls=[
        ToolCall(name="restaurant_search", args={"query": "Golden Dragon"}),
        ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"})
    ]
)

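# Evaluate with strict ordering (default).
# ToolCallAccuracy is a multi-turn metric, so use the multi-turn scoring API.
# In async code, use `await metric.multi_turn_ascore(sample)` instead.
metric = ToolCallAccuracy()
score = metric.multi_turn_score(sample)  # Returns float between 0.0 and 1.0

# Evaluate with flexible ordering
metric_flexible = ToolCallAccuracy(strict_order=False)
score_flexible = metric_flexible.multi_turn_score(sample)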

Score Interpretation

Score	Meaning
1.0	All tool calls match in name, order, and arguments
0.0	Tool call sequence does not align, or no predicted tool calls exist
0.0 < score < 1.0	Sequence aligns but some arguments are incorrect or there is a length mismatch

Internal Dependencies

ragas.metrics.base.MultiTurnMetric -- base class providing the multi-turn metric interface
ragas.metrics._string.ExactMatch -- default argument comparison metric
ragas.dataset_schema.MultiTurnSample -- input sample schema
ragas.messages.AIMessage, ragas.messages.ToolCall -- message and tool call data types

Implements

Principle:Explodinggradients_Ragas_Tool_Call_Accuracy_Evaluation
