Implementation:Explodinggradients Ragas ToolCallAccuracy Metric
ToolCallAccuracy Metric
ToolCallAccuracy is a multi-turn evaluation metric in the Ragas library that measures how accurately an LLM agent's tool calls match a set of reference tool calls. It evaluates both tool selection (name sequence alignment) and argument correctness, producing a score between 0.0 and 1.0.
Source Location
- File: src/ragas/metrics/_tool_call_accuracy.py (lines 16-181)
- Repository: explodinggradients/ragas
Import
from ragas.metrics import ToolCallAccuracy
Class Definition
@dataclass
class ToolCallAccuracy(MultiTurnMetric):
    name: str = "tool_call_accuracy"
    strict_order: bool = True
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "user_input",
                "reference_tool_calls",
            }
        }
    )
    arg_comparison_metric: SingleTurnMetric = field(
        default_factory=lambda: ExactMatch()
    )
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| strict_order | bool | True | If True, tool calls must match exactly in sequence. If False, tool calls can be in any order (parallel evaluation). |
| arg_comparison_metric | SingleTurnMetric | ExactMatch() | The metric used to compare individual tool call arguments. Any single-turn metric that accepts response and reference fields can be used. |
Required Columns
The metric requires a MultiTurnSample with the following fields:
- user_input -- list of conversation messages (predicted tool calls are extracted from the AIMessage objects in this list)
- reference_tool_calls -- list of expected ToolCall objects
Key Methods
_multi_turn_ascore
async def _multi_turn_ascore(
    self, sample: MultiTurnSample, callbacks: Callbacks
) -> float
The primary scoring method. It:
- Extracts predicted tool calls from all AIMessage objects in sample.user_input
- Retrieves reference tool calls from sample.reference_tool_calls
- If strict_order is False, sorts both lists using a deterministic key based on tool name and sorted arguments
- Checks sequence alignment by comparing tool name lists
- For each aligned pair of tool calls with matching names, computes argument accuracy via _get_arg_score
- Applies coverage penalty if predicted and reference lists differ in length
- Returns average_arg_score * sequence_alignment_factor
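The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: tool calls are modeled as `(name, args)` tuples, arguments are compared by exact string equality, and the coverage penalty is assumed to be the fraction of reference calls that had a predicted counterpart. The names `score_tool_calls`, `sort_key`, and `exact_arg_score` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Stand-in for a ragas ToolCall: (tool name, argument dict).
Call = Tuple[str, Dict[str, Any]]


def sort_key(call: Call) -> Tuple[str, Tuple[Tuple[str, str], ...]]:
    """Deterministic key: tool name, then arguments sorted by key."""
    name, args = call
    return name, tuple(sorted((k, str(v)) for k, v in args.items()))


def exact_arg_score(pred: Dict[str, Any], ref: Dict[str, Any]) -> float:
    """Fraction of reference arguments reproduced exactly."""
    if not ref:
        # Assumption: with no reference arguments, only an empty prediction is perfect.
        return 1.0 if not pred else 0.0
    hits = sum(1 for k, v in ref.items() if k in pred and str(pred[k]) == str(v))
    return hits / len(ref)


def score_tool_calls(pred: List[Call], ref: List[Call], strict_order: bool = True) -> float:
    # No predicted calls scores 0.0; treating an empty reference list
    # the same way is an assumption of this sketch.
    if not pred or not ref:
        return 0.0
    if not strict_order:
        pred, ref = sorted(pred, key=sort_key), sorted(ref, key=sort_key)

    # Sequence alignment: the tool-name lists must be identical.
    aligned = [name for name, _ in pred] == [name for name, _ in ref]

    # Score each positional pair whose names match, averaged over the references.
    paired = list(zip(pred, ref))
    total = sum(exact_arg_score(p[1], r[1]) for p, r in paired if p[0] == r[0])
    average = total / len(ref)

    # Assumed coverage penalty: share of reference calls that were paired.
    coverage = len(paired) / len(ref)
    return average * coverage * (1.0 if aligned else 0.0)
```

In this sketch the alignment factor is either 0 or 1, so any mismatch in the tool-name sequence zeroes the score regardless of how well the arguments match.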
_get_arg_score
async def _get_arg_score(
    self, preds: t.Dict[str, t.Any], refs: t.Dict[str, t.Any], callbacks: Callbacks
) -> float
Computes the argument accuracy for a single tool call pair. For each argument key in the reference, it checks whether the same key exists in the prediction and uses arg_comparison_metric to score the match. The result is the average score across all reference argument keys.
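A minimal sketch of this per-key averaging, assuming the comparison metric can be modeled as an async function taking `(response, reference)` strings and returning a score in [0, 1]. The `Comparator` alias, `exact_match`, and `arg_score` are hypothetical names for illustration; the real method delegates to `arg_comparison_metric`, and the handling of an empty reference dict here is an assumption.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

# Hypothetical comparator signature: (response, reference) -> score in [0, 1].
Comparator = Callable[[str, str], Awaitable[float]]


async def exact_match(response: str, reference: str) -> float:
    """Stand-in for an exact-match comparison metric."""
    return float(response == reference)


async def arg_score(
    preds: Dict[str, Any],
    refs: Dict[str, Any],
    compare: Comparator = exact_match,
) -> float:
    if not refs:
        # Assumption: with nothing to check, only an empty prediction is perfect.
        return 1.0 if not preds else 0.0
    total = 0.0
    for key, ref_value in refs.items():
        if key in preds:  # a reference key missing from the prediction adds 0
            total += await compare(str(preds[key]), str(ref_value))
    # Average over reference keys; extra predicted keys are not inspected.
    return total / len(refs)
```

Because the average runs over reference keys only, a prediction with extra arguments is not penalized in this sketch, while a missing or wrong argument lowers the score proportionally.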
is_sequence_aligned
def is_sequence_aligned(
    self, pred_sequence: t.List[str], ref_sequence: t.List[str]
) -> bool
Compares tool call name sequences. In strict mode, requires exact equality. In flexible mode, sorts both sequences before comparing.
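The two modes can be illustrated with a standalone function. This is a sketch only: in the class, `strict_order` is an instance attribute, whereas here it is passed as a parameter so the example is self-contained.

```python
from typing import List


def is_sequence_aligned(
    pred_sequence: List[str], ref_sequence: List[str], strict_order: bool = True
) -> bool:
    if strict_order:
        # Strict mode: same names in the same positions.
        return pred_sequence == ref_sequence
    # Flexible mode: same names with the same multiplicity, in any order.
    return sorted(pred_sequence) == sorted(ref_sequence)
```

Sorting rather than converting to sets keeps duplicates significant, so calling the same tool twice does not align with calling it once.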
Usage Example
import asyncio

from ragas.metrics import ToolCallAccuracy
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall, ToolMessage

# Create sample with conversation and reference tool calls
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Book a table at Golden Dragon for 8pm"),
        AIMessage(
            content="Let me search for that restaurant.",
            tool_calls=[ToolCall(name="restaurant_search", args={"query": "Golden Dragon"})],
        ),
        ToolMessage(content="Found: Golden Dragon, 123 Main St"),
        AIMessage(
            content="I'll book a table now.",
            tool_calls=[ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"})],
        ),
        ToolMessage(content="Booking confirmed."),
        AIMessage(content="Your table is booked at Golden Dragon for 8pm."),
    ],
    reference_tool_calls=[
        ToolCall(name="restaurant_search", args={"query": "Golden Dragon"}),
        ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"}),
    ],
)


async def main():
    # Evaluate with strict ordering (default).
    # ToolCallAccuracy is a multi-turn metric, so use multi_turn_ascore.
    metric = ToolCallAccuracy()
    score = await metric.multi_turn_ascore(sample)  # float between 0.0 and 1.0

    # Evaluate with flexible ordering
    metric_flexible = ToolCallAccuracy(strict_order=False)
    score_flexible = await metric_flexible.multi_turn_ascore(sample)
    print(score, score_flexible)


asyncio.run(main())
Score Interpretation
| Score | Meaning |
|---|---|
| 1.0 | All tool calls match in name, order, and arguments |
| 0.0 | Tool call sequence does not align, or no predicted tool calls exist |
| 0.0 < score < 1.0 | Sequence aligns but some arguments are incorrect or there is a length mismatch |
Internal Dependencies
- ragas.metrics.base.MultiTurnMetric -- base class providing the multi-turn metric interface
- ragas.metrics._string.ExactMatch -- default argument comparison metric
- ragas.dataset_schema.MultiTurnSample -- input sample schema
- ragas.messages.AIMessage, ragas.messages.ToolCall -- message and tool call data types
See Also
- ToolCallF1 Metric -- F1-based tool call evaluation
- MultiTurnSample Class -- the data schema for multi-turn evaluation samples