Implementation:Explodinggradients Ragas ToolCallAccuracy Metric

ToolCallAccuracy Metric

ToolCallAccuracy is a multi-turn evaluation metric in the Ragas library that measures how accurately an LLM agent's tool calls match a set of reference tool calls. It evaluates both tool selection (name sequence alignment) and argument correctness, producing a score between 0.0 and 1.0.

Import

from ragas.metrics import ToolCallAccuracy

Class Definition

@dataclass
class ToolCallAccuracy(MultiTurnMetric):
    name: str = "tool_call_accuracy"
    strict_order: bool = True
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "user_input",
                "reference_tool_calls",
            }
        }
    )
    arg_comparison_metric: SingleTurnMetric = field(
        default_factory=lambda: ExactMatch()
    )

Constructor Parameters

Parameter | Type | Default | Description
strict_order | bool | True | If True, tool calls must match exactly in sequence. If False, tool calls can be in any order (parallel evaluation).
arg_comparison_metric | SingleTurnMetric | ExactMatch() | The metric used to compare individual tool call arguments. Any single-turn metric that accepts response and reference fields can be used.

Required Columns

The metric requires a MultiTurnSample with the following fields:

  • user_input -- list of conversation messages (from which predicted tool calls in AIMessage objects are extracted)
  • reference_tool_calls -- list of expected ToolCall objects
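
As a rough illustration of how predicted tool calls can be gathered from user_input, the sketch below uses minimal stand-in dataclasses rather than the real ragas.messages types. The class definitions and the helper name extract_predicted_tool_calls are assumptions made for this example, not part of the Ragas API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


# Hypothetical stand-ins for the ragas.messages types, reduced to the
# fields this example needs.
@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


@dataclass
class HumanMessage:
    content: str


@dataclass
class AIMessage:
    content: str
    tool_calls: Optional[List[ToolCall]] = None


def extract_predicted_tool_calls(user_input: List[Any]) -> List[ToolCall]:
    """Collect tool calls from every AIMessage, in conversation order."""
    predicted: List[ToolCall] = []
    for message in user_input:
        # Only AI messages carry predicted tool calls; others are skipped.
        if isinstance(message, AIMessage) and message.tool_calls:
            predicted.extend(message.tool_calls)
    return predicted


conversation = [
    HumanMessage("Book a table at Golden Dragon for 8pm"),
    AIMessage("Searching.", [ToolCall("restaurant_search", {"query": "Golden Dragon"})]),
    AIMessage("Booking.", [ToolCall("restaurant_book", {"name": "Golden Dragon", "time": "8pm"})]),
    AIMessage("Your table is booked."),
]
predicted_names = [call.name for call in extract_predicted_tool_calls(conversation)]
```

Messages without tool calls, such as the final confirmation, contribute nothing to the predicted list.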

Key Methods

_multi_turn_ascore

async def _multi_turn_ascore(
    self, sample: MultiTurnSample, callbacks: Callbacks
) -> float

The primary scoring method. It:

  1. Extracts predicted tool calls from all AIMessage objects in sample.user_input
  2. Retrieves reference tool calls from sample.reference_tool_calls
  3. If strict_order is False, sorts both lists using a deterministic key based on tool name and sorted arguments
  4. Checks sequence alignment by comparing tool name lists
  5. For each aligned pair of tool calls with matching names, computes argument accuracy via _get_arg_score
  6. Applies a coverage penalty if the predicted and reference lists differ in length
  7. Returns average_arg_score * sequence_alignment_factor
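
The steps above can be sketched in plain Python, with tool calls represented as (name, args) tuples instead of ToolCall objects. This is a simplified sketch under stated assumptions, not the library's implementation: the names tool_call_accuracy_sketch, sort_key, and arg_score are hypothetical, arguments are compared by exact string match, and the coverage penalty is assumed to amount to dividing the summed argument scores by the number of reference calls.

```python
import json
from typing import Any, Dict, List, Tuple

Call = Tuple[str, Dict[str, Any]]  # (tool name, arguments)


def sort_key(call: Call) -> Tuple[str, str]:
    """Deterministic key: tool name, then arguments serialized with sorted keys."""
    name, args = call
    return (name, json.dumps(args, sort_keys=True, default=str))


def arg_score(preds: Dict[str, Any], refs: Dict[str, Any]) -> float:
    """Fraction of reference arguments matched exactly (string comparison)."""
    if not refs:
        # Assumption: no reference args means a match only if none were predicted.
        return 1.0 if not preds else 0.0
    hits = sum(1.0 for k, v in refs.items() if k in preds and str(preds[k]) == str(v))
    return hits / len(refs)


def tool_call_accuracy_sketch(
    predicted: List[Call], reference: List[Call], strict_order: bool = True
) -> float:
    if not predicted or not reference:
        # Assumption: an empty side yields 0.0 (simplified edge-case handling).
        return 0.0
    if not strict_order:
        predicted = sorted(predicted, key=sort_key)
        reference = sorted(reference, key=sort_key)
    # Sequence alignment: the tool name lists must be identical.
    aligned = [n for n, _ in predicted] == [n for n, _ in reference]
    # Score arguments only for position-wise pairs whose names match.
    total = sum(
        arg_score(p_args, r_args)
        for (p_name, p_args), (r_name, r_args) in zip(predicted, reference)
        if p_name == r_name
    )
    # Assumed coverage penalty: normalize by the number of reference calls.
    average = total / len(reference)
    return average * (1.0 if aligned else 0.0)


reference = [
    ("restaurant_search", {"query": "Golden Dragon"}),
    ("restaurant_book", {"name": "Golden Dragon", "time": "8pm"}),
]
wrong_time = [
    ("restaurant_search", {"query": "Golden Dragon"}),
    ("restaurant_book", {"name": "Golden Dragon", "time": "7pm"}),
]
perfect_score = tool_call_accuracy_sketch(list(reference), reference)
partial_score = tool_call_accuracy_sketch(wrong_time, reference)
swapped_strict = tool_call_accuracy_sketch(list(reversed(reference)), reference)
swapped_flexible = tool_call_accuracy_sketch(
    list(reversed(reference)), reference, strict_order=False
)
```

Under these assumptions, one wrong argument out of two in the second call gives (1.0 + 0.5) / 2 = 0.75, and a swapped call order scores 0.0 in strict mode but 1.0 with strict_order=False.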

_get_arg_score

async def _get_arg_score(
    self, preds: t.Dict[str, t.Any], refs: t.Dict[str, t.Any], callbacks: Callbacks
) -> float

Computes the argument accuracy for a single tool call pair. For each argument key in the reference, it checks whether the same key exists in the prediction and uses arg_comparison_metric to score the match. The result is the average score across all reference argument keys.
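
A minimal sketch of this averaging logic follows, assuming an exact-match comparator in place of arg_comparison_metric. The names arg_score_sketch and exact_match are hypothetical, and the handling of an empty reference dictionary is an assumption rather than documented behavior.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def exact_match(prediction: str, reference: str) -> float:
    """Hypothetical comparator standing in for arg_comparison_metric."""
    return 1.0 if prediction == reference else 0.0


async def arg_score_sketch(
    preds: Dict[str, Any],
    refs: Dict[str, Any],
    compare: Callable[[str, str], Awaitable[float]] = exact_match,
) -> float:
    if not refs:
        # Assumption: with no reference args, score 1.0 only if none were predicted.
        return 1.0 if not preds else 0.0
    total = 0.0
    for key, ref_value in refs.items():
        # A reference key missing from the prediction contributes 0.
        if key in preds:
            total += await compare(str(preds[key]), str(ref_value))
    # Average over reference keys, so extra predicted keys are ignored.
    return total / len(refs)


refs = {"name": "Golden Dragon", "time": "8pm"}
wrong_value = asyncio.run(arg_score_sketch({"name": "Golden Dragon", "time": "7pm"}, refs))
missing_key = asyncio.run(arg_score_sketch({"name": "Golden Dragon"}, refs))
extra_key = asyncio.run(arg_score_sketch({**refs, "party_size": 2}, refs))
```

Because the average runs over reference keys only, a wrong value and a missing key are penalized equally in this sketch, while an extra predicted key costs nothing.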

is_sequence_aligned

def is_sequence_aligned(
    self, pred_sequence: t.List[str], ref_sequence: t.List[str]
) -> bool

Compares tool call name sequences. In strict mode, requires exact equality. In flexible mode, sorts both sequences before comparing.
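
The comparison described above can be sketched as a standalone function. Here strict_order is passed as a parameter rather than read from the metric instance, and the function name is hypothetical.

```python
from typing import List


def is_sequence_aligned_sketch(
    pred_sequence: List[str], ref_sequence: List[str], strict_order: bool = True
) -> bool:
    if strict_order:
        # Strict mode: same names in the same positions.
        return pred_sequence == ref_sequence
    # Flexible mode: same names with the same counts, in any order.
    return sorted(pred_sequence) == sorted(ref_sequence)


ref_names = ["restaurant_search", "restaurant_book"]
swapped_names = ["restaurant_book", "restaurant_search"]
strict_result = is_sequence_aligned_sketch(swapped_names, ref_names)
flexible_result = is_sequence_aligned_sketch(swapped_names, ref_names, strict_order=False)
```

Sorting keeps duplicates, so a tool called twice in the reference must also be called twice in the prediction for the flexible comparison to succeed.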

Usage Example

import asyncio

from ragas.metrics import ToolCallAccuracy
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall, ToolMessage

# Create sample with conversation and reference tool calls
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Book a table at Golden Dragon for 8pm"),
        AIMessage(
            content="Let me search for that restaurant.",
            tool_calls=[ToolCall(name="restaurant_search", args={"query": "Golden Dragon"})]
        ),
        ToolMessage(content="Found: Golden Dragon, 123 Main St"),
        AIMessage(
            content="I'll book a table now.",
            tool_calls=[ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"})]
        ),
        ToolMessage(content="Booking confirmed."),
        AIMessage(content="Your table is booked at Golden Dragon for 8pm.")
    ],
    reference_tool_calls=[
        ToolCall(name="restaurant_search", args={"query": "Golden Dragon"}),
        ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"})
    ]
)


async def main() -> None:
    # Evaluate with strict ordering (default).
    # This is a multi-turn metric, so use multi_turn_ascore and await it.
    metric = ToolCallAccuracy()
    score = await metric.multi_turn_ascore(sample)  # float between 0.0 and 1.0
    print(score)

    # Evaluate with flexible ordering
    metric_flexible = ToolCallAccuracy(strict_order=False)
    score = await metric_flexible.multi_turn_ascore(sample)
    print(score)


# In a notebook with a running event loop, call `await main()` instead.
asyncio.run(main())

Score Interpretation

Score | Meaning
1.0 | All tool calls match in name, order, and arguments
0.0 | Tool call sequence does not align, or no predicted tool calls exist
0.0 < score < 1.0 | Sequence aligns but some arguments are incorrect or there is a length mismatch

Internal Dependencies

  • ragas.metrics.base.MultiTurnMetric -- base class providing the multi-turn metric interface
  • ragas.metrics._string.ExactMatch -- default argument comparison metric
  • ragas.dataset_schema.MultiTurnSample -- input sample schema
  • ragas.messages.AIMessage, ragas.messages.ToolCall -- message and tool call data types
