Implementation:Explodinggradients Ragas Agent Under Test Pattern

Knowledge Sources	Type	Domains	Last Updated
`examples/ragas_examples/agent_evals/agent.py`, `examples/ragas_examples/agent_evals/evals.py`	Pattern Doc (user-defined interface)	Agent Evaluation, LLM Tool Calling, AI Agent Testing	2026-02-10

Overview

This page specifies the interface for user-defined, tool-calling AI agents that are evaluated with the Ragas experiment framework. The user implements an agent class whose solve() method orchestrates a multi-turn LLM conversation with tool execution and returns a dictionary holding the final computed result and, optionally, the path to the exported execution trace.

Description

The Agent Under Test Pattern defines the concrete interface for AI agents evaluated by Ragas. The reference implementation in the repository is the MathToolsAgent class, which demonstrates:

Tool definition: Four arithmetic tools (add, sub, mul, div) defined using the OpenAI function calling schema
Multi-turn conversation: An iterative loop where the LLM plans which tools to call, the agent executes them, and results are fed back into the conversation
Structured trace capture: Every LLM call, tool execution, and result extraction is logged as a TraceEvent dataclass
Message extraction: The full conversation history (system prompt, user message, assistant responses, tool results) is maintained for downstream analysis
Result extraction: The final numeric answer is taken as the last number that a regex finds in the LLM's final text response (the first response that contains no tool calls)
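
The TraceEvent dataclass is used throughout the excerpts on this page but its definition is not shown. The following is a minimal sketch of what such a dataclass could look like, inferred only from how it is constructed in the excerpts (event_type, component, data). The timestamp field and the JSON serialization step are assumptions added for illustration and may not match the repository.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Any, Dict


@dataclass
class TraceEvent:
    """One step of an agent run (hypothetical shape, inferred from usage)."""

    event_type: str       # e.g. "llm_call", "tool_execution", "tool_result"
    component: str        # e.g. "openai_api", "math_tools"
    data: Dict[str, Any]  # event-specific payload
    # Assumption: a creation timestamp; not confirmed by the excerpts.
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())


# Record a tool execution and serialize it the way a log exporter might.
event = TraceEvent(
    event_type="tool_execution",
    component="math_tools",
    data={"tool_name": "add", "args": {"a": 2, "b": 3}},
)
serialized = json.dumps(asdict(event))
```

Keeping the payload in a free-form data dictionary lets every event type share one class, at the cost of no per-event schema checking.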

Usage

To use this pattern, a developer:

Defines an agent class with tool definitions and implementation methods
Implements a solve(problem: str, ...) -> dict method that orchestrates the multi-turn conversation
Instantiates the agent with an LLM client and configuration
Calls the agent's solve() method from a Ragas @experiment()-decorated function, once for each dataset row
Extracts the result and passes it to evaluation metrics

Interface Specification

The expected interface for an agent under evaluation:

from typing import Any, Dict, List, Optional

class AgentUnderTest:
    """Interface that any tool-calling agent must implement for Ragas evaluation."""

    def __init__(
        self,
        client,                          # LLM API client
        model_name: str = "gpt-4o",      # Model identifier
        system_message: str = "...",     # Agent instructions
        logdir: str = "logs",            # Trace log directory
    ):
        self.client = client
        self.model_name = model_name
        self.system_message = system_message
        self.tools = [...]               # OpenAI function calling tool definitions
        self.traces = []                 # Trace event accumulator

    def solve(
        self,
        problem: str,
        max_iterations: int = 10,
        run_id: Optional[str] = None,
    ) -> Dict[str, Any]:
        """
        Solve a problem using iterative LLM planning with tool calls.

        Args:
            problem: The problem description or question.
            max_iterations: Maximum number of LLM conversation turns.
            run_id: Optional unique identifier for this execution.

        Returns:
            Dictionary with at minimum:
                "result": Any            -- The final computed result.
            Optional metadata keys:
                "log_file": str          -- Path to the exported trace log.
        """
        ...
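
Because the harness only depends on solve() returning a dictionary with a "result" key, the interface can be satisfied without any LLM at all. The sketch below is a hypothetical stub (EchoNumberAgent is not part of the repository) that may be useful for exercising an evaluation harness offline. It assumes the problem string is itself a number and mirrors the reference implementation's fallback of returning 0 on failure.

```python
from typing import Any, Dict, Optional


class EchoNumberAgent:
    """Hypothetical stub agent: no LLM, just parses the problem as a float."""

    def __init__(self, client: Any = None, model_name: str = "stub"):
        self.client = client
        self.model_name = model_name
        self.traces: list = []

    def solve(
        self,
        problem: str,
        max_iterations: int = 10,
        run_id: Optional[str] = None,
    ) -> Dict[str, Any]:
        self.traces = []  # reset per-run state, as the reference agent does
        try:
            result = float(problem)
        except ValueError:
            result = 0.0  # same fallback value the reference agent uses
        self.traces.append({"event_type": "result_extraction", "result": result})
        # "log_file" is optional metadata, so it is omitted here.
        return {"result": result}


agent = EchoNumberAgent()
print(agent.solve("14.25"))         # {'result': 14.25}
print(agent.solve("not a number"))  # {'result': 0.0}
```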

Example Implementations

MathToolsAgent (Full Reference Implementation)

Source: examples/ragas_examples/agent_evals/agent.py (lines 42-389)

This agent solves mathematical expressions by decomposing them into atomic arithmetic operations using LLM-directed tool calling.

Tool definitions:

class MathToolsAgent:
    def __init__(
        self,
        client,
        model_name: str = "gpt-4o",
        system_message: str = SYSTEM_MESSAGE,
        logdir: str = "logs",
    ):
        self.client = client
        self.system_message = system_message
        self.model_name = model_name
        self.step_counter = 0
        self.traces = []
        self.logdir = logdir
        os.makedirs(self.logdir, exist_ok=True)

        # Define available tools using OpenAI function calling schema
        self.tools = [
            {
                "type": "function",
                "function": {
                    "name": "add",
                    "description": "Add two numbers together",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "a": {"type": "number", "description": "First number"},
                            "b": {"type": "number", "description": "Second number"},
                        },
                        "required": ["a", "b"],
                    },
                },
            },
            {
                "type": "function",
                "function": {
                    "name": "sub",
                    "description": "Subtract second number from first number",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "a": {"type": "number", "description": "Number to subtract from"},
                            "b": {"type": "number", "description": "Number to subtract"},
                        },
                        "required": ["a", "b"],
                    },
                },
            },
            {
                "type": "function",
                "function": {
                    "name": "mul",
                    "description": "Multiply two numbers together",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "a": {"type": "number", "description": "First number"},
                            "b": {"type": "number", "description": "Second number"},
                        },
                        "required": ["a", "b"],
                    },
                },
            },
            {
                "type": "function",
                "function": {
                    "name": "div",
                    "description": "Divide first number by second number",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "a": {"type": "number", "description": "Number to divide (numerator)"},
                            "b": {"type": "number", "description": "Number to divide by (denominator)"},
                        },
                        "required": ["a", "b"],
                    },
                },
            },
        ]

Tool implementation methods:

    def add(self, a: float, b: float) -> float:
        """Add two numbers"""
        return a + b

    def sub(self, a: float, b: float) -> float:
        """Subtract b from a"""
        return a - b

    def mul(self, a: float, b: float) -> float:
        """Multiply two numbers"""
        return a * b

    def div(self, a: float, b: float) -> float:
        """Divide a by b"""
        if b == 0:
            raise ValueError("Division by zero")
        return a / b

Tool execution dispatch:

    def _execute_tool_call(self, tool_call) -> str:
        """Execute a tool call and return the result."""
        self.traces.append(
            TraceEvent(
                event_type="tool_execution",
                component="math_tools",
                data={
                    "tool_name": tool_call.function.name,
                    "args": json.loads(tool_call.function.arguments),
                },
            )
        )

        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)

        if function_name == "add":
            result = self.add(arguments["a"], arguments["b"])
        elif function_name == "sub":
            result = self.sub(arguments["a"], arguments["b"])
        elif function_name == "mul":
            result = self.mul(arguments["a"], arguments["b"])
        elif function_name == "div":
            result = self.div(arguments["a"], arguments["b"])
        else:
            raise ValueError(f"Unknown function: {function_name}")

        self.traces.append(
            TraceEvent(
                event_type="tool_result",
                component="math_tools",
                data={"result": result},
            )
        )
        return str(result)
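
The if/elif chain above works for four tools but grows by one branch per tool. As an alternative, here is a minimal sketch of a dictionary-based dispatch; TOOL_REGISTRY and dispatch are hypothetical names, and this is not how the reference implementation is written. Tracing is left out to keep the example short.

```python
import json
import operator
from typing import Callable, Dict


def _div(a: float, b: float) -> float:
    """Divide a by b, rejecting a zero denominator like the reference div()."""
    if b == 0:
        raise ValueError("Division by zero")
    return a / b


# Hypothetical registry mapping tool names to two-argument callables.
TOOL_REGISTRY: Dict[str, Callable[[float, float], float]] = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": _div,
}


def dispatch(function_name: str, raw_arguments: str) -> str:
    """Look up a tool by name and call it with JSON-decoded arguments."""
    arguments = json.loads(raw_arguments)
    try:
        tool = TOOL_REGISTRY[function_name]
    except KeyError:
        raise ValueError(f"Unknown function: {function_name}") from None
    return str(tool(arguments["a"], arguments["b"]))


print(dispatch("sub", '{"a": 15, "b": 0.75}'))  # 14.25
```

With this layout, adding a tool means adding one registry entry alongside its schema in self.tools, and the unknown-name error path stays in a single place.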

The solve method (core multi-turn loop):

    def solve(
        self, problem: str, max_iterations: int = 10, run_id: Optional[str] = None
    ) -> Dict[str, Any]:
        """
        Solve a math problem using iterative planning with LLM and atomic tools.

        Returns:
            {"result": float, "log_file": str}
        """
        if run_id is None:
            run_id = (
                f"{datetime.now().strftime('%Y%m%d_%H%M%S')}"
                f"_{hash(problem) % 10000:04d}"
            )

        self.traces = []
        self.execution_history = []
        self.step_counter = 0

        messages = [
            {"role": "system", "content": self.system_message},
            {
                "role": "user",
                "content": f"Solve this mathematical expression step by step: {problem}",
            },
        ]

        iteration = 0
        while iteration < max_iterations:
            iteration += 1
            try:
                self.traces.append(
                    TraceEvent(
                        event_type="llm_call",
                        component="openai_api",
                        data={"model": self.model_name, "messages": messages},
                    )
                )

                response = self.client.chat.completions.create(
                    model=self.model_name,
                    messages=messages,
                    tools=self.tools,
                    tool_choice="auto",
                )

                message = response.choices[0].message
                messages.append(message.model_dump())

                self.traces.append(
                    TraceEvent(
                        event_type="llm_response",
                        component="openai_api",
                        data={
                            "content": message.content,
                            "tool_calls": (
                                [tool.model_dump() for tool in message.tool_calls]
                                if message.tool_calls
                                else []
                            ),
                        },
                    )
                )

                if message.tool_calls:
                    for tool_call in message.tool_calls:
                        result = self._execute_tool_call(tool_call)
                        messages.append(
                            {
                                "role": "tool",
                                "tool_call_id": tool_call.id,
                                "content": result,
                            }
                        )
                else:
                    # No more tool calls -- extract final answer
                    import re
                    numbers = re.findall(r"-?\d+\.?\d*", message.content)
                    if numbers:
                        final_result = float(numbers[-1])
                        self.traces.append(
                            TraceEvent(
                                event_type="result_extraction",
                                component="math_tools",
                                data={"final_result": final_result},
                            )
                        )
                        log_filename = self.export_traces_to_log(
                            run_id, problem, final_result
                        )
                        return {"result": final_result, "log_file": log_filename}
                    else:
                        break
            except Exception as e:
                break

        return {
            "result": 0,
            "log_file": self.export_traces_to_log(run_id, problem, 0.0),
        }
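
The result-extraction step relies on picking the last regex match from the final response. The sketch below isolates that logic in a standalone helper (extract_last_number is a hypothetical name) using the same pattern as solve(), to show which inputs it handles and where it can mislead.

```python
import re
from typing import Optional

NUMBER_PATTERN = r"-?\d+\.?\d*"  # same pattern as in solve()


def extract_last_number(text: str) -> Optional[float]:
    """Return the last number found in text, or None if there is none."""
    numbers = re.findall(NUMBER_PATTERN, text)
    return float(numbers[-1]) if numbers else None


print(extract_last_number("15 - 0.75 = 14.25"))   # 14.25
print(extract_last_number("The result is -3."))   # -3.0
print(extract_last_number("No digits here"))      # None

# Pitfalls: the last number is not always the answer.
print(extract_last_number("The total is 1,234"))  # 234.0 (comma splits the number)
print(extract_last_number("The answer is 14.25, reached in 3 steps."))  # 3.0
```

The last two cases suggest that the system prompt should ask the model to end its reply with the bare answer, since any trailing number would otherwise be taken as the result.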

Evaluation Harness Using MathToolsAgent

Source: examples/ragas_examples/agent_evals/evals.py

from ragas import Dataset, experiment
from ragas.metrics.numeric import numeric_metric
from ragas.metrics.result import MetricResult
from .agent import get_default_agent

math_agent = get_default_agent()


@numeric_metric(name="correctness", allowed_values=(0.0, 1.0))
def correctness_metric(prediction: float, actual: float):
    """Calculate correctness of the prediction."""
    if isinstance(prediction, str) and "ERROR" in prediction:
        return 0.0
    result = 1.0 if abs(prediction - actual) < 1e-5 else 0.0
    return MetricResult(
        value=result, reason=f"Prediction: {prediction}, Actual: {actual}"
    )


@experiment()
async def run_experiment(row):
    question = row["question"]
    expected_answer = row["answer"]

    # Get the agent's prediction via the solve() interface
    prediction = math_agent.solve(question)

    # Calculate the correctness metric
    correctness = correctness_metric.score(
        prediction=prediction.get("result"), actual=expected_answer
    )

    return {
        "question": question,
        "expected_answer": expected_answer,
        "prediction": prediction.get("result"),
        "log_file": prediction.get("log_file"),
        "correctness": correctness.value,
    }
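
The correctness metric compares prediction and ground truth with an absolute tolerance of 1e-5. The following plain-Python sketch reproduces only that comparison (is_correct is a hypothetical helper and does not use the Ragas decorators) to show how much rounding the tolerance permits.

```python
def is_correct(prediction: float, actual: float, tol: float = 1e-5) -> float:
    """Return 1.0 if prediction is within tol of actual, else 0.0."""
    return 1.0 if abs(prediction - actual) < tol else 0.0


actual = 20.166666666666664  # expected answer for "2 + 3 * 4 - 5 / 6 + 7"

print(is_correct(20.16667, actual))  # 1.0, difference is about 3.3e-6
print(is_correct(20.1667, actual))   # 0.0, difference is about 3.3e-5

# The agent's fallback result of 0 scores as correct when the answer is 0.
print(is_correct(0, 0.0))            # 1.0
```

A model that rounds its final answer to four decimal places can therefore fail this metric, and a dataset row whose true answer is 0 cannot distinguish a failed run from a correct one.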

Dataset construction:

def load_dataset():
    dataset = Dataset(name="test_dataset", backend="local/csv", root_dir=".")

    math_problems = [
        {"question": "15 - 3 / 4", "answer": 14.25},
        {"question": "(2 + 3) * (6 - 2)", "answer": 20.0},
        {"question": "100 / 5 + 3 * 2", "answer": 26.0},
        {"question": "((2 * 3) + (4 * 5)) * ((6 - 2) / (8 / 4))", "answer": 52.0},
        {"question": "2 + 3 * 4 - 5 / 6 + 7", "answer": 20.166666666666664},
        {"question": "(10 / 2) + (20 / 4) + (30 / 6) + (40 / 8)", "answer": 20.0},
        {"question": "1/3 + 1/3 + 1/3", "answer": 1.0},
    ]

    for row in math_problems:
        dataset.append(row)

    dataset.save()
    return dataset
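
Since the metric trusts the "answer" column, it can help to verify the ground truth independently of the agent. The sketch below is one possible sanity check, not part of the repository: a small evaluator (the hypothetical evaluate function) that walks the Python AST and supports only the four arithmetic operators, so it avoids calling eval() on dataset strings.

```python
import ast
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str) -> float:
    """Evaluate +, -, *, / over numeric literals without using eval()."""

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"Unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval").body)


# Same rows as in load_dataset() above.
math_problems = [
    {"question": "15 - 3 / 4", "answer": 14.25},
    {"question": "(2 + 3) * (6 - 2)", "answer": 20.0},
    {"question": "100 / 5 + 3 * 2", "answer": 26.0},
    {"question": "((2 * 3) + (4 * 5)) * ((6 - 2) / (8 / 4))", "answer": 52.0},
    {"question": "2 + 3 * 4 - 5 / 6 + 7", "answer": 20.166666666666664},
    {"question": "(10 / 2) + (20 / 4) + (30 / 6) + (40 / 8)", "answer": 20.0},
    {"question": "1/3 + 1/3 + 1/3", "answer": 1.0},
]

# Use the metric's own tolerance so the check matches the evaluation.
mismatches = [
    row for row in math_problems
    if abs(evaluate(row["question"]) - row["answer"]) >= 1e-5
]
print(mismatches)  # []
```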

Key Observations

Iterative conversation pattern: The solve() method implements a while loop bounded by max_iterations that alternates between LLM inference and tool execution. This loop structure is common to tool-calling agents and carries over to other tool sets.
Trace as first-class output: Every interaction (LLM calls, tool executions, results) is recorded as a TraceEvent dataclass and exported to JSON log files. This enables post-hoc analysis beyond simple pass/fail scoring.
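Tool dispatch via name matching: The _execute_tool_call method maps function names from the LLM's tool call requests to local Python methods through an if/elif chain. Adding a tool requires a new schema entry in self.tools, a new method, and a new branch in that chain; unknown names raise a ValueError.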
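Graceful termination: The agent has four exit paths: (1) the LLM produces a final text answer containing a number, which is returned as the result; (2) the final text contains no number; (3) the maximum iteration count is reached; (4) an exception occurs. Paths 2 to 4 fall through to the fallback return, so every path yields a valid result dictionary. The fallback sets "result" to 0 and the except block discards the exception without recording it, so a failed run cannot be told apart from a genuine answer of 0 by the result value alone.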
System message as agent prompt: The SYSTEM_MESSAGE constant defines the agent's behavior, tool awareness, and reasoning instructions. Changing this string changes the agent's strategy without modifying the interface.

Related Pages

Principle:Explodinggradients_Ragas_Agent_Definition_Interface
