Implementation:Explodinggradients Ragas Agent Under Test Pattern
| Knowledge Sources | Type | Domains | Last Updated |
|---|---|---|---|
| examples/ragas_examples/agent_evals/agent.py, examples/ragas_examples/agent_evals/evals.py | Pattern Doc (user-defined interface) | Agent Evaluation, LLM Tool Calling, AI Agent Testing | 2026-02-10 |
Overview
Interface specification for user-defined AI agents that use tool calling capabilities and will be evaluated using the Ragas experiment framework. This pattern describes how a user implements an agent class with a solve() method that orchestrates multi-turn LLM conversations with tool execution, returning the final computed result along with a full execution trace.
Description
The Agent Under Test Pattern defines the concrete interface for AI agents evaluated by Ragas. The reference implementation in the repository is the MathToolsAgent class, which demonstrates:
- Tool definition: Four arithmetic tools (`add`, `sub`, `mul`, `div`) defined using the OpenAI function calling schema
- Multi-turn conversation: An iterative loop where the LLM plans which tools to call, the agent executes them, and results are fed back into the conversation
- Structured trace capture: Every LLM call, tool execution, and result extraction is logged as a `TraceEvent` dataclass
- Message extraction: The full conversation history (system prompt, user message, assistant responses, tool results) is maintained for downstream analysis
- Result extraction: The final numeric answer is parsed from the LLM's last text response using regex
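The `TraceEvent` dataclass is referenced throughout this page but its definition is not reproduced here. A minimal sketch, assuming only the three fields that appear in the excerpts (`event_type`, `component`, `data`); the `timestamp` field and the `to_dict` helper are hypothetical additions for illustration and may not match the repository's definition:

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Any, Dict


@dataclass
class TraceEvent:
    """One step in an agent run (sketch; only the first three fields come from this page)."""

    event_type: str       # e.g. "llm_call", "tool_execution", "tool_result"
    component: str        # e.g. "openai_api", "math_tools"
    data: Dict[str, Any]  # event-specific payload
    # Hypothetical field: creation time as an ISO 8601 string.
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())

    def to_dict(self) -> Dict[str, Any]:
        """Hypothetical helper: convert the event to a plain dict for JSON export."""
        return asdict(self)


traces = [
    TraceEvent("tool_execution", "math_tools", {"tool_name": "add", "args": {"a": 2, "b": 3}}),
    TraceEvent("tool_result", "math_tools", {"result": 5}),
]
```

Keeping the payload in a free-form `data` dict lets each event type carry different keys without changing the dataclass.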
Usage
To use this pattern, a developer:
- Defines an agent class with tool definitions and implementation methods
- Implements a `solve(problem: str, ...) -> dict` method that orchestrates the multi-turn conversation
- Instantiates the agent with an LLM client and configuration
- Passes the agent to a Ragas `@experiment()`-decorated function that calls `solve()` for each dataset row
- Extracts the result and passes it to evaluation metrics
Interface Specification
The expected interface for an agent under evaluation:
from typing import Any, Dict, List, Optional


class AgentUnderTest:
    """Interface that any tool-calling agent must implement for Ragas evaluation."""

    def __init__(
        self,
        client,                       # LLM API client
        model_name: str = "gpt-4o",   # Model identifier
        system_message: str = "...",  # Agent instructions
        logdir: str = "logs",         # Trace log directory
    ):
        self.client = client
        self.model_name = model_name
        self.system_message = system_message
        self.tools = [...]  # OpenAI function calling tool definitions
        self.traces = []    # Trace event accumulator

    def solve(
        self,
        problem: str,
        max_iterations: int = 10,
        run_id: Optional[str] = None,
    ) -> Dict[str, Any]:
        """
        Solve a problem using iterative LLM planning with tool calls.

        Args:
            problem: The problem description or question.
            max_iterations: Maximum number of LLM conversation turns.
            run_id: Optional unique identifier for this execution.

        Returns:
            Dictionary with at minimum:
                "result": Any -- The final computed result.
            Optional metadata keys:
                "log_file": str -- Path to the exported trace log.
        """
        ...
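Because the interface is user-defined rather than enforced by a base class, a harness can be exercised with any object that exposes a compatible `solve()` method. The sketch below is one way to express that; `SupportsSolve` and `FixedAnswerAgent` are hypothetical names introduced here, not part of Ragas or the reference example:

```python
from typing import Any, Dict, Optional, Protocol, runtime_checkable


@runtime_checkable
class SupportsSolve(Protocol):
    """Hypothetical structural type: anything with a solve() method."""

    def solve(
        self, problem: str, max_iterations: int = 10, run_id: Optional[str] = None
    ) -> Dict[str, Any]: ...


class FixedAnswerAgent:
    """Hypothetical stand-in agent that makes no LLM calls; handy for testing a harness."""

    def __init__(self, answers: Dict[str, float]):
        self.answers = answers

    def solve(
        self, problem: str, max_iterations: int = 10, run_id: Optional[str] = None
    ) -> Dict[str, Any]:
        # Return the minimum contract: a dict with a "result" key.
        return {"result": self.answers.get(problem, 0.0)}


agent = FixedAnswerAgent({"2 + 3": 5.0})
output = agent.solve("2 + 3")
```

Note that `isinstance(agent, SupportsSolve)` only checks that a `solve` attribute exists; it does not validate the signature or the returned keys.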
Example Implementations
MathToolsAgent (Full Reference Implementation)
Source: examples/ragas_examples/agent_evals/agent.py (lines 42-389)
This agent solves mathematical expressions by decomposing them into atomic arithmetic operations using LLM-directed tool calling.
Tool definitions:
class MathToolsAgent:
    def __init__(
        self,
        client,
        model_name: str = "gpt-4o",
        system_message: str = SYSTEM_MESSAGE,
        logdir: str = "logs",
    ):
        self.client = client
        self.system_message = system_message
        self.model_name = model_name
        self.step_counter = 0
        self.traces = []
        self.logdir = logdir
        os.makedirs(self.logdir, exist_ok=True)

        # Define available tools using OpenAI function calling schema
        self.tools = [
            {
                "type": "function",
                "function": {
                    "name": "add",
                    "description": "Add two numbers together",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "a": {"type": "number", "description": "First number"},
                            "b": {"type": "number", "description": "Second number"},
                        },
                        "required": ["a", "b"],
                    },
                },
            },
            {
                "type": "function",
                "function": {
                    "name": "sub",
                    "description": "Subtract second number from first number",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "a": {"type": "number", "description": "Number to subtract from"},
                            "b": {"type": "number", "description": "Number to subtract"},
                        },
                        "required": ["a", "b"],
                    },
                },
            },
            {
                "type": "function",
                "function": {
                    "name": "mul",
                    "description": "Multiply two numbers together",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "a": {"type": "number", "description": "First number"},
                            "b": {"type": "number", "description": "Second number"},
                        },
                        "required": ["a", "b"],
                    },
                },
            },
            {
                "type": "function",
                "function": {
                    "name": "div",
                    "description": "Divide first number by second number",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "a": {"type": "number", "description": "Number to divide (numerator)"},
                            "b": {"type": "number", "description": "Number to divide by (denominator)"},
                        },
                        "required": ["a", "b"],
                    },
                },
            },
        ]
Tool implementation methods:
    def add(self, a: float, b: float) -> float:
        """Add two numbers"""
        return a + b

    def sub(self, a: float, b: float) -> float:
        """Subtract b from a"""
        return a - b

    def mul(self, a: float, b: float) -> float:
        """Multiply two numbers"""
        return a * b

    def div(self, a: float, b: float) -> float:
        """Divide a by b"""
        if b == 0:
            raise ValueError("Division by zero")
        return a / b
Tool execution dispatch:
    def _execute_tool_call(self, tool_call) -> str:
        """Execute a tool call and return the result."""
        self.traces.append(
            TraceEvent(
                event_type="tool_execution",
                component="math_tools",
                data={
                    "tool_name": tool_call.function.name,
                    "args": json.loads(tool_call.function.arguments),
                },
            )
        )

        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)

        if function_name == "add":
            result = self.add(arguments["a"], arguments["b"])
        elif function_name == "sub":
            result = self.sub(arguments["a"], arguments["b"])
        elif function_name == "mul":
            result = self.mul(arguments["a"], arguments["b"])
        elif function_name == "div":
            result = self.div(arguments["a"], arguments["b"])
        else:
            raise ValueError(f"Unknown function: {function_name}")

        self.traces.append(
            TraceEvent(
                event_type="tool_result",
                component="math_tools",
                data={"result": result},
            )
        )
        return str(result)
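As the number of tools grows, the if/elif chain can be replaced by a lookup table. The following is a sketch of that alternative, not the reference implementation; `TOOL_REGISTRY` and `dispatch` are hypothetical names, and the functions are written at module level so the example is self-contained:

```python
import json
from typing import Callable, Dict


def div(a: float, b: float) -> float:
    """Divide a by b, rejecting a zero denominator as the reference agent does."""
    if b == 0:
        raise ValueError("Division by zero")
    return a / b


# Map tool names to callables; adding a tool means adding one entry.
TOOL_REGISTRY: Dict[str, Callable[[float, float], float]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": div,
}


def dispatch(function_name: str, raw_arguments: str) -> str:
    """Look up a tool by name, parse its JSON arguments once, and return the result as text."""
    try:
        tool = TOOL_REGISTRY[function_name]
    except KeyError:
        raise ValueError(f"Unknown function: {function_name}") from None
    arguments = json.loads(raw_arguments)
    return str(tool(arguments["a"], arguments["b"]))
```

For example, `dispatch("div", '{"a": 3, "b": 4}')` returns the string `"0.75"`, which is the form the `tool` message expects as its `content`.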
The solve method (core multi-turn loop):
    def solve(
        self, problem: str, max_iterations: int = 10, run_id: Optional[str] = None
    ) -> Dict[str, Any]:
        """
        Solve a math problem using iterative planning with LLM and atomic tools.

        Returns:
            {"result": float, "log_file": str}
        """
        if run_id is None:
            run_id = (
                f"{datetime.now().strftime('%Y%m%d_%H%M%S')}"
                f"_{hash(problem) % 10000:04d}"
            )

        self.traces = []
        self.execution_history = []
        self.step_counter = 0

        messages = [
            {"role": "system", "content": self.system_message},
            {
                "role": "user",
                "content": f"Solve this mathematical expression step by step: {problem}",
            },
        ]

        iteration = 0
        while iteration < max_iterations:
            iteration += 1
            try:
                self.traces.append(
                    TraceEvent(
                        event_type="llm_call",
                        component="openai_api",
                        data={"model": self.model_name, "messages": messages},
                    )
                )
                response = self.client.chat.completions.create(
                    model=self.model_name,
                    messages=messages,
                    tools=self.tools,
                    tool_choice="auto",
                )
                message = response.choices[0].message
                messages.append(message.model_dump())

                self.traces.append(
                    TraceEvent(
                        event_type="llm_response",
                        component="openai_api",
                        data={
                            "content": message.content,
                            "tool_calls": (
                                [tool.model_dump() for tool in message.tool_calls]
                                if message.tool_calls
                                else []
                            ),
                        },
                    )
                )

                if message.tool_calls:
                    for tool_call in message.tool_calls:
                        result = self._execute_tool_call(tool_call)
                        messages.append(
                            {
                                "role": "tool",
                                "tool_call_id": tool_call.id,
                                "content": result,
                            }
                        )
                else:
                    # No more tool calls -- extract final answer
                    import re

                    numbers = re.findall(r"-?\d+\.?\d*", message.content)
                    if numbers:
                        final_result = float(numbers[-1])
                        self.traces.append(
                            TraceEvent(
                                event_type="result_extraction",
                                component="math_tools",
                                data={"final_result": final_result},
                            )
                        )
                        log_filename = self.export_traces_to_log(
                            run_id, problem, final_result
                        )
                        return {"result": final_result, "log_file": log_filename}
                    else:
                        break
            except Exception as e:
                break

        return {
            "result": 0,
            "log_file": self.export_traces_to_log(run_id, problem, 0.0),
        }
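The final answer is taken as the last number-like token in the LLM's closing message, which is worth testing in isolation. The sketch below reuses the same regular expression as `solve()`; the helper name `extract_last_number` and the sample strings are hypothetical and chosen only to illustrate the behavior:

```python
import re
from typing import Optional

NUMBER_PATTERN = r"-?\d+\.?\d*"  # same pattern as used in solve()


def extract_last_number(text: str) -> Optional[float]:
    """Return the last number-like token in text, or None if there is none."""
    numbers = re.findall(NUMBER_PATTERN, text)
    return float(numbers[-1]) if numbers else None


clean = extract_last_number("15 - 0.75 = 14.25, so the answer is 14.25.")
with_separator = extract_last_number("The total is 1,000")
no_number = extract_last_number("I could not solve this.")
```

Here `clean` is `14.25` and `no_number` is `None`, but `with_separator` is `0.0`: the pattern does not understand thousands separators, so `"1,000"` is split into `"1"` and `"000"`. A system message that asks the model to end with a plain number reduces the chance of this kind of mis-parse.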
Evaluation Harness Using MathToolsAgent
Source: examples/ragas_examples/agent_evals/evals.py
from ragas import Dataset, experiment
from ragas.metrics.numeric import numeric_metric
from ragas.metrics.result import MetricResult

from .agent import get_default_agent

math_agent = get_default_agent()


@numeric_metric(name="correctness", allowed_values=(0.0, 1.0))
def correctness_metric(prediction: float, actual: float):
    """Calculate correctness of the prediction."""
    if isinstance(prediction, str) and "ERROR" in prediction:
        return 0.0
    result = 1.0 if abs(prediction - actual) < 1e-5 else 0.0
    return MetricResult(
        value=result, reason=f"Prediction: {prediction}, Actual: {actual}"
    )


@experiment()
async def run_experiment(row):
    question = row["question"]
    expected_answer = row["answer"]

    # Get the agent's prediction via the solve() interface
    prediction = math_agent.solve(question)

    # Calculate the correctness metric
    correctness = correctness_metric.score(
        prediction=prediction.get("result"), actual=expected_answer
    )

    return {
        "question": question,
        "expected_answer": expected_answer,
        "prediction": prediction.get("result"),
        "log_file": prediction.get("log_file"),
        "correctness": correctness.value,
    }
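The scoring rule itself is a tolerance comparison and can be checked without Ragas or an LLM. The following plain-Python sketch mirrors the `1e-5` threshold used above; `score_correctness` and the sample rows are hypothetical, and the non-numeric check is broader than the `"ERROR"` substring test in the harness:

```python
def score_correctness(prediction, actual: float, tolerance: float = 1e-5) -> float:
    """Return 1.0 if prediction is within tolerance of actual, else 0.0."""
    # Non-numeric predictions (for example an error string) count as incorrect.
    if isinstance(prediction, bool) or not isinstance(prediction, (int, float)):
        return 0.0
    return 1.0 if abs(prediction - actual) < tolerance else 0.0


rows = [
    {"prediction": 14.25, "actual": 14.25},
    {"prediction": 20.1666667, "actual": 20.166666666666664},  # rounded by the model
    {"prediction": 0, "actual": 26.0},                          # fallback result from solve()
    {"prediction": "ERROR: timeout", "actual": 1.0},
]
scores = [score_correctness(r["prediction"], r["actual"]) for r in rows]
accuracy = sum(scores) / len(scores)
```

An absolute tolerance suits this dataset because all expected answers are of similar magnitude; for answers spanning many orders of magnitude, a relative tolerance such as `math.isclose` may be a better fit.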
Dataset construction:
def load_dataset():
    dataset = Dataset(name="test_dataset", backend="local/csv", root_dir=".")

    math_problems = [
        {"question": "15 - 3 / 4", "answer": 14.25},
        {"question": "(2 + 3) * (6 - 2)", "answer": 20.0},
        {"question": "100 / 5 + 3 * 2", "answer": 26.0},
        {"question": "((2 * 3) + (4 * 5)) * ((6 - 2) / (8 / 4))", "answer": 52.0},
        {"question": "2 + 3 * 4 - 5 / 6 + 7", "answer": 20.166666666666664},
        {"question": "(10 / 2) + (20 / 4) + (30 / 6) + (40 / 8)", "answer": 20.0},
        {"question": "1/3 + 1/3 + 1/3", "answer": 1.0},
    ]

    for row in math_problems:
        dataset.append(row)

    dataset.save()
    return dataset
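Since the agent is scored against the hard-coded `answer` values, it can be useful to verify the ground truth independently before running an experiment. The sketch below is an optional sanity check, not part of the example repository; `evaluate` is a hypothetical helper that walks the Python AST and supports only the four operators the agent's tools cover:

```python
import ast
import operator

# Only the operators that correspond to the agent's four tools.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str) -> float:
    """Evaluate +, -, *, / arithmetic without calling eval()."""

    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"Unsupported expression element: {ast.dump(node)}")

    return walk(ast.parse(expression, mode="eval").body)


math_problems = [
    {"question": "15 - 3 / 4", "answer": 14.25},
    {"question": "(2 + 3) * (6 - 2)", "answer": 20.0},
    {"question": "100 / 5 + 3 * 2", "answer": 26.0},
    {"question": "((2 * 3) + (4 * 5)) * ((6 - 2) / (8 / 4))", "answer": 52.0},
    {"question": "2 + 3 * 4 - 5 / 6 + 7", "answer": 20.166666666666664},
    {"question": "(10 / 2) + (20 / 4) + (30 / 6) + (40 / 8)", "answer": 20.0},
    {"question": "1/3 + 1/3 + 1/3", "answer": 1.0},
]

# Rows whose stored answer disagrees with direct evaluation (same 1e-5 tolerance).
mismatches = [
    row
    for row in math_problems
    if abs(evaluate(row["question"]) - row["answer"]) >= 1e-5
]
```

An empty `mismatches` list indicates that every stored answer agrees with direct evaluation within the metric's tolerance.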
Key Observations
- Iterative conversation pattern: The `solve()` method implements a while loop bounded by `max_iterations` that alternates between LLM inference and tool execution. This is the standard pattern for any tool-calling agent.
- Trace as first-class output: Every interaction (LLM calls, tool executions, results) is recorded as a `TraceEvent` dataclass and exported to JSON log files. This enables post-hoc analysis beyond simple pass/fail scoring.
- Tool dispatch via name matching: The `_execute_tool_call` method maps function names from the LLM's tool call requests to local Python methods. This dispatch pattern is extensible to any number of tools.
- Graceful termination: The agent handles three exit conditions: (1) the LLM produces a final text answer, (2) the maximum iteration count is reached, (3) an exception occurs. All three paths return a valid result dictionary. Only the first returns a parsed value; the other two, and a final answer that contains no number, fall through to a fallback dictionary with `"result": 0`, which the result value alone cannot distinguish from a genuine answer of 0.
- System message as agent prompt: The `SYSTEM_MESSAGE` constant defines the agent's behavior, tool awareness, and reasoning instructions. Changing this string changes the agent's strategy without modifying the interface.