Principle:Explodinggradients Ragas Agent Definition Interface
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
examples/ragas_examples/agent_evals/agent.py, examples/ragas_examples/agent_evals/evals.py |
Agent Evaluation, LLM Tool Calling, AI Agent Testing | 2026-02-10 |
Overview
Description
The Agent Definition Interface principle establishes a standard contract for AI agents under evaluation that use tool calling capabilities. Unlike simple prompt-in/text-out LLM evaluation, agent evaluation must capture the full interaction trace -- including the sequence of tool calls, intermediate results, multi-turn conversation messages, and the final response. This principle mandates that agents expose their behavior through a structured method (e.g., solve()) that returns both the computed result and the complete execution trace, enabling metrics to assess not only the correctness of the final answer but also the quality of the agent's reasoning and tool-use strategy.
Usage
When building an AI agent intended for evaluation with Ragas, implementers define a class that:
- Declares a set of callable tools with structured parameter schemas (following the OpenAI function calling format)
- Implements a primary method such as
solve(problem: str) -> dictthat:- Initiates a multi-turn conversation with the LLM
- Iteratively executes tool calls as directed by the LLM
- Captures the full message history and tool execution trace
- Returns a dictionary containing the final result and trace metadata
- Supports configurable parameters (model name, system prompt, maximum iterations) without changing the interface contract
The evaluation harness calls this method for each row in the evaluation dataset, extracts the result, and passes it to numerical or discrete metrics for scoring.
Theoretical Basis
Why Agent Evaluation Differs from LLM Evaluation
Standard LLM evaluation treats the model as a stateless function: given an input prompt, produce an output text. Agent evaluation is fundamentally more complex because:
- Multi-turn interaction: Agents engage in iterative dialogue with the LLM, where each turn may involve zero, one, or many tool calls. The evaluation must account for the entire conversation, not just the final message.
- Tool-mediated computation: Agents perform actions through tools (functions, APIs, database queries). Evaluating the agent requires understanding which tools were called, with what arguments, and whether the results were used correctly.
- Branching and recovery: Agents may encounter errors (e.g., division by zero, invalid tool arguments) and must recover gracefully. The evaluation interface must allow these failure modes to be captured.
- Iteration bounds: Agents need a maximum iteration limit to prevent infinite loops. The interface must surface whether the agent completed successfully or hit its iteration ceiling.
Structured Trace Capture
The principle requires agents to produce structured traces (not just a final answer) because downstream metrics may need to assess:
- Tool selection accuracy: Did the agent choose the right tools for the task?
- Argument correctness: Were the tool arguments computed correctly?
- Step efficiency: Did the agent solve the problem in a reasonable number of steps?
- Reasoning quality: Did the LLM's intermediate reasoning (visible in message content) reflect sound logic?
By standardizing the trace format through dataclasses like TraceEvent and ToolResult, the principle enables metrics to operate on structured data rather than parsing free-text logs.
Decoupling Agent Logic from Evaluation
Just as the RAG System Interface principle decouples retrieval implementation from evaluation, the Agent Definition Interface decouples the agent's internal decision-making from the evaluation framework. The evaluation harness only depends on:
- Calling
solve(problem)(or equivalent) - Reading
result["result"]for the final answer - Optionally reading trace metadata for deeper analysis
This means the same evaluation pipeline can test agents with different tool sets, different LLM backends, or different prompting strategies -- as long as they expose the same interface.
Practical Guide
Defining Tools
Tools are defined using the OpenAI function calling schema format. Each tool specifies its name, description, parameter types, and required parameters:
tools = [
{
"type": "function",
"function": {
"name": "tool_name",
"description": "What this tool does",
"parameters": {
"type": "object",
"properties": {
"param1": {"type": "number", "description": "First parameter"},
"param2": {"type": "string", "description": "Second parameter"},
},
"required": ["param1", "param2"],
},
},
},
]
Implementing the Solve Method
The agent's primary method follows a loop pattern:
- Send the problem to the LLM with the system prompt and tool definitions
- If the LLM responds with tool calls, execute them and feed results back
- Repeat until the LLM produces a final text response (no tool calls) or the iteration limit is reached
- Return a dictionary with the result and metadata
Wiring into Evaluation
from ragas import experiment
from ragas.metrics.numeric import numeric_metric
@numeric_metric(name="correctness", allowed_values=(0.0, 1.0))
def correctness_metric(prediction: float, actual: float):
result = 1.0 if abs(prediction - actual) < 1e-5 else 0.0
return MetricResult(value=result, reason=f"Predicted: {prediction}, Actual: {actual}")
@experiment()
async def run_experiment(row):
prediction = agent.solve(row["question"])
correctness = correctness_metric.score(
prediction=prediction.get("result"),
actual=row["answer"],
)
return {
"question": row["question"],
"expected_answer": row["answer"],
"prediction": prediction.get("result"),
"correctness": correctness.value,
}