Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Explodinggradients Ragas Agent Definition Interface

From Leeroopedia
Revision as of 17:37, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Explodinggradients_Ragas_Agent_Definition_Interface.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources Domains Last Updated
examples/ragas_examples/agent_evals/agent.py, examples/ragas_examples/agent_evals/evals.py Agent Evaluation, LLM Tool Calling, AI Agent Testing 2026-02-10

Overview

Description

The Agent Definition Interface principle establishes a standard contract for AI agents under evaluation that use tool calling capabilities. Unlike simple prompt-in/text-out LLM evaluation, agent evaluation must capture the full interaction trace -- including the sequence of tool calls, intermediate results, multi-turn conversation messages, and the final response. This principle mandates that agents expose their behavior through a structured method (e.g., solve()) that returns both the computed result and the complete execution trace, enabling metrics to assess not only the correctness of the final answer but also the quality of the agent's reasoning and tool-use strategy.

Usage

When building an AI agent intended for evaluation with Ragas, implementers define a class that:

  • Declares a set of callable tools with structured parameter schemas (following the OpenAI function calling format)
  • Implements a primary method such as solve(problem: str) -> dict that:
    • Initiates a multi-turn conversation with the LLM
    • Iteratively executes tool calls as directed by the LLM
    • Captures the full message history and tool execution trace
    • Returns a dictionary containing the final result and trace metadata
  • Supports configurable parameters (model name, system prompt, maximum iterations) without changing the interface contract

The evaluation harness calls this method for each row in the evaluation dataset, extracts the result, and passes it to numerical or discrete metrics for scoring.

Theoretical Basis

Why Agent Evaluation Differs from LLM Evaluation

Standard LLM evaluation treats the model as a stateless function: given an input prompt, produce an output text. Agent evaluation is fundamentally more complex because:

  • Multi-turn interaction: Agents engage in iterative dialogue with the LLM, where each turn may involve zero, one, or many tool calls. The evaluation must account for the entire conversation, not just the final message.
  • Tool-mediated computation: Agents perform actions through tools (functions, APIs, database queries). Evaluating the agent requires understanding which tools were called, with what arguments, and whether the results were used correctly.
  • Branching and recovery: Agents may encounter errors (e.g., division by zero, invalid tool arguments) and must recover gracefully. The evaluation interface must allow these failure modes to be captured.
  • Iteration bounds: Agents need a maximum iteration limit to prevent infinite loops. The interface must surface whether the agent completed successfully or hit its iteration ceiling.

Structured Trace Capture

The principle requires agents to produce structured traces (not just a final answer) because downstream metrics may need to assess:

  • Tool selection accuracy: Did the agent choose the right tools for the task?
  • Argument correctness: Were the tool arguments computed correctly?
  • Step efficiency: Did the agent solve the problem in a reasonable number of steps?
  • Reasoning quality: Did the LLM's intermediate reasoning (visible in message content) reflect sound logic?

By standardizing the trace format through dataclasses like TraceEvent and ToolResult, the principle enables metrics to operate on structured data rather than parsing free-text logs.

Decoupling Agent Logic from Evaluation

Just as the RAG System Interface principle decouples retrieval implementation from evaluation, the Agent Definition Interface decouples the agent's internal decision-making from the evaluation framework. The evaluation harness only depends on:

  • Calling solve(problem) (or equivalent)
  • Reading result["result"] for the final answer
  • Optionally reading trace metadata for deeper analysis

This means the same evaluation pipeline can test agents with different tool sets, different LLM backends, or different prompting strategies -- as long as they expose the same interface.

Practical Guide

Defining Tools

Tools are defined using the OpenAI function calling schema format. Each tool specifies its name, description, parameter types, and required parameters:

tools = [
    {
        "type": "function",
        "function": {
            "name": "tool_name",
            "description": "What this tool does",
            "parameters": {
                "type": "object",
                "properties": {
                    "param1": {"type": "number", "description": "First parameter"},
                    "param2": {"type": "string", "description": "Second parameter"},
                },
                "required": ["param1", "param2"],
            },
        },
    },
]

Implementing the Solve Method

The agent's primary method follows a loop pattern:

  1. Send the problem to the LLM with the system prompt and tool definitions
  2. If the LLM responds with tool calls, execute them and feed results back
  3. Repeat until the LLM produces a final text response (no tool calls) or the iteration limit is reached
  4. Return a dictionary with the result and metadata

Wiring into Evaluation

from ragas import experiment
from ragas.metrics.numeric import numeric_metric

@numeric_metric(name="correctness", allowed_values=(0.0, 1.0))
def correctness_metric(prediction: float, actual: float):
    result = 1.0 if abs(prediction - actual) < 1e-5 else 0.0
    return MetricResult(value=result, reason=f"Predicted: {prediction}, Actual: {actual}")

@experiment()
async def run_experiment(row):
    prediction = agent.solve(row["question"])
    correctness = correctness_metric.score(
        prediction=prediction.get("result"),
        actual=row["answer"],
    )
    return {
        "question": row["question"],
        "expected_answer": row["answer"],
        "prediction": prediction.get("result"),
        "correctness": correctness.value,
    }

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment