Implementation:Explodinggradients Ragas AgentGoalAccuracy Metric

From Leeroopedia


AgentGoalAccuracy Metric

AgentGoalAccuracyWithReference and AgentGoalAccuracyWithoutReference are multi-turn evaluation metrics in the Ragas library that use LLM-based judgment to determine whether an AI agent achieved its intended goal during a conversation. They return a binary score (0.0 or 1.0).

Source Location

  • File: src/ragas/metrics/_goal_accuracy.py
    • AgentGoalAccuracyWithReference: lines 103-144
    • AgentGoalAccuracyWithoutReference: lines 147-185
  • Repository: explodinggradients/ragas

Import

from ragas.metrics import AgentGoalAccuracyWithReference, AgentGoalAccuracyWithoutReference

Class Definitions

AgentGoalAccuracyWithReference

@dataclass
class AgentGoalAccuracyWithReference(MetricWithLLM, MultiTurnMetric):
    name: str = "agent_goal_accuracy"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "user_input",
                "reference",
            }
        }
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.BINARY
    workflow_prompt: PydanticPrompt = field(
        default_factory=lambda: InferGoalOutcomePrompt()
    )
    compare_outcome_prompt: PydanticPrompt = field(
        default_factory=lambda: CompareOutcomePrompt()
    )
    max_retries: int = 1

AgentGoalAccuracyWithoutReference

@dataclass
class AgentGoalAccuracyWithoutReference(MetricWithLLM, MultiTurnMetric):
    name: str = "agent_goal_accuracy"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "user_input",
            }
        }
    )
    workflow_prompt: PydanticPrompt = field(
        default_factory=lambda: InferGoalOutcomePrompt()
    )
    compare_outcome_prompt: PydanticPrompt = field(
        default_factory=lambda: CompareOutcomePrompt()
    )
    max_retries: int = 1

Required Columns

Variant | Required Columns | Description
AgentGoalAccuracyWithReference | user_input, reference | Requires both the conversation messages and a reference string describing the expected outcome
AgentGoalAccuracyWithoutReference | user_input | Requires only the conversation messages; the goal is inferred from the conversation itself
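
As a rough illustration of the column requirements, the sketch below checks a sample against the required set for each variant before scoring. It uses only the standard library; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here, not part of Ragas, and a plain dict stands in for a MultiTurnSample.

```python
from typing import Dict, Set

# Required columns per variant, copied from the table above.
REQUIRED_COLUMNS: Dict[str, Set[str]] = {
    "AgentGoalAccuracyWithReference": {"user_input", "reference"},
    "AgentGoalAccuracyWithoutReference": {"user_input"},
}


def missing_columns(variant: str, sample: dict) -> Set[str]:
    """Return the required columns that are absent or None (hypothetical helper)."""
    present = {key for key, value in sample.items() if value is not None}
    return REQUIRED_COLUMNS[variant] - present


# A sample with a conversation but no reference outcome.
sample = {"user_input": ["...conversation messages..."], "reference": None}

print(missing_columns("AgentGoalAccuracyWithReference", sample))     # {'reference'}
print(missing_columns("AgentGoalAccuracyWithoutReference", sample))  # set()
```

A sample like this one could be scored by the reference-free variant but not by the reference-based one.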

Key Methods

_multi_turn_ascore (WithReference)

async def _multi_turn_ascore(
    self,
    sample: MultiTurnSample,
    callbacks: Callbacks,
) -> float

Evaluation flow for the reference-based variant:

  1. Asserts that self.llm and sample.reference are set
  2. Serializes the conversation using sample.pretty_repr() and passes it to the workflow_prompt (InferGoalOutcomePrompt)
  3. The LLM returns a WorkflowOutput with user_goal and end_state
  4. Constructs a CompareOutcomeInput with desired_outcome=sample.reference and arrived_outcome=response.end_state
  5. Passes this to the compare_outcome_prompt (CompareOutcomePrompt)
  6. Returns the verdict as a float (0.0 or 1.0)

_multi_turn_ascore (WithoutReference)

async def _multi_turn_ascore(
    self,
    sample: MultiTurnSample,
    callbacks: Callbacks,
) -> float

Evaluation flow for the reference-free variant:

  1. Asserts that self.llm is set (no reference required)
  2. Serializes the conversation and passes it to the workflow_prompt
  3. The LLM returns a WorkflowOutput with user_goal and end_state
  4. Constructs a CompareOutcomeInput with desired_outcome=response.user_goal and arrived_outcome=response.end_state
  5. Passes this to the compare_outcome_prompt
  6. Returns the verdict as a float (0.0 or 1.0)

The key difference is that the reference-free variant uses the LLM-inferred user_goal as the desired outcome, while the reference-based variant uses the provided sample.reference.
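
The two flows can be sketched side by side as follows. This is a minimal stand-in, not the Ragas implementation: `infer_goal_outcome` and `compare_outcome` are hypothetical stubs that replace the two LLM prompt calls with canned or crude logic, and a stdlib dataclass stands in for the WorkflowOutput model. Only the choice of desired outcome mirrors the steps described above.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkflowOutput:
    user_goal: str
    end_state: str


async def infer_goal_outcome(conversation: str) -> WorkflowOutput:
    # Stub for the InferGoalOutcomePrompt call: returns a canned result.
    return WorkflowOutput(
        user_goal="Book a table at a Chinese restaurant for 8pm",
        end_state="A table at Golden Dragon is booked for 8pm",
    )


async def compare_outcome(desired_outcome: str, arrived_outcome: str) -> str:
    # Stub for the CompareOutcomePrompt call. A real LLM judge compares meaning;
    # this placeholder only counts shared lowercase words.
    shared = set(desired_outcome.lower().split()) & set(arrived_outcome.lower().split())
    return "1" if len(shared) >= 3 else "0"


async def goal_accuracy(conversation: str, reference: Optional[str] = None) -> float:
    response = await infer_goal_outcome(conversation)
    # Reference-based: compare against the supplied reference.
    # Reference-free: compare against the goal inferred from the conversation.
    desired = reference if reference is not None else response.user_goal
    verdict = await compare_outcome(desired, response.end_state)
    return float(verdict)


with_ref = asyncio.run(
    goal_accuracy("...", reference="A table is booked at a Chinese restaurant for 8:00pm.")
)
without_ref = asyncio.run(goal_accuracy("..."))
print(with_ref, without_ref)  # 1.0 1.0
```

Because the reference-free variant judges the end state against a goal the same LLM inferred, it needs no labels but depends on that inference being right.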

Supporting Pydantic Models

The module defines several Pydantic models used as prompt input/output schemas:

Model | Purpose | Fields
WorkflowInput | Input to InferGoalOutcomePrompt | workflow: str -- the serialized conversation
WorkflowOutput | Output from InferGoalOutcomePrompt | user_goal: str, end_state: str
CompareOutcomeInput | Input to CompareOutcomePrompt | desired_outcome: str, arrived_outcome: str
CompareOutcomeOutput | Output from CompareOutcomePrompt | reason: str, verdict: Literal["0", "1"]
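
To show how the string verdict relates to the float score, here is a small sketch that mirrors the CompareOutcomeOutput fields. It assumes a stdlib dataclass as a stand-in for the Pydantic model, so the runtime check is written by hand; `verdict_to_score` is a hypothetical helper, not a Ragas function.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class CompareOutcomeOutput:
    reason: str
    verdict: Literal["0", "1"]

    def __post_init__(self) -> None:
        # Dataclasses do not enforce Literal at runtime, so validate explicitly.
        if self.verdict not in ("0", "1"):
            raise ValueError(f"verdict must be '0' or '1', got {self.verdict!r}")


def verdict_to_score(output: CompareOutcomeOutput) -> float:
    """Convert the string verdict into the binary float score."""
    return float(output.verdict)


output = CompareOutcomeOutput(reason="The booking matches the desired outcome.", verdict="1")
print(verdict_to_score(output))  # 1.0
```

Constraining the verdict to two string literals keeps the LLM output easy to validate before it becomes a score.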

Prompt Classes

InferGoalOutcomePrompt

Instruction: "Given an agentic workflow comprised of Human, AI and Tools, identify the user_goal (the task or objective the user wants to achieve) and the end_state (the final outcome or result of the workflow)."

Includes a few-shot example of a restaurant booking conversation.

CompareOutcomePrompt

Instruction: "Given user goal, desired outcome and achieved outcome compare them and identify if they are the same (1) or different (0)."

Includes a few-shot example comparing two restaurant booking outcomes.

Usage Example

from ragas.metrics import AgentGoalAccuracyWithReference
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall, ToolMessage
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Configure LLM
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4"))

# Create metric
metric = AgentGoalAccuracyWithReference()
metric.llm = evaluator_llm

# Create sample
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Book a table at the best Chinese restaurant for 8pm"),
        AIMessage(
            content="Let me find options.",
            tool_calls=[ToolCall(name="restaurant_search", args={"cuisine": "Chinese"})]
        ),
        ToolMessage(content="Found: Golden Dragon, Jade Palace"),
        AIMessage(
            content="I'll book Golden Dragon.",
            tool_calls=[ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"})]
        ),
        ToolMessage(content="Booked successfully."),
        AIMessage(content="Your table at Golden Dragon is booked for 8pm.")
    ],
    reference="A table is booked at a Chinese restaurant for 8:00pm."
)

# Evaluate (run inside an async context, such as a notebook cell or a coroutine)
# score = await metric.multi_turn_ascore(sample)
# Returns 1.0 if the goal was achieved, 0.0 otherwise

Score Interpretation

Score | Meaning
1.0 | The agent successfully achieved the intended goal
0.0 | The agent did not achieve the intended goal
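
Because each score is binary, averaging scores over a set of samples gives the fraction of conversations in which the goal was judged achieved. The sketch below illustrates this with made-up scores; `goal_success_rate` is a hypothetical helper, not part of Ragas.

```python
from statistics import mean
from typing import List


def goal_success_rate(scores: List[float]) -> float:
    """Fraction of samples scored 1.0; returns 0.0 for an empty list."""
    return mean(scores) if scores else 0.0


# Illustrative scores for four conversations (not real evaluation results).
scores = [1.0, 0.0, 1.0, 1.0]
print(goal_success_rate(scores))  # 0.75
```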

Internal Dependencies

  • ragas.metrics.base.MetricWithLLM -- provides LLM integration for the metric
  • ragas.metrics.base.MultiTurnMetric -- base class for multi-turn metrics
  • ragas.prompt.PydanticPrompt -- structured prompt framework with Pydantic input/output models
  • ragas.dataset_schema.MultiTurnSample -- input sample schema
