Implementation:Explodinggradients Ragas AgentGoalAccuracy Metric

AgentGoalAccuracy Metric

AgentGoalAccuracyWithReference and AgentGoalAccuracyWithoutReference are multi-turn evaluation metrics in the Ragas library that use LLM-based judgment to determine whether an AI agent achieved its intended goal during a conversation. They return a binary score (0.0 or 1.0).

Source Location

File: src/ragas/metrics/_goal_accuracy.py
- AgentGoalAccuracyWithReference: lines 103-144
- AgentGoalAccuracyWithoutReference: lines 147-185
Repository: explodinggradients/ragas

Import

from ragas.metrics import AgentGoalAccuracyWithReference, AgentGoalAccuracyWithoutReference

Class Definitions

AgentGoalAccuracyWithReference

@dataclass
class AgentGoalAccuracyWithReference(MetricWithLLM, MultiTurnMetric):
    name: str = "agent_goal_accuracy"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "user_input",
                "reference",
            }
        }
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.BINARY
    workflow_prompt: PydanticPrompt = field(
        default_factory=lambda: InferGoalOutcomePrompt()
    )
    compare_outcome_prompt: PydanticPrompt = field(
        default_factory=lambda: CompareOutcomePrompt()
    )
    max_retries: int = 1

AgentGoalAccuracyWithoutReference

@dataclass
class AgentGoalAccuracyWithoutReference(MetricWithLLM, MultiTurnMetric):
    name: str = "agent_goal_accuracy"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "user_input",
            }
        }
    )
    workflow_prompt: PydanticPrompt = field(
        default_factory=lambda: InferGoalOutcomePrompt()
    )
    compare_outcome_prompt: PydanticPrompt = field(
        default_factory=lambda: CompareOutcomePrompt()
    )
    max_retries: int = 1

Required Columns

Variant	Required Columns	Description
`AgentGoalAccuracyWithReference`	`user_input`, `reference`	Requires both the conversation messages and a reference expected outcome string
`AgentGoalAccuracyWithoutReference`	`user_input`	Only requires the conversation messages; goal is inferred from the conversation itself
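
As an illustration of the table above, the following minimal sketch checks a sample against each variant's required columns. It does not use Ragas itself: the `REQUIRED_COLUMNS` mapping and the `missing_columns` helper are hypothetical stand-ins, and a plain dict takes the place of a `MultiTurnSample`.

```python
# Stand-in for the metrics' required-column declarations. Plain strings
# replace MetricType.MULTI_TURN so the sketch runs without ragas installed.
REQUIRED_COLUMNS = {
    "AgentGoalAccuracyWithReference": {"user_input", "reference"},
    "AgentGoalAccuracyWithoutReference": {"user_input"},
}


def missing_columns(variant: str, sample: dict) -> set:
    """Return required columns that are absent or None (hypothetical helper)."""
    required = REQUIRED_COLUMNS[variant]
    return {col for col in required if sample.get(col) is None}


# A sample with conversation messages but no reference outcome.
sample = {"user_input": ["Book a table for 8pm"], "reference": None}

print(missing_columns("AgentGoalAccuracyWithReference", sample))     # {'reference'}
print(missing_columns("AgentGoalAccuracyWithoutReference", sample))  # set()
```

A sample without a reference is therefore only usable with the reference-free variant.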

Key Methods

_multi_turn_ascore (WithReference)

async def _multi_turn_ascore(
    self,
    sample: MultiTurnSample,
    callbacks: Callbacks,
) -> float

Evaluation flow for the reference-based variant:

1. Asserts that self.llm and sample.reference are set
2. Serializes the conversation using sample.pretty_repr() and passes it to the workflow_prompt (InferGoalOutcomePrompt)
3. The LLM returns a WorkflowOutput with user_goal and end_state
4. Constructs a CompareOutcomeInput with desired_outcome=sample.reference and arrived_outcome=response.end_state
5. Passes this to the compare_outcome_prompt (CompareOutcomePrompt)
6. Returns the verdict as a float (0.0 or 1.0)
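
The steps above can be sketched as a two-stage pipeline. This is a simplified illustration, not the library's code: `fake_infer` and `fake_compare` are hypothetical stand-ins for the two LLM prompt calls, the dataclass mimics `WorkflowOutput`, and the keyword check replaces real LLM judgment.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class WorkflowOutput:
    """Mimics the fields of the WorkflowOutput model."""
    user_goal: str
    end_state: str


async def fake_infer(workflow: str) -> WorkflowOutput:
    # Stand-in for the InferGoalOutcomePrompt call; a real run queries the LLM.
    return WorkflowOutput(
        user_goal="Book a table at a Chinese restaurant for 8pm",
        end_state="A table is booked at Golden Dragon for 8pm",
    )


async def fake_compare(desired_outcome: str, arrived_outcome: str) -> str:
    # Stand-in for CompareOutcomePrompt; a crude keyword check, not LLM judgment.
    same = "booked" in desired_outcome.lower() and "booked" in arrived_outcome.lower()
    return "1" if same else "0"


async def score_with_reference(conversation: str, reference: str) -> float:
    assert reference is not None, "reference is required for this variant"
    inferred = await fake_infer(conversation)                    # infer goal and end state
    verdict = await fake_compare(reference, inferred.end_state)  # compare against the reference
    return float(verdict)                                        # "0"/"1" -> 0.0/1.0


score = asyncio.run(
    score_with_reference(
        "Human: Book a table at the best Chinese restaurant for 8pm ...",
        "A table is booked at a Chinese restaurant for 8:00pm.",
    )
)
print(score)  # 1.0
```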

_multi_turn_ascore (WithoutReference)

async def _multi_turn_ascore(
    self,
    sample: MultiTurnSample,
    callbacks: Callbacks,
) -> float

Evaluation flow for the reference-free variant:

1. Asserts that self.llm is set (no reference required)
2. Serializes the conversation and passes it to the workflow_prompt
3. The LLM returns a WorkflowOutput with user_goal and end_state
4. Constructs a CompareOutcomeInput with desired_outcome=response.user_goal and arrived_outcome=response.end_state
5. Passes this to the compare_outcome_prompt
6. Returns the verdict as a float (0.0 or 1.0)

The key difference is that the reference-free variant uses the LLM-inferred user_goal as the desired outcome, while the reference-based variant uses the provided sample.reference.
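
To make the difference concrete, the sketch below builds the comparison input for both cases. The library implements this as two separate classes rather than one branching function, so `build_compare_input` is a hypothetical helper, and a plain dict stands in for `CompareOutcomeInput`.

```python
from typing import Optional


def build_compare_input(
    user_goal: str, end_state: str, reference: Optional[str] = None
) -> dict:
    """Pick the desired outcome: the reference if given, else the inferred goal."""
    desired = reference if reference is not None else user_goal
    return {"desired_outcome": desired, "arrived_outcome": end_state}


goal = "Book a table at a Chinese restaurant for 8pm"
state = "A table is booked at Golden Dragon for 8pm"

# Reference-based: the caller-supplied reference is the desired outcome.
with_ref = build_compare_input(goal, state, reference="A table is booked for 8:00pm.")

# Reference-free: the LLM-inferred user_goal is the desired outcome.
without_ref = build_compare_input(goal, state)

print(with_ref["desired_outcome"])     # A table is booked for 8:00pm.
print(without_ref["desired_outcome"])  # Book a table at a Chinese restaurant for 8pm
```

In both cases the arrived outcome is the inferred end_state; only the desired outcome changes.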

Supporting Pydantic Models

The module defines several Pydantic models used as prompt input/output schemas:

Model	Purpose	Fields
`WorkflowInput`	Input to InferGoalOutcomePrompt	`workflow: str` -- the serialized conversation
`WorkflowOutput`	Output from InferGoalOutcomePrompt	`user_goal: str`, `end_state: str`
`CompareOutcomeInput`	Input to CompareOutcomePrompt	`desired_outcome: str`, `arrived_outcome: str`
`CompareOutcomeOutput`	Output from CompareOutcomePrompt	`reason: str`, `verdict: Literal["0", "1"]`
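
The `verdict` field is a string literal, while the metric reports a float. The sketch below shows one way that mapping could work, assuming a plain `float()` conversion. `verdict_to_score` is a hypothetical helper, and the standard library's `typing.Literal` is used here instead of a Pydantic model.

```python
from typing import Literal, get_args

# Same literal type as the verdict field of CompareOutcomeOutput.
Verdict = Literal["0", "1"]


def verdict_to_score(verdict: str) -> float:
    """Validate a string verdict and convert it to a binary score."""
    if verdict not in get_args(Verdict):
        raise ValueError(f"verdict must be '0' or '1', got {verdict!r}")
    return float(verdict)


print(verdict_to_score("1"))  # 1.0
print(verdict_to_score("0"))  # 0.0
```

Constraining the verdict to two literals means any other LLM output fails validation instead of producing an out-of-range score.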

Prompt Classes

InferGoalOutcomePrompt

Instruction: "Given an agentic workflow comprised of Human, AI and Tools, identify the user_goal (the task or objective the user wants to achieve) and the end_state (the final outcome or result of the workflow)."

Includes a few-shot example of a restaurant booking conversation.

CompareOutcomePrompt

Instruction: "Given user goal, desired outcome and achieved outcome compare them and identify if they are the same (1) or different (0)."

Includes a few-shot example comparing two restaurant booking outcomes.

Usage Example

from ragas.metrics import AgentGoalAccuracyWithReference
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall, ToolMessage
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Configure LLM
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4"))

# Create metric
metric = AgentGoalAccuracyWithReference()
metric.llm = evaluator_llm

# Create sample
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Book a table at the best Chinese restaurant for 8pm"),
        AIMessage(
            content="Let me find options.",
            tool_calls=[ToolCall(name="restaurant_search", args={"cuisine": "Chinese"})]
        ),
        ToolMessage(content="Found: Golden Dragon, Jade Palace"),
        AIMessage(
            content="I'll book Golden Dragon.",
            tool_calls=[ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"})]
        ),
        ToolMessage(content="Booked successfully."),
        AIMessage(content="Your table at Golden Dragon is booked for 8pm.")
    ],
    reference="A table is booked at a Chinese restaurant for 8:00pm."
)

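# Evaluate through the public entry point (run inside an async context,
# for example a notebook cell or an async function)
# score = await metric.multi_turn_ascore(sample)
# Returns 1.0 if the goal was achieved, 0.0 otherwise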

Score Interpretation

Score	Meaning
1.0	The agent successfully achieved the intended goal
0.0	The agent did not achieve the intended goal
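
Because each score is binary, the mean over a set of conversations can be read as the fraction of conversations in which the goal was achieved. The sketch below uses made-up scores, and `goal_success_rate` is a hypothetical helper, not part of Ragas.

```python
from statistics import mean


def goal_success_rate(scores: list) -> float:
    """Average binary goal-accuracy scores into a success rate."""
    if not scores:
        raise ValueError("no scores to aggregate")
    if any(s not in (0.0, 1.0) for s in scores):
        raise ValueError("scores must be 0.0 or 1.0")
    return mean(scores)


# Illustrative scores for four evaluated conversations (not real results).
scores = [1.0, 0.0, 1.0, 1.0]
print(goal_success_rate(scores))  # 0.75
```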

Internal Dependencies

ragas.metrics.base.MetricWithLLM -- provides LLM integration for the metric
ragas.metrics.base.MultiTurnMetric -- base class for multi-turn metrics
ragas.prompt.PydanticPrompt -- structured prompt framework with Pydantic input/output models
ragas.dataset_schema.MultiTurnSample -- input sample schema

Implements

Agent Goal Accuracy Evaluation -- the evaluation principle these metrics implement
