Implementation:Explodinggradients Ragas AgentGoalAccuracy Metric
AgentGoalAccuracy Metric
AgentGoalAccuracyWithReference and AgentGoalAccuracyWithoutReference are multi-turn evaluation metrics in the Ragas library that use LLM-based judgment to determine whether an AI agent achieved its intended goal during a conversation. They return a binary score (0.0 or 1.0).
Source Location
- File: src/ragas/metrics/_goal_accuracy.py
  - AgentGoalAccuracyWithReference: lines 103-144
  - AgentGoalAccuracyWithoutReference: lines 147-185
- Repository: explodinggradients/ragas
Import
from ragas.metrics import AgentGoalAccuracyWithReference, AgentGoalAccuracyWithoutReference
Class Definitions
AgentGoalAccuracyWithReference
@dataclass
class AgentGoalAccuracyWithReference(MetricWithLLM, MultiTurnMetric):
    name: str = "agent_goal_accuracy"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "user_input",
                "reference",
            }
        }
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.BINARY
    workflow_prompt: PydanticPrompt = field(
        default_factory=lambda: InferGoalOutcomePrompt()
    )
    compare_outcome_prompt: PydanticPrompt = field(
        default_factory=lambda: CompareOutcomePrompt()
    )
    max_retries: int = 1
AgentGoalAccuracyWithoutReference
@dataclass
class AgentGoalAccuracyWithoutReference(MetricWithLLM, MultiTurnMetric):
    name: str = "agent_goal_accuracy"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "user_input",
            }
        }
    )
    workflow_prompt: PydanticPrompt = field(
        default_factory=lambda: InferGoalOutcomePrompt()
    )
    compare_outcome_prompt: PydanticPrompt = field(
        default_factory=lambda: CompareOutcomePrompt()
    )
    max_retries: int = 1
Required Columns
| Variant | Required Columns | Description |
|---|---|---|
| AgentGoalAccuracyWithReference | user_input, reference | Requires both the conversation messages and a reference string describing the expected outcome |
| AgentGoalAccuracyWithoutReference | user_input | Requires only the conversation messages; the goal is inferred from the conversation itself |
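The column requirements in the table can be checked before a sample is scored. The following is a minimal sketch using only the standard library; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration and are not part of the Ragas API, and the sample is modeled as a plain dict rather than a MultiTurnSample.

```python
from typing import Any, Dict, Set

# Required columns per variant, copied from the table above.
REQUIRED_COLUMNS: Dict[str, Set[str]] = {
    "AgentGoalAccuracyWithReference": {"user_input", "reference"},
    "AgentGoalAccuracyWithoutReference": {"user_input"},
}


def missing_columns(variant: str, sample: Dict[str, Any]) -> Set[str]:
    """Return the required columns that are absent or None in `sample`.

    Hypothetical helper for illustration; Ragas performs its own validation.
    """
    present = {key for key, value in sample.items() if value is not None}
    return REQUIRED_COLUMNS[variant] - present


# A sample with conversation messages but no reference outcome.
sample = {"user_input": ["Book a table for 8pm"], "reference": None}
print(missing_columns("AgentGoalAccuracyWithReference", sample))     # {'reference'}
print(missing_columns("AgentGoalAccuracyWithoutReference", sample))  # set()
```

A sample without a reference is therefore usable only with the reference-free variant.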
Key Methods
_multi_turn_ascore (WithReference)
async def _multi_turn_ascore(
    self,
    sample: MultiTurnSample,
    callbacks: Callbacks,
) -> float
Evaluation flow for the reference-based variant:
- Asserts that self.llm and sample.reference are set
- Serializes the conversation using sample.pretty_repr() and passes it to the workflow_prompt (InferGoalOutcomePrompt)
- The LLM returns a WorkflowOutput with user_goal and end_state
- Constructs a CompareOutcomeInput with desired_outcome=sample.reference and arrived_outcome=response.end_state
- Passes this to the compare_outcome_prompt (CompareOutcomePrompt)
- Returns the verdict as a float (0.0 or 1.0)
_multi_turn_ascore (WithoutReference)
async def _multi_turn_ascore(
    self,
    sample: MultiTurnSample,
    callbacks: Callbacks,
) -> float
Evaluation flow for the reference-free variant:
- Asserts that self.llm is set (no reference required)
- Serializes the conversation and passes it to the workflow_prompt
- The LLM returns a WorkflowOutput with user_goal and end_state
- Constructs a CompareOutcomeInput with desired_outcome=response.user_goal and arrived_outcome=response.end_state
- Passes this to the compare_outcome_prompt
- Returns the verdict as a float (0.0 or 1.0)
The key difference is that the reference-free variant uses the LLM-inferred user_goal as the desired outcome, while the reference-based variant uses the provided sample.reference.
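The two flows can be sketched side by side without calling a real LLM. In the sketch below, `fake_infer_goal_outcome`, `fake_compare_outcome`, and `goal_accuracy` are hypothetical stand-ins written for illustration: the first returns canned text where InferGoalOutcomePrompt would query the LLM, and the second applies a toy keyword rule where CompareOutcomePrompt would ask the LLM to judge equivalence. Only the branching on the desired outcome mirrors the description above.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class WorkflowOutput:
    user_goal: str
    end_state: str


async def fake_infer_goal_outcome(workflow: str) -> WorkflowOutput:
    """Stand-in for InferGoalOutcomePrompt; ignores its input and returns canned text."""
    return WorkflowOutput(
        user_goal="Book a table at a Chinese restaurant for 8pm",
        end_state="A table is booked at a Chinese restaurant for 8pm",
    )


def _key_terms(text: str) -> Set[str]:
    """Toy feature extraction: which of a few fixed keywords appear in the text."""
    lowered = text.lower()
    return {term for term in ("chinese", "italian", "8pm") if term in lowered}


async def fake_compare_outcome(desired_outcome: str, arrived_outcome: str) -> str:
    """Stand-in for CompareOutcomePrompt; returns the verdict "1" or "0".

    Toy rule for illustration only: outcomes match when they share the same keywords.
    """
    return "1" if _key_terms(desired_outcome) == _key_terms(arrived_outcome) else "0"


async def goal_accuracy(workflow: str, reference: Optional[str] = None) -> float:
    """Sketch of both variants; pass `reference` for the reference-based flow."""
    inferred = await fake_infer_goal_outcome(workflow)
    # Reference-based: the desired outcome is the supplied reference.
    # Reference-free: the desired outcome is the inferred user goal.
    desired = reference if reference is not None else inferred.user_goal
    verdict = await fake_compare_outcome(desired, inferred.end_state)
    return float(verdict)


conversation = "Human: Book a table at a Chinese restaurant for 8pm ..."
# Reference-free: inferred goal and end state agree.
print(asyncio.run(goal_accuracy(conversation)))  # 1.0
# Reference-based: the reference names a different cuisine than the end state.
print(asyncio.run(goal_accuracy(
    conversation,
    reference="A table is booked at an Italian restaurant for 8pm",
)))  # 0.0
```

The second call shows why a reference can change the score: the same conversation fails once the expected outcome differs from what the agent actually did.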
Supporting Pydantic Models
The module defines several Pydantic models used as prompt input/output schemas:
| Model | Purpose | Fields |
|---|---|---|
| WorkflowInput | Input to InferGoalOutcomePrompt | workflow: str -- the serialized conversation |
| WorkflowOutput | Output from InferGoalOutcomePrompt | user_goal: str, end_state: str |
| CompareOutcomeInput | Input to CompareOutcomePrompt | desired_outcome: str, arrived_outcome: str |
| CompareOutcomeOutput | Output from CompareOutcomePrompt | reason: str, verdict: Literal["0", "1"] |
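Because verdict is a string literal while the metric returns a float, a conversion step sits between the two. The sketch below illustrates that relationship with a standard-library dataclass standing in for the Pydantic model, so it runs without extra dependencies. The explicit check in `__post_init__` and the `score` method are assumptions added for illustration; Pydantic would enforce the Literal constraint on its own.

```python
from dataclasses import dataclass
from typing import Literal, get_args

Verdict = Literal["0", "1"]


@dataclass
class CompareOutcomeOutput:
    """Dataclass stand-in for the Pydantic model of the same name."""

    reason: str
    verdict: Verdict

    def __post_init__(self) -> None:
        # Dataclasses do not validate type hints, so check the literal by hand.
        if self.verdict not in get_args(Verdict):
            raise ValueError(f"verdict must be '0' or '1', got {self.verdict!r}")

    def score(self) -> float:
        """Hypothetical helper: convert the string verdict to a float score."""
        return float(self.verdict)


output = CompareOutcomeOutput(reason="Both outcomes describe the same booking.", verdict="1")
print(output.score())  # 1.0
```

Constraining the verdict to two string values keeps the LLM output easy to parse and guarantees the resulting score is exactly 0.0 or 1.0.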
Prompt Classes
InferGoalOutcomePrompt
Instruction: "Given an agentic workflow comprised of Human, AI and Tools, identify the user_goal (the task or objective the user wants to achieve) and the end_state (the final outcome or result of the workflow)."
Includes a few-shot example of a restaurant booking conversation.
CompareOutcomePrompt
Instruction: "Given user goal, desired outcome and achieved outcome compare them and identify if they are the same (1) or different (0)."
Includes a few-shot example comparing two restaurant booking outcomes.
Usage Example
from ragas.metrics import AgentGoalAccuracyWithReference
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall, ToolMessage
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Configure LLM
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4"))

# Create metric
metric = AgentGoalAccuracyWithReference()
metric.llm = evaluator_llm

# Create sample
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Book a table at the best Chinese restaurant for 8pm"),
        AIMessage(
            content="Let me find options.",
            tool_calls=[ToolCall(name="restaurant_search", args={"cuisine": "Chinese"})]
        ),
        ToolMessage(content="Found: Golden Dragon, Jade Palace"),
        AIMessage(
            content="I'll book Golden Dragon.",
            tool_calls=[ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8pm"})]
        ),
        ToolMessage(content="Booked successfully."),
        AIMessage(content="Your table at Golden Dragon is booked for 8pm.")
    ],
    reference="A table is booked at a Chinese restaurant for 8:00pm."
)

# Evaluate
# score = await metric._multi_turn_ascore(sample, callbacks=[])
# Returns 1.0 if goal was achieved, 0.0 otherwise
Score Interpretation
| Score | Meaning |
|---|---|
| 1.0 | The agent successfully achieved the intended goal |
| 0.0 | The agent did not achieve the intended goal |
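Since each score is either 0.0 or 1.0, averaging the scores over a set of conversations gives the fraction in which the goal was judged achieved. The sketch below illustrates that arithmetic; `goal_success_rate` is a hypothetical helper, and the scores are made-up values rather than results from a real evaluation.

```python
from statistics import mean
from typing import Sequence


def goal_success_rate(scores: Sequence[float]) -> float:
    """Fraction of samples whose binary goal-accuracy score is 1.0."""
    if not scores:
        raise ValueError("at least one score is required")
    return mean(scores)


# Illustrative scores for four conversations: three successes, one failure.
scores = [1.0, 0.0, 1.0, 1.0]
print(goal_success_rate(scores))  # 0.75
```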
Internal Dependencies
- ragas.metrics.base.MetricWithLLM -- provides LLM integration for the metric
- ragas.metrics.base.MultiTurnMetric -- base class for multi-turn metrics
- ragas.prompt.PydanticPrompt -- structured prompt framework with Pydantic input/output models
- ragas.dataset_schema.MultiTurnSample -- input sample schema
Implements
- Agent Goal Accuracy Evaluation -- the evaluation principle these metrics implement
See Also
- Principle:Explodinggradients_Ragas_Agent_Goal_Accuracy_Evaluation
- ToolCallAccuracy Metric -- evaluating individual tool call correctness
- TopicAdherenceScore Metric -- evaluating topic adherence
- MultiTurnSample Class -- the data schema for multi-turn evaluation samples