Principle: TruEra TruLens Agent Evaluation Metrics
| Knowledge Sources | |
|---|---|
| Domains | Agent_Evaluation, LLM_Evaluation |
| Last Updated | 2026-02-14 08:00 GMT |
Overview
A specialized evaluation framework that assesses the quality of agentic LLM traces through rubric-based metrics focused on tool selection, reasoning, and answer quality.
Description
Agent Evaluation Metrics extend the standard feedback function framework with agent-specific evaluation capabilities. Unlike simple input-output evaluation, agent metrics operate on full execution traces — the complete sequence of tool calls, reasoning steps, and intermediate outputs produced by an agent.
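To make the notion of a full execution trace concrete, a trace can be modeled as an ordered list of steps. The step kinds and field names below are illustrative assumptions for this sketch, not TruLens's internal span schema.

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    """One step in an agent's execution trace (illustrative schema)."""
    kind: str          # "tool_call", "reasoning", or "output"
    name: str = ""     # tool name, for tool_call steps
    content: str = ""  # arguments, thought text, or final answer

@dataclass
class AgentTrace:
    """Full execution trace: the unit that agent metrics evaluate."""
    question: str
    steps: list[TraceStep] = field(default_factory=list)

    def tool_calls(self) -> list[TraceStep]:
        """All tool invocations, in execution order."""
        return [s for s in self.steps if s.kind == "tool_call"]

trace = AgentTrace(
    question="What is the population of Paris?",
    steps=[
        TraceStep("reasoning", content="Fact lookup needed; use search."),
        TraceStep("tool_call", name="web_search", content="population of Paris"),
        TraceStep("output", content="About 2.1 million people."),
    ],
)
```

A point-wise evaluator would see only `question` and the final `output` step; agent metrics receive the whole `steps` sequence.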
The primary agent metric is Tool Selection Quality, which evaluates whether the agent chose appropriate tools for the task. This uses a rubric-based LLM judge operating on a compressed trace representation.
The Agent GPA combines multiple metrics:
- Tool Selection: Did the agent use the right tools? (trace-level evaluation)
- Answer Relevance: Does the final answer address the question? (input-output evaluation)
- Groundedness: Is the answer supported by retrieved evidence? (context-output evaluation)
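The combination above can be sketched as an unweighted mean over metric scores already normalized to [0, 1]. The equal weighting and the scalar-argument signature are assumptions for illustration, not TruLens's exact aggregation.

```python
def agent_gpa(tool_selection: float,
              answer_relevance: float,
              groundedness: float) -> float:
    """Combine the three agent metrics into one score (sketch:
    an unweighted mean of scores already normalized to [0, 1])."""
    scores = [tool_selection, answer_relevance, groundedness]
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"metric score out of range: {s}")
    return sum(scores) / len(scores)

# e.g. strong tool use, relevant but weakly grounded answer:
gpa = agent_gpa(tool_selection=1.0, answer_relevance=0.9, groundedness=0.5)
```

An equal-weight mean keeps the score interpretable; a deployment could instead weight groundedness higher for retrieval-heavy agents.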
Usage
Use this principle when evaluating LangGraph agents or other multi-step agentic workflows. Define agent metrics using trace-level selectors that capture the full execution trace. The tool_selection_with_cot_reasons method is the primary agent evaluation function.
Theoretical Basis
Agent evaluation requires trace-level assessment rather than point-wise evaluation:

AgentGPA(trace) = aggregate(ToolSelection(trace), AnswerRelevance(input, answer), Groundedness(context, answer))

where each metric evaluates a different quality dimension of the full agent trace.
Trace compression is used to reduce token usage when passing traces to LLM judges:
- Remove redundant span attributes
- Summarize repeated tool invocations
- Preserve essential reasoning chain
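The three compression steps above can be sketched over a list of span dictionaries. The span field names and the choice of which attributes count as redundant are assumptions for this sketch.

```python
from collections import Counter

# Assumed set of fields the LLM judge actually needs.
ESSENTIAL_KEYS = {"kind", "name", "input", "reasoning"}

def compress_trace(spans: list[dict]) -> list[dict]:
    """Shrink a trace before sending it to an LLM judge:
    drop non-essential attributes, fold repeated tool calls into a
    count, and pass reasoning spans through unchanged."""
    # 1. Remove redundant span attributes (e.g. timing metadata).
    slim = [{k: v for k, v in s.items() if k in ESSENTIAL_KEYS}
            for s in spans]
    # 2. Summarize repeated tool invocations keyed by (name, input).
    compressed, seen = [], Counter()
    for span in slim:
        key = (span.get("name"), span.get("input"))
        if span.get("kind") == "tool_call" and seen[key]:
            seen[key] += 1
            continue  # fold the duplicate into the repeat count
        seen[key] += 1
        compressed.append(span)
    for span in compressed:
        key = (span.get("name"), span.get("input"))
        if seen[key] > 1:
            span["repeats"] = seen[key]
    # 3. Reasoning spans were never folded, so the chain is preserved.
    return compressed

spans = [
    {"kind": "reasoning", "reasoning": "Look it up.", "latency_ms": 12},
    {"kind": "tool_call", "name": "search", "input": "paris", "latency_ms": 80},
    {"kind": "tool_call", "name": "search", "input": "paris", "latency_ms": 75},
]
out = compress_trace(spans)
```

Here the two identical `search` calls collapse into one span annotated with `repeats`, and the timing attribute is dropped before the trace reaches the judge.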
Pseudo-code Logic:
# Abstract agent evaluation pattern
f_tool_selection = Feedback(
    provider.tool_selection_with_cot_reasons
).on(trace=Selector(trace_level=True))
# The LLM judge receives the compressed trace and scores:
# 0: Poor tool selection
# 1: Partially appropriate tools
# 2: Mostly appropriate tools
# 3: Excellent tool selection
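Judge grades on the 0-3 rubric are typically normalized to [0, 1] before being aggregated with other feedback scores. The mapping below is a sketch of that common convention, not TruLens's exact implementation.

```python
# The 0-3 rubric from the pseudo-code above.
RUBRIC = {
    0: "Poor tool selection",
    1: "Partially appropriate tools",
    2: "Mostly appropriate tools",
    3: "Excellent tool selection",
}

def normalize_rubric_score(raw: int) -> float:
    """Map a 0-3 rubric grade onto [0, 1] so it can be averaged
    with other normalized feedback scores (sketch)."""
    if raw not in RUBRIC:
        raise ValueError(f"rubric score must be 0-3, got {raw}")
    return raw / 3.0
```

Linear scaling keeps the rubric's ordering intact; a stricter deployment could instead map only grade 3 to full credit.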