Principle: TruEra TruLens Agent Evaluation Metrics
| Knowledge Sources | |
|---|---|
| Domains | Agent_Evaluation, LLM_Evaluation |
| Last Updated | 2026-02-14 08:00 GMT |
Overview
A specialized evaluation framework that assesses the quality of agentic LLM traces through rubric-based metrics focused on tool selection, reasoning, and answer quality.
Description
Agent Evaluation Metrics extend the standard feedback function framework with agent-specific evaluation capabilities. Unlike simple input-output evaluation, agent metrics operate on full execution traces — the complete sequence of tool calls, reasoning steps, and intermediate outputs produced by an agent.
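To make the notion of a full execution trace concrete, a trace can be modeled as an ordered list of steps. The step kinds and field names below are illustrative assumptions for this sketch, not TruLens's internal span schema.

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    """One step in an agent's execution trace (illustrative schema)."""
    kind: str          # "tool_call", "reasoning", or "output"
    name: str = ""     # tool name, for tool_call steps
    content: str = ""  # arguments, thought text, or final answer

@dataclass
class AgentTrace:
    """Full execution trace: the unit that agent metrics evaluate."""
    question: str
    steps: list[TraceStep] = field(default_factory=list)

    def tool_calls(self) -> list[TraceStep]:
        """All tool invocations, in execution order."""
        return [s for s in self.steps if s.kind == "tool_call"]

trace = AgentTrace(
    question="What is the population of Paris?",
    steps=[
        TraceStep("reasoning", content="Fact lookup needed; use search."),
        TraceStep("tool_call", name="web_search", content="population of Paris"),
        TraceStep("output", content="About 2.1 million people."),
    ],
)
```

A point-wise evaluator would see only `question` and the final `output` step; agent metrics receive the whole `steps` sequence.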
The primary agent metric is Tool Selection Quality, which evaluates whether the agent chose appropriate tools for the task. This uses a rubric-based LLM judge operating on a compressed trace representation.
The Agent GPA combines multiple metrics:
- Tool Selection: Did the agent use the right tools? (trace-level evaluation)
- Answer Relevance: Does the final answer address the question? (input-output evaluation)
- Groundedness: Is the answer supported by retrieved evidence? (context-output evaluation)
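The combination above can be sketched as an unweighted mean over metric scores already normalized to [0, 1]. The equal weighting and the scalar-argument signature are assumptions for illustration, not TruLens's exact aggregation.

```python
def agent_gpa(tool_selection: float,
              answer_relevance: float,
              groundedness: float) -> float:
    """Combine the three agent metrics into one score (sketch:
    an unweighted mean of scores already normalized to [0, 1])."""
    scores = [tool_selection, answer_relevance, groundedness]
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"metric score out of range: {s}")
    return sum(scores) / len(scores)

# e.g. strong tool use, relevant but weakly grounded answer:
gpa = agent_gpa(tool_selection=1.0, answer_relevance=0.9, groundedness=0.5)
```

An equal-weight mean keeps the score interpretable; a deployment could instead weight groundedness higher for retrieval-heavy agents.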
Usage
Use this principle when evaluating LangGraph agents or other multi-step agentic workflows. Define agent metrics using trace-level selectors that capture the full execution trace. The tool_selection_with_cot_reasons method is the primary agent evaluation function.
Theoretical Basis
Agent evaluation requires trace-level assessment rather than point-wise evaluation:

AgentGPA(trace) = aggregate(ToolSelection(trace), AnswerRelevance(input, answer), Groundedness(context, answer))

where each metric evaluates a different quality dimension of the full agent trace.
Trace compression is used to reduce token usage when passing traces to LLM judges:
- Remove redundant span attributes
- Summarize repeated tool invocations
- Preserve essential reasoning chain
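The three compression steps above can be sketched over a list of span dictionaries. The span field names and the choice of which attributes count as redundant are assumptions for this sketch.

```python
from collections import Counter

# Assumed set of fields the LLM judge actually needs.
ESSENTIAL_KEYS = {"kind", "name", "input", "reasoning"}

def compress_trace(spans: list[dict]) -> list[dict]:
    """Shrink a trace before sending it to an LLM judge:
    drop non-essential attributes, fold repeated tool calls into a
    count, and pass reasoning spans through unchanged."""
    # 1. Remove redundant span attributes (e.g. timing metadata).
    slim = [{k: v for k, v in s.items() if k in ESSENTIAL_KEYS}
            for s in spans]
    # 2. Summarize repeated tool invocations keyed by (name, input).
    compressed, seen = [], Counter()
    for span in slim:
        key = (span.get("name"), span.get("input"))
        if span.get("kind") == "tool_call" and seen[key]:
            seen[key] += 1
            continue  # fold the duplicate into the repeat count
        seen[key] += 1
        compressed.append(span)
    for span in compressed:
        key = (span.get("name"), span.get("input"))
        if seen[key] > 1:
            span["repeats"] = seen[key]
    # 3. Reasoning spans were never folded, so the chain is preserved.
    return compressed

spans = [
    {"kind": "reasoning", "reasoning": "Look it up.", "latency_ms": 12},
    {"kind": "tool_call", "name": "search", "input": "paris", "latency_ms": 80},
    {"kind": "tool_call", "name": "search", "input": "paris", "latency_ms": 75},
]
out = compress_trace(spans)
```

Here the two identical `search` calls collapse into one span annotated with `repeats`, and the timing attribute is dropped before the trace reaches the judge.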
Pseudo-code Logic:
# Abstract agent evaluation pattern
f_tool_selection = Feedback(
    provider.tool_selection_with_cot_reasons
).on(trace=Selector(trace_level=True))
# The LLM judge receives the compressed trace and scores:
# 0: Poor tool selection
# 1: Partially appropriate tools
# 2: Mostly appropriate tools
# 3: Excellent tool selection
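Judge grades on the 0-3 rubric are typically normalized to [0, 1] before being aggregated with other feedback scores. The mapping below is a sketch of that common convention, not TruLens's exact implementation.

```python
# The 0-3 rubric from the pseudo-code above.
RUBRIC = {
    0: "Poor tool selection",
    1: "Partially appropriate tools",
    2: "Mostly appropriate tools",
    3: "Excellent tool selection",
}

def normalize_rubric_score(raw: int) -> float:
    """Map a 0-3 rubric grade onto [0, 1] so it can be averaged
    with other normalized feedback scores (sketch)."""
    if raw not in RUBRIC:
        raise ValueError(f"rubric score must be 0-3, got {raw}")
    return raw / 3.0
```

Linear scaling keeps the rubric's ordering intact; a stricter deployment could instead map only grade 3 to full credit.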