
Principle:Truera Trulens Agent Evaluation Metrics

From Leeroopedia
Domains Agent_Evaluation, LLM_Evaluation
Last Updated 2026-02-14 08:00 GMT

Overview

A specialized evaluation framework that assesses the quality of agentic LLM traces through rubric-based metrics focused on tool selection, reasoning, and answer quality.

Description

Agent Evaluation Metrics extend the standard feedback function framework with agent-specific evaluation capabilities. Unlike simple input-output evaluation, agent metrics operate on full execution traces — the complete sequence of tool calls, reasoning steps, and intermediate outputs produced by an agent.

The primary agent metric is Tool Selection Quality, which evaluates whether the agent chose appropriate tools for the task. This uses a rubric-based LLM judge operating on a compressed trace representation.

The Agent GPA combines multiple metrics:

  • Tool Selection: Did the agent use the right tools? (trace-level evaluation)
  • Answer Relevance: Does the final answer address the question? (input-output evaluation)
  • Groundedness: Is the answer supported by retrieved evidence? (context-output evaluation)

Usage

Use this principle when evaluating LangGraph agents or other multi-step agentic workflows. Define agent metrics using trace-level selectors that capture the full execution trace. The tool_selection_with_cot_reasons method is the primary agent evaluation function.

Theoretical Basis

Agent evaluation requires trace-level assessment rather than point-wise evaluation:

$$\mathrm{AgentGPA} = \frac{1}{N}\sum_{i=1}^{N} w_i \,\mathrm{metric}_i(\mathrm{trace})$$

Where each metric evaluates a different quality dimension of the full agent trace.
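As a worked illustration, the weighted average above can be computed as follows. This is a minimal sketch: the metric names, scores, and weights are illustrative assumptions, not defaults from the library.

```python
def agent_gpa(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Compute (1/N) * sum of w_i * metric_i over one trace's metric scores."""
    n = len(scores)
    return sum(weights[name] * scores[name] for name in scores) / n

# Hypothetical per-metric scores (normalized to 0-1) for a single trace.
scores = {"tool_selection": 0.9, "answer_relevance": 0.8, "groundedness": 0.7}
weights = {"tool_selection": 1.0, "answer_relevance": 1.0, "groundedness": 1.0}

# With equal weights of 1.0, the GPA reduces to a simple mean.
print(round(agent_gpa(scores, weights), 3))  # -> 0.8
```

Unequal weights let one dimension (for example, groundedness) dominate the aggregate without changing the individual judges.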

Trace compression is used to reduce token usage when passing traces to LLM judges:

  • Remove redundant span attributes
  • Summarize repeated tool invocations
  • Preserve essential reasoning chain
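The three steps above can be sketched as a simple compression pass. The span schema here (dicts with `tool`, `input`, `output`, and `reasoning` keys) is an assumption for illustration; the actual trace format is library-specific.

```python
from itertools import groupby

# Keys assumed essential for the judge; everything else is treated as redundant.
ESSENTIAL_KEYS = {"tool", "input", "output", "reasoning"}

def compress_trace(spans: list[dict]) -> list[dict]:
    # 1. Remove redundant span attributes, keeping only essential keys.
    slim = [{k: v for k, v in s.items() if k in ESSENTIAL_KEYS} for s in spans]
    # 2. Summarize consecutive repeated invocations of the same tool.
    compressed = []
    for tool, group in groupby(slim, key=lambda s: s.get("tool")):
        group = list(group)
        if tool is not None and len(group) > 1:
            compressed.append({"tool": tool, "summary": f"called {len(group)} times"})
        else:
            compressed.extend(group)
    # 3. Reasoning spans (no "tool" key) pass through unchanged, preserving
    #    the essential reasoning chain.
    return compressed
```

Running this over two consecutive `search` calls followed by a reasoning step collapses the tool calls into one summary span while the reasoning survives verbatim.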

Pseudo-code Logic:

# Abstract agent evaluation pattern
f_tool_selection = Feedback(
    provider.tool_selection_with_cot_reasons
).on(trace=Selector(trace_level=True))

# The LLM judge receives the compressed trace and scores:
# 0: Poor tool selection
# 1: Partially appropriate tools
# 2: Mostly appropriate tools
# 3: Excellent tool selection
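Rubric scores on the 0-3 scale above are often normalized to 0-1 before being combined with other metrics in the GPA. The normalization convention below is an assumption for illustration, not necessarily the library's own behavior.

```python
# The 0-3 rubric from the judge, as described above.
RUBRIC = {
    0: "Poor tool selection",
    1: "Partially appropriate tools",
    2: "Mostly appropriate tools",
    3: "Excellent tool selection",
}

def normalize_rubric_score(score: int, max_score: int = 3) -> float:
    """Map a raw rubric score onto [0, 1] by dividing by the scale maximum."""
    if score not in RUBRIC:
        raise ValueError(f"score must be one of {sorted(RUBRIC)}")
    return score / max_score

print(round(normalize_rubric_score(2), 2))  # -> 0.67
```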
