Workflow:Truera Trulens LangGraph Agent Evaluation

From Leeroopedia
Knowledge Sources
Domains LLM_Ops, Evaluation, Agents
Last Updated 2026-02-14 08:00 GMT

Overview

End-to-end process for evaluating LangGraph-based agents and multi-agent workflows using TruGraph instrumentation and agent-specific evaluation metrics.

Description

This workflow covers how to instrument and evaluate LangGraph agents (including Deep Agents and multi-agent orchestrations) with TruLens. TruGraph automatically instruments compiled LangGraph graphs, capturing agent decisions, tool calls, planning steps, and generation outputs as OpenTelemetry (OTEL) spans. Evaluation uses agent-specific metrics such as tool selection quality, logical consistency, and plan adherence, collectively known as the Agent GPA (Goal-Plan-Action) metrics.

Usage

Use this workflow when you have a LangGraph-based agent or multi-agent system and need to evaluate its decision-making quality, tool usage, and overall response quality. TruGraph is the required path for LangGraph applications; TruApp with @instrument should not be used for them. The workflow supports single agents, multi-agent orchestrations, and agents with planning capabilities.

Execution Steps

Step 1: Initialize TruLens Session

Create a TruSession instance to manage the evaluation infrastructure. The session coordinates trace collection, feedback evaluation, and result storage.

Key considerations:

  • TruSession is a singleton per process
  • Must be initialized before wrapping any LangGraph agent
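A minimal sketch of the session setup, assuming the `trulens-core` package is installed and exposes `TruSession` at `trulens.core`. The helper name `start_session` and its `reset` flag are hypothetical, added only for illustration.

```python
def start_session(reset: bool = False):
    """Create (or fetch) the process-wide TruSession.

    The import is done lazily so this module still loads on machines
    where TruLens is not installed.
    """
    # Assumption: trulens-core is installed and provides TruSession here.
    from trulens.core import TruSession

    session = TruSession()
    if reset:
        # Clears previously stored records; only sensible in experiments.
        session.reset_database()
    return session
```

Because the session is a singleton per process, calling the helper a second time should hand back the same underlying session rather than a fresh one.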

Step 2: Build LangGraph Agent

Construct the LangGraph agent using the standard LangGraph API (StateGraph, nodes, edges, tools). The agent should be compiled into a CompiledGraph before wrapping with TruGraph. This step is your existing application code.

Key considerations:

  • TruGraph wraps CompiledGraph objects (the result of graph.compile())
  • Both sync and async agent invocations are supported
  • Multi-agent orchestrations with subgraphs are supported
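As a rough illustration of what "compiled" means here, the sketch below builds a one-node graph with the standard LangGraph builder API. The state type `QAState` and the node `answer_node` are hypothetical stand-ins for your own application code; a real agent would call an LLM or tools inside the node.

```python
from typing import TypedDict


class QAState(TypedDict, total=False):
    """Hypothetical graph state shared between nodes."""
    question: str
    answer: str


def answer_node(state: QAState) -> dict:
    # Placeholder logic: echoes the question instead of calling a model.
    return {"answer": f"echo: {state['question']}"}


def build_graph():
    """Return a compiled graph, the object that TruGraph wraps."""
    # Lazy import so the module loads without langgraph installed.
    from langgraph.graph import END, START, StateGraph

    builder = StateGraph(QAState)
    builder.add_node("answer", answer_node)
    builder.add_edge(START, "answer")
    builder.add_edge("answer", END)
    # compile() turns the builder into a runnable graph.
    return builder.compile()
```

The value returned by `build_graph()` is what the later steps wrap, not the `StateGraph` builder itself.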

Step 3: Define Agent Evaluation Metrics

Configure feedback functions tailored for agent evaluation. The Agent GPA metrics include tool selection (did the agent pick the right tool), tool calling (were parameters correct), tool quality (did the tool return useful results), logical consistency (is the reasoning chain coherent), execution efficiency, and optionally plan quality and plan adherence for planning agents.

Key considerations:

  • Agent metrics evaluate different aspects than RAG metrics
  • Use span types AGENT, TOOL, and GENERATION to select relevant data
  • Combine agent metrics with RAG Triad metrics if the agent performs retrieval
  • Custom criteria can be passed to feedback functions for domain-specific evaluation
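The metric choices above can be captured as a small planning helper before any feedback functions are written. This is only a sketch: the identifiers are hypothetical labels, not TruLens function names, and the glosses for execution efficiency and the two plan metrics are paraphrases rather than text from this page.

```python
# Core Agent GPA metrics and the question each one answers.
AGENT_METRICS = {
    "tool_selection": "Did the agent pick the right tool?",
    "tool_calling": "Were the tool parameters correct?",
    "tool_quality": "Did the tool return useful results?",
    "logical_consistency": "Is the reasoning chain coherent?",
    "execution_efficiency": "Did the agent avoid unnecessary steps?",  # paraphrase
}

# Optional metrics that only apply to agents with a planning step.
PLANNING_METRICS = {
    "plan_quality": "Is the plan itself sound?",            # paraphrase
    "plan_adherence": "Did the agent follow its plan?",     # paraphrase
}

# The RAG Triad, added when the agent performs retrieval.
RAG_TRIAD = ("context_relevance", "groundedness", "answer_relevance")


def select_metrics(has_planner: bool = False, does_retrieval: bool = False) -> list:
    """Return the metric labels worth configuring for a given agent."""
    names = list(AGENT_METRICS)
    if has_planner:
        names += list(PLANNING_METRICS)
    if does_retrieval:
        names += list(RAG_TRIAD)
    return names
```

For example, `select_metrics(has_planner=True)` yields the five core labels followed by the two plan metrics.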

Step 4: Wrap Agent With TruGraph

Wrap the compiled LangGraph agent with TruGraph, specifying app_name, app_version, and the list of feedback functions. TruGraph automatically instruments all graph nodes, edges, tool calls, and LLM invocations.

Key considerations:

  • Pass the compiled graph (not the StateGraph) to TruGraph
  • TruGraph is required for LangGraph apps; do not use TruApp with @instrument for LangGraph
  • The METHODS parameter on TruGraph controls which methods are instrumented (defaults cover standard patterns)
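A hedged sketch of the wrapping step, assuming `TruGraph` is importable from `trulens.apps.langgraph` and accepts `app_name`, `app_version`, and `feedbacks` as described above. The helper `wrap_agent`, its default names, and the `invoke` check are illustrative additions, not part of TruLens.

```python
def wrap_agent(compiled_graph, feedbacks, app_name="demo-agent", app_version="v1"):
    """Wrap a compiled LangGraph graph with TruGraph."""
    # A StateGraph builder has no invoke(); a compiled graph does.
    # This guard catches the common mistake of passing the builder.
    if not callable(getattr(compiled_graph, "invoke", None)):
        raise TypeError(
            "expected the result of graph.compile(), not the StateGraph builder"
        )

    # Assumption: the trulens-apps-langgraph package is installed.
    from trulens.apps.langgraph import TruGraph

    return TruGraph(
        compiled_graph,
        app_name=app_name,
        app_version=app_version,
        feedbacks=feedbacks,
    )
```

Bumping `app_version` for each agent revision keeps the versions separate when results are compared later.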

Step 5: Record Agent Executions

Execute the agent within the TruGraph recording context. Each agent invocation produces a rich trace showing the full decision path including node transitions, tool calls, and LLM completions.

Key considerations:

  • Use the context manager: with tru_agent as recording
  • Agent traces can be complex with multiple tool calls and node transitions
  • Both invoke() and stream() patterns are supported
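The recording pattern can be sketched as a small helper. It assumes only that the recorder is a context manager (as the TruGraph wrapper is used above) and that the graph exposes `invoke()`; the function name `record_run` is hypothetical.

```python
def record_run(recorder, graph, inputs):
    """Invoke the graph inside the recording context.

    `recorder` is the TruGraph wrapper (any context manager works),
    `graph` is the compiled LangGraph graph, and `inputs` is the
    initial state passed to invoke().
    """
    with recorder as recording:
        # Everything executed inside this block is traced.
        result = graph.invoke(inputs)
    return result, recording
```

Keeping the `invoke()` call inside the `with` block matters: calls made outside it run normally but are not captured as part of the recording.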

Step 6: Analyze Agent Performance

Retrieve evaluation results and analyze agent performance using the dashboard or programmatic API. The leaderboard shows aggregate scores across Agent GPA metrics, while the records view allows drilling into individual agent decisions.

Key considerations:

  • Compare agent versions to measure improvement in decision-making
  • Use the trace viewer to understand individual agent decision paths
  • Identify patterns in tool selection errors or reasoning failures
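Version comparison can also be done programmatically once per-metric averages are in hand, for instance copied from the leaderboard. The sketch below works on plain dictionaries; the function names, metric labels, and scores are made up for illustration and do not come from a real run.

```python
def compare_versions(baseline: dict, candidate: dict) -> dict:
    """Per-metric score change (candidate minus baseline).

    Only metrics present in both versions are compared.
    """
    shared = baseline.keys() & candidate.keys()
    return {name: candidate[name] - baseline[name] for name in sorted(shared)}


def regressions(deltas: dict, tolerance: float = 0.0) -> list:
    """Metrics whose score dropped by more than `tolerance`."""
    return [name for name, delta in deltas.items() if delta < -tolerance]


# Illustrative scores only.
v1 = {"tool_selection": 0.5, "logical_consistency": 0.75}
v2 = {"tool_selection": 0.75, "logical_consistency": 0.5, "plan_adherence": 1.0}

deltas = compare_versions(v1, v2)
print(deltas)
print(regressions(deltas))
```

A metric flagged by `regressions` is a good starting point for the trace viewer, since the individual records show which decisions pulled the average down.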
