

Principle:Confident ai Deepeval Framework Instrumentation

From Leeroopedia
Metadata
Last Updated 2026-02-14 09:00 GMT

Overview

A design principle for instrumenting third-party agent frameworks to enable automatic trace collection during agent execution. Because each agent framework exposes different callback, hook, or middleware mechanisms, framework-specific adapters are required to capture execution traces in a unified format suitable for evaluation.

Description

Modern AI agent frameworks such as LangChain, LangGraph, PydanticAI, and the OpenAI Agents SDK each provide their own extensibility interfaces for observing agent behavior at runtime. These interfaces differ significantly:

  • LangChain/LangGraph use a callback handler pattern -- classes that inherit from BaseCallbackHandler and receive lifecycle events (LLM start, tool call, chain completion, errors) as method invocations.
  • PydanticAI uses OpenTelemetry instrumentation settings -- configuration that attaches span processors, which capture execution traces as OpenTelemetry (OTel) spans.
  • OpenAI Agents SDK uses a tracing processor interface -- classes implementing TracingProcessor that receive span start/end events.

The framework instrumentation principle recognizes that, because these interfaces share no common shape, a single universal adapter is insufficient. Instead, each integration implements the adapter pattern, translating framework-specific events into DeepEval's internal trace representation (LLM calls, tool invocations, agent steps, and nested spans). Downstream evaluation metrics can then operate on a consistent data model regardless of which agent framework produced the trace.
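
As a rough illustration of the adapter idea, the sketch below feeds two differently shaped event sources into one common record. Every name here (UnifiedSpan, CallbackStyleAdapter, SpanProcessorStyleAdapter, and the method names) is hypothetical and chosen for illustration; none of it is DeepEval's actual schema or any framework's real interface.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class UnifiedSpan:
    """Hypothetical common span record (not DeepEval's real schema)."""
    kind: str            # e.g. "llm", "tool", or "agent"
    name: str
    input: Any = None
    output: Any = None


class CallbackStyleAdapter:
    """Adapter for a framework that reports lifecycle events as method calls."""

    def __init__(self, sink: list):
        self.sink = sink
        self._open = {}  # run_id -> span that has started but not ended

    def on_tool_start(self, run_id: str, tool_name: str, tool_input: Any) -> None:
        self._open[run_id] = UnifiedSpan(kind="tool", name=tool_name, input=tool_input)

    def on_tool_end(self, run_id: str, output: Any) -> None:
        span = self._open.pop(run_id)
        span.output = output
        self.sink.append(span)


class SpanProcessorStyleAdapter:
    """Adapter for a framework that hands over finished spans as dicts."""

    def __init__(self, sink: list):
        self.sink = sink

    def on_span_end(self, raw: dict) -> None:
        self.sink.append(
            UnifiedSpan(
                kind=raw["type"],
                name=raw["name"],
                input=raw.get("input"),
                output=raw.get("output"),
            )
        )


# Both adapters write into the same sink using the same record type.
sink = []
callback_adapter = CallbackStyleAdapter(sink)
callback_adapter.on_tool_start("run-1", "search", {"q": "weather"})
callback_adapter.on_tool_end("run-1", "sunny")

processor_adapter = SpanProcessorStyleAdapter(sink)
processor_adapter.on_span_end(
    {"type": "tool", "name": "search", "input": {"q": "weather"}, "output": "sunny"}
)
```

Although the two sources deliver the tool call in different shapes, both end up as equal UnifiedSpan records, which is what lets later evaluation code ignore where a trace came from.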

Usage

Framework instrumentation is used when:

  • An agent built with a supported framework needs to be automatically evaluated without manually constructing test cases.
  • Developers want to capture production traces for offline evaluation or monitoring.
  • Conversation-level metrics (task completion, tool use correctness, step efficiency) require full execution traces rather than simple input/output pairs.

The general pattern is:

FRAMEWORK_INSTRUMENTATION(framework F):
    1. IDENTIFY the callback/hook interface provided by F
    2. IMPLEMENT an adapter class conforming to F's interface
    3. On each lifecycle event (LLM call, tool use, agent step):
        a. TRANSLATE the event into DeepEval's internal trace format
        b. ACCUMULATE trace spans in a hierarchical structure
    4. On execution completion:
        a. FINALIZE the trace
        b. OPTIONALLY run evaluation metrics against the collected trace
        c. OPTIONALLY push traces to the Confident AI platform
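
Steps 3 and 4 of the pattern can be sketched with a simple stack that nests spans as events arrive. This is a minimal sketch under stated assumptions: Span, TraceCollector, and count_tool_calls are hypothetical names, the trace is assumed to have a single root span, and pushing to a platform is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical span node; children hold nested spans."""
    kind: str
    name: str
    children: list = field(default_factory=list)
    output: Optional[str] = None


class TraceCollector:
    """Accumulates spans hierarchically using a stack of open spans."""

    def __init__(self):
        self._stack = []   # spans that have started but not yet ended
        self.root = None   # assumes a single root span per trace

    def start(self, kind: str, name: str) -> None:
        span = Span(kind, name)
        if self._stack:
            # Nest under the innermost span that is still open.
            self._stack[-1].children.append(span)
        else:
            self.root = span
        self._stack.append(span)

    def end(self, output: Optional[str] = None) -> None:
        span = self._stack.pop()
        span.output = output

    def finalize(self, metrics=()) -> dict:
        """Close out the trace and optionally run metrics against it."""
        if self._stack:
            raise RuntimeError("cannot finalize: some spans are still open")
        return {metric.__name__: metric(self.root) for metric in metrics}


def count_tool_calls(span: Span) -> int:
    """Toy metric: number of tool spans in the tree rooted at `span`."""
    return int(span.kind == "tool") + sum(count_tool_calls(c) for c in span.children)


collector = TraceCollector()
collector.start("agent", "planner")
collector.start("llm", "draft")
collector.end("call search")
collector.start("tool", "search")
collector.end("3 results")
collector.end("done")
results = collector.finalize(metrics=[count_tool_calls])
```

A stack works here because lifecycle events in this toy example are strictly nested; frameworks that run steps concurrently would more plausibly need explicit span and parent identifiers instead.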

Theoretical Basis

This principle draws from several established software engineering patterns:

  • Adapter pattern -- each framework integration adapts a framework-specific interface to DeepEval's internal trace model, allowing the evaluation engine to remain framework-agnostic.
  • Framework integration -- the instrumentation hooks into existing extension points rather than requiring source code modification, following the open-closed principle.
  • Callback-based instrumentation -- by subscribing to lifecycle callbacks, the instrumentation layer observes agent behavior passively without altering execution semantics. This is analogous to aspect-oriented programming, where cross-cutting concerns (tracing, evaluation) are separated from core logic.
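
The passive-observation point can be illustrated with a toy example. ToyAgent below is a hypothetical stand-in for a framework that exposes a callback extension point; it is not any real framework's API. The recorder is attached through that extension point, so the agent's own code is left unmodified and its result is unchanged.

```python
class ToyAgent:
    """Hypothetical agent exposing a callback list as its extension point."""

    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])

    def run(self, prompt: str) -> str:
        self._emit("agent_start", prompt)
        answer = prompt.upper()  # core logic, unaware of any tracing
        self._emit("agent_end", answer)
        return answer

    def _emit(self, event: str, payload: str) -> None:
        for callback in self.callbacks:
            callback(event, payload)


events = []


def recorder(event: str, payload: str) -> None:
    """Observer that only records events; it never touches the payload."""
    events.append((event, payload))


plain = ToyAgent().run("hello")
traced = ToyAgent(callbacks=[recorder]).run("hello")
```

Both runs return the same answer; the only difference is that the traced run leaves a record of the lifecycle events in events.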

The key insight is that evaluation should be decoupled from the agent framework. By standardizing on a common trace format and providing per-framework adapters, DeepEval achieves broad framework coverage while maintaining a single evaluation pipeline.
