Workflow: Confident AI DeepEval Component-Level LLM Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Observability, Component_Testing |
| Last Updated | 2026-02-14 09:00 GMT |
Overview
End-to-end process for evaluating individual components within an LLM application using DeepEval's tracing-based observation framework and per-component metrics.
Description
This workflow covers fine-grained evaluation of internal LLM application components rather than treating the system as a black box. Using the @observe() decorator, individual functions (LLM calls, retrievers, tool calls, generators) are instrumented with specific metrics. At runtime, each observed component creates its own test case via update_current_span(), enabling metrics like AnswerRelevancyMetric on the generator and ContextualRelevancyMetric on the retriever simultaneously. The tracing is non-intrusive and does not require rewriting the application codebase.
Usage
Execute this workflow when you need to identify which specific component in a multi-step LLM pipeline is underperforming. This is appropriate for RAG pipelines (evaluate retriever and generator independently), multi-agent systems (evaluate each agent's contribution), or any composite LLM application where end-to-end evaluation alone is insufficient for debugging quality issues.
Execution Steps
Step 1: Instrument Components with Observe Decorator
Annotate each function that represents a distinct component in your LLM pipeline with the @observe() decorator. Assign relevant metrics to each component. Optionally set a custom name and span type (llm, retriever, tool, agent) for clear trace visualization.
Key considerations:
- Each @observe() call creates a span in the trace tree
- Metrics are passed as a list: @observe(metrics=[MetricA(), MetricB()])
- Span types help categorize: @observe(type="llm"), @observe(type="retriever")
- Nesting observed functions automatically creates parent-child span relationships
Step 2: Create Runtime Test Cases per Component
Within each observed function, call update_current_span() to attach an LLMTestCase with the component's specific inputs and outputs. This allows each metric to evaluate the component in isolation using its own context.
What to set per component type:
- LLM/Generator: input, actual_output
- Retriever: input, actual_output, retrieval_context
- Tool: input, actual_output, tools_called
- Agent: input, actual_output (aggregated from sub-components)
Step 3: Define the Entry Point
Create a top-level observed function that orchestrates the component calls. This function serves as the root of the trace tree and can optionally carry its own metrics for end-to-end evaluation alongside the component-level metrics.
Structure:
- The entry point function calls sub-components in sequence
- Each sub-component's span is nested under the entry point
- The trace captures the complete execution flow
Step 4: Execute Evaluation with Dataset
Run the observed application against an evaluation dataset using EvaluationDataset with evals_iterator(). For each golden input, invoke the observed callback to generate traces and trigger component-level metric evaluation automatically.
Execution patterns:
- Iterator mode: Loop through dataset.evals_iterator() and call the observed function
- Direct mode: Use evaluate(observed_callback=your_app, goldens=[...])
- Pytest mode: Use assert_test(golden=golden, observed_callback=your_app)
Step 5: Review Component-Level Results
Analyze results to see per-component metric scores, identifying exactly which component contributed to quality issues. Each span in the trace shows its own metrics, enabling targeted debugging and optimization.
Result breakdown:
- Trace-level results (overall execution)
- Span-level results (per component)
- Each span shows its attached metrics with scores and reasons
- Visual trace tree on Confident AI platform for interactive exploration