Workflow: Confident AI DeepEval Component-Level LLM Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Observability, Component_Testing |
| Last Updated | 2026-02-14 09:00 GMT |
Overview
End-to-end process for evaluating individual components within an LLM application using DeepEval's tracing-based observation framework and per-component metrics.
Description
This workflow covers fine-grained evaluation of internal LLM application components rather than treating the system as a black box. Using the @observe() decorator, individual functions (LLM calls, retrievers, tool calls, generators) are instrumented with specific metrics. At runtime, each observed component creates its own test case via update_current_span(), enabling metrics like AnswerRelevancyMetric on the generator and ContextualRelevancyMetric on the retriever simultaneously. The tracing is non-intrusive and does not require rewriting the application codebase.
Usage
Execute this workflow when you need to identify which specific component in a multi-step LLM pipeline is underperforming. This is appropriate for RAG pipelines (evaluate retriever and generator independently), multi-agent systems (evaluate each agent's contribution), or any composite LLM application where end-to-end evaluation alone is insufficient for debugging quality issues.
Execution Steps
Step 1: Instrument Components with Observe Decorator
Annotate each function that represents a distinct component in your LLM pipeline with the @observe() decorator. Assign relevant metrics to each component. Optionally set a custom name and span type (llm, retriever, tool, agent) for clear trace visualization.
Key considerations:
- Each @observe() call creates a span in the trace tree
- Metrics are passed as a list: @observe(metrics=[MetricA(), MetricB()])
- Span types help categorize: @observe(type="llm"), @observe(type="retriever")
- Nesting observed functions automatically creates parent-child span relationships
Step 2: Create Runtime Test Cases per Component
Within each observed function, call update_current_span() to attach an LLMTestCase with the component's specific inputs and outputs. This allows each metric to evaluate the component in isolation using its own context.
What to set per component type:
- LLM/Generator: input, actual_output
- Retriever: input, actual_output, retrieval_context
- Tool: input, actual_output, tools_called
- Agent: input, actual_output (aggregated from sub-components)
Step 3: Define the Entry Point
Create a top-level observed function that orchestrates the component calls. This function serves as the root of the trace tree and can optionally carry its own metrics for end-to-end evaluation alongside the component-level metrics.
Structure:
- The entry point function calls sub-components in sequence
- Each sub-component's span is nested under the entry point
- The trace captures the complete execution flow
Step 4: Execute Evaluation with Dataset
Run the observed application against an evaluation dataset using EvaluationDataset with evals_iterator(). For each golden input, invoke the observed callback to generate traces and trigger component-level metric evaluation automatically.
Execution patterns:
- Iterator mode: Loop through dataset.evals_iterator() and call the observed function
- Direct mode: Use evaluate(observed_callback=your_app, goldens=[...])
- Pytest mode: Use assert_test(golden=golden, observed_callback=your_app)
Step 5: Review Component-Level Results
Analyze results to see per-component metric scores, identifying exactly which component contributed to quality issues. Each span in the trace shows its own metrics, enabling targeted debugging and optimization.
Result breakdown:
- Trace-level results (overall execution)
- Span-level results (per component)
- Each span shows its attached metrics with scores and reasons
- Visual trace tree on Confident AI platform for interactive exploration