Principle:Confident ai Deepeval Span Level Test Cases

**Metadata**
Knowledge Sources	DeepEval
Domains	LLM_Evaluation Observability Tracing
Last Updated	2026-02-14 09:00 GMT

Overview

A design principle for attaching evaluation data to individual traced spans at runtime, enabling metric evaluation at the component level rather than only at the end-to-end pipeline output. Span-level test cases bridge the gap between tracing (what happened) and evaluation (how well it happened) by enriching each span with the data fields that metrics require.

Description

When an LLM application is instrumented with tracing, each component's execution is recorded as a span that automatically captures the component's inputs and outputs. However, many evaluation metrics require additional context that cannot be derived from the function's arguments and return value alone. For example:

A faithfulness metric needs the retrieval_context (the documents retrieved) to assess whether the generated output is grounded in the provided context.
A tool correctness metric needs expected_tools to compare against tools_called.
A correctness metric needs the expected_output (ground truth) to compare against the actual output.

These supplementary data fields must be attached to the span at runtime, within the component function itself, because only the application code has access to this information. The principle of span-level test cases establishes that:

Each span can be enriched with test case data (retrieval context, expected output, metadata, etc.) during execution.
The enriched span then contains all the fields necessary for per-component metric evaluation.
This enables component-level evaluation -- metrics that assess retriever quality, generator faithfulness, or tool correctness independently within the same trace.

Without span-level test case enrichment, metrics would be limited to evaluating only the auto-captured input/output pairs, which is often insufficient for meaningful quality assessment.
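The gap between auto-captured data and what a metric needs can be made concrete with a small sketch. Everything here is hypothetical and for illustration only: SpanTestCase, REQUIRED_FIELDS, and missing_fields are not DeepEval APIs, and the per-metric requirements simply restate the three examples above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpanTestCase:
    """Hypothetical container for the evaluation fields carried by one span."""
    input: Optional[str] = None
    actual_output: Optional[str] = None
    expected_output: Optional[str] = None
    retrieval_context: Optional[List[str]] = None
    tools_called: Optional[List[str]] = None
    expected_tools: Optional[List[str]] = None


# Illustrative mapping only (an assumption, not a library definition):
# the fields each kind of metric needs, following the examples in the text.
REQUIRED_FIELDS = {
    "faithfulness": ("actual_output", "retrieval_context"),
    "tool_correctness": ("tools_called", "expected_tools"),
    "correctness": ("actual_output", "expected_output"),
}


def missing_fields(metric_name: str, test_case: SpanTestCase) -> List[str]:
    """Return the required fields that are still unset on the test case."""
    return [
        name
        for name in REQUIRED_FIELDS[metric_name]
        if getattr(test_case, name) is None
    ]


# Only what tracing captures automatically: input and output.
auto_captured = SpanTestCase(
    input="What is the refund window?",
    actual_output="30 days.",
)

# The same span after the component attached its retrieved documents.
enriched = SpanTestCase(
    input="What is the refund window?",
    actual_output="30 days.",
    retrieval_context=["Refunds are accepted within 30 days."],
)

faithfulness_gap_before = missing_fields("faithfulness", auto_captured)
faithfulness_gap_after = missing_fields("faithfulness", enriched)
correctness_gap_after = missing_fields("correctness", enriched)
```

In this sketch the faithfulness check becomes possible only after enrichment, while the correctness check still lacks an expected_output, which is the kind of field the application code would have to supply.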

Usage

Apply span-level test case enrichment when:

A component's evaluation requires data beyond its function signature -- e.g., retrieved documents, ground truth answers, or expected tool calls.
You want to evaluate retriever quality within a RAG pipeline by attaching the retrieved documents as retrieval_context to the retriever span.
You need to provide expected outputs for correctness metrics at the span level rather than only at the trace level.
You are building component-level regression tests that validate specific data flows through individual pipeline stages.

Theoretical Basis

Span-level test cases draw on two established patterns:

Context-Local State Propagation

In concurrent and asynchronous systems, context-local state (e.g., Python's contextvars) provides a mechanism for propagating request-scoped data through the call stack without explicit parameter passing. Span-level test case enrichment leverages this pattern: the current span is accessible via context-local state, and any function within the span's execution scope can update it with additional evaluation data.

The abstract pattern is:

ENRICH_SPAN(evaluation_data):
    # The span is looked up from context, not passed as a parameter
    current_span = GET_FROM_CONTEXT()
    FOR field IN evaluation_data:
        current_span.test_case[field.name] = field.value
    # Metrics can now access all enriched fields
    # when evaluating this span

This approach preserves separation of concerns -- the function's return value remains its business output, while evaluation data is attached as a side channel to the tracing context.

Span Enrichment Patterns

In distributed tracing systems (OpenTelemetry, Jaeger, Zipkin), spans are routinely enriched with attributes, events, and links after creation. Span-level test cases extend this concept to the evaluation domain: rather than enriching spans with operational metadata (HTTP status codes, database query durations), spans are enriched with evaluation-relevant data (retrieval context, expected outputs, ground truth labels).

Key properties of span enrichment for evaluation:

Incremental -- Data can be attached at any point during span execution, not only at creation time.
Composable -- Multiple calls to the enrichment function accumulate data; each call adds its fields to the span's test case without discarding fields set by earlier calls.
Scoped -- Enrichment affects only the current span, not its parent or sibling spans. This ensures that component-level data remains isolated to the correct evaluation context.

Related Pages

Implementation:Confident_ai_Deepeval_Update_Current_Span
