Implementation:Confident ai Deepeval EvaluationDataset Evals Iterator

From Leeroopedia
Last Updated 2026-02-14 09:00 GMT

Overview

The EvaluationDataset.evals_iterator method provides an iterator-based interface for evaluating traced LLM applications across a dataset of golden test cases. It yields each Golden object in sequence, allowing the caller to invoke the instrumented application for each golden, while the iterator manages trace collection, metric evaluation, and result aggregation behind the scenes.

Description

The evals_iterator method is the primary mechanism for combining dataset-driven testing with trace-based evaluation in DeepEval. When called, it returns an iterator over the dataset's goldens. For each golden yielded:

  1. The caller invokes the @observe(type="agent")-decorated application function with the golden's input.
  2. The application execution produces a complete trace (root span plus child spans).
  3. After the iteration completes (or the for loop exits), the iterator evaluates all trace-level and span-level metrics against the collected traces.
  4. Results are aggregated into an EvaluationResult containing per-trace and per-metric scores.

Key behaviors:

  • Lazy iteration -- Goldens are yielded one at a time, allowing the application to be invoked within the loop body. This gives the caller full control over how the application is called (e.g., passing additional context, handling errors).
  • Automatic trace binding -- The iterator automatically associates each golden with the trace produced during that iteration. No explicit trace management is required.
  • Configurable metrics -- Trace-level metrics can be passed via the metrics parameter. These are evaluated against each trace's root span data after all iterations complete.
  • Evaluation run identity -- The identifier parameter names the evaluation run, enabling comparison across multiple runs (e.g., before and after a prompt change).
  • Display, cache, error, and async configuration -- Optional configuration objects control progress display, result caching, error handling, and async evaluation behavior.
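
To illustrate why a run identifier is useful, the sketch below stores per-golden scores under a run name and compares two runs. The store, the helper functions, and the score values are all made up for illustration; they do not reflect how DeepEval persists or compares results.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical result store keyed by run identifier (not a DeepEval structure).
run_scores: Dict[str, List[float]] = {}


def record_run(identifier: str, scores: List[float]) -> None:
    """Store the per-golden scores of one evaluation run under its name."""
    run_scores[identifier] = list(scores)


def compare_runs(baseline: str, candidate: str) -> float:
    """Return the change in mean score from baseline to candidate."""
    return mean(run_scores[candidate]) - mean(run_scores[baseline])


# Illustrative numbers only.
record_run("v1_baseline", [0.5, 0.75, 0.25])
record_run("v2_prompt_experiment", [0.75, 1.0, 0.5])

delta = compare_runs("v1_baseline", "v2_prompt_experiment")
print(f"mean score change: {delta:+.2f}")
```

Without a stable name per run, the two score lists above would have no key to be looked up by, which is the role the identifier parameter plays.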

Usage

Import the dataset class, then consume evals_iterator in a standard Python for loop:

from deepeval.dataset import EvaluationDataset

The @observe(type="agent")-decorated application function should be called inside the loop body for each yielded golden.

Code Reference

Source Location

  • Repository: confident-ai/deepeval
  • File: deepeval/dataset/dataset.py (lines 1300--1429)

Signature

def evals_iterator(
    self,
    metrics=None,
    identifier=None,
    display_config=None,
    cache_config=None,
    error_config=None,
    async_config=None,
    run_otel=False,
) -> Iterator[Golden]:
    ...

Import

from deepeval.dataset import EvaluationDataset, Golden

I/O Contract

Inputs

Input Contract
Name | Type | Description
self | EvaluationDataset | The dataset instance containing a list of Golden objects to iterate over.
metrics | Optional[List[BaseMetric]] | Trace-level metrics to evaluate against each collected trace after iteration completes.
identifier | Optional[str] | A name for the evaluation run, used to identify and compare runs in dashboards or result stores.
display_config | Optional[DisplayConfig] | Configuration for progress display during evaluation (e.g., progress bars, verbosity).
cache_config | Optional[CacheConfig] | Configuration for caching evaluation results to avoid redundant metric computation.
error_config | Optional[ErrorConfig] | Configuration for error handling behavior (e.g., fail-fast vs. continue-on-error).
async_config | Optional[AsyncConfig] | Configuration for asynchronous metric evaluation (e.g., concurrency limits).
run_otel | bool | When True, enables OpenTelemetry-based trace export in addition to DeepEval's native tracing. Defaults to False.

Outputs

Output Contract
Name | Type | Description
Iterator[Golden] | Iterator | Yields Golden objects one at a time from the dataset. Each golden contains at minimum an input field and optionally expected_output, context, and other reference data.
EvaluationResult | EvaluationResult (after iteration) | After the iteration completes, the evaluation results (per-trace metric scores, aggregated statistics) are available via the dataset or a returned result object.
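
As a rough picture of what "per-trace metric scores, aggregated statistics" could look like, the sketch below groups scores by metric and computes a mean and a pass rate. `ToyMetricScore` and `aggregate` are hypothetical and are not the shape of DeepEval's EvaluationResult; the 0.5 threshold and the scores are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyMetricScore:
    """One metric's score for one trace (hypothetical shape)."""
    trace_id: str
    metric_name: str
    score: float
    threshold: float = 0.5  # assumed pass mark for this sketch

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def aggregate(scores: List[ToyMetricScore]) -> Dict[str, Dict[str, float]]:
    """Group scores by metric and compute a mean score and a pass rate."""
    grouped: Dict[str, List[ToyMetricScore]] = {}
    for item in scores:
        grouped.setdefault(item.metric_name, []).append(item)
    return {
        name: {
            "mean": sum(s.score for s in items) / len(items),
            "pass_rate": sum(s.passed for s in items) / len(items),
        }
        for name, items in grouped.items()
    }


# Illustrative scores for two traces and two metrics.
per_trace = [
    ToyMetricScore("trace-1", "answer_relevancy", 1.0),
    ToyMetricScore("trace-2", "answer_relevancy", 0.25),
    ToyMetricScore("trace-1", "faithfulness", 0.75),
    ToyMetricScore("trace-2", "faithfulness", 0.75),
]
summary = aggregate(per_trace)
print(summary)
```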

Usage Examples

Example 1: Basic Dataset-Driven Trace Evaluation

Iterating over a dataset of goldens to evaluate a RAG pipeline with trace-level metrics.

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe

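# retrieve() and generate() are placeholders for your own retrieval
# and generation functions; they are not part of DeepEval.
@observe(type="agent")
def rag_pipeline(query: str) -> str:
    docs = retrieve(query)
    return generate(query, docs)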

dataset = EvaluationDataset(goldens=[
    Golden(input="What is AI?"),
    Golden(input="Explain neural networks"),
    Golden(input="What is backpropagation?"),
])

metric = AnswerRelevancyMetric()

for golden in dataset.evals_iterator(metrics=[metric]):
    rag_pipeline(golden.input)

  • Each golden is yielded by the iterator, and the caller invokes the @observe(type="agent")-decorated pipeline.
  • After the loop completes, the AnswerRelevancyMetric is evaluated against each trace's root span.

Example 2: Named Evaluation Run with Expected Outputs

Using the identifier parameter to name the evaluation run for comparison purposes.

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

dataset = EvaluationDataset(goldens=[
    Golden(
        input="What is machine learning?",
        expected_output="Machine learning is a subset of AI..."
    ),
    Golden(
        input="Define deep learning",
        expected_output="Deep learning uses neural networks..."
    ),
])

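# rag_pipeline is the @observe-decorated function from Example 1
for golden in dataset.evals_iterator(
    metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
    identifier="v2_prompt_experiment"
):
    rag_pipeline(golden.input)

  • The identifier="v2_prompt_experiment" names this evaluation run, enabling comparison with other runs (e.g., "v1_baseline").
  • Each golden includes an expected_output, which is available to metrics that compare against a reference answer. The two metrics used here do not read it: AnswerRelevancyMetric works from the input and actual output, and FaithfulnessMetric additionally requires a retrieval context.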

Example 3: Evaluation with Error Handling

Continuing evaluation even if individual golden invocations fail.

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric

dataset = EvaluationDataset(goldens=[
    Golden(input="Query that might fail"),
    Golden(input="Normal query"),
])

metric = AnswerRelevancyMetric()

# rag_pipeline is the @observe-decorated function from Example 1
for golden in dataset.evals_iterator(
    metrics=[metric],
    identifier="robustness_test"
):
    try:
        rag_pipeline(golden.input)
    except Exception as e:
        # Log and move on so the remaining goldens are still processed
        print(f"Failed on: {golden.input}, error: {e}")

  • Error handling is the caller's responsibility within the loop body.
  • Because the exception is caught inside the loop body, the for loop moves on to the next golden instead of aborting the run.
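
A variation on the pattern above is to collect failures instead of printing them, so they can be inspected after the loop. The sketch below uses a plain list in place of dataset.evals_iterator(...) so that it runs without DeepEval; `run_collecting_failures` and `flaky_app` are hypothetical names.

```python
from typing import Callable, Iterable, List, Tuple


def run_collecting_failures(
    inputs: Iterable[str],
    app: Callable[[str], str],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Call app on every input; return outputs and (input, error) pairs."""
    outputs: List[str] = []
    failures: List[Tuple[str, str]] = []
    for text in inputs:  # stands in for dataset.evals_iterator(...)
        try:
            outputs.append(app(text))
        except Exception as exc:  # broad on purpose: keep the loop going
            failures.append((text, repr(exc)))
    return outputs, failures


def flaky_app(query: str) -> str:
    """Hypothetical application that rejects empty queries."""
    if not query:
        raise ValueError("empty query")
    return f"answer to {query}"


outputs, failures = run_collecting_failures(["", "Normal query"], flaky_app)
print(outputs)
print(failures)
```

Keeping the failed inputs in a list makes it straightforward to report them alongside the metric results or to rerun only those goldens.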
