Implementation:Confident ai Deepeval EvaluationDataset Evals Iterator

**Metadata**
Knowledge Sources	DeepEval
Domains	LLM_Evaluation Observability Tracing
Last Updated	2026-02-14 09:00 GMT

Overview

The EvaluationDataset.evals_iterator method provides an iterator-based interface for evaluating traced LLM applications across a dataset of golden test cases. It yields each Golden object in sequence, allowing the caller to invoke the instrumented application for each golden, while the iterator manages trace collection, metric evaluation, and result aggregation behind the scenes.

Description

The evals_iterator method is the primary mechanism for combining dataset-driven testing with trace-based evaluation in DeepEval. When called, it returns an iterator over the dataset's goldens. For each golden yielded:

The caller invokes the @observe(type="agent")-decorated application function with the golden's input.
The application execution produces a complete trace (root span plus child spans).
After the iteration completes (or the for loop exits), the iterator evaluates all trace-level and span-level metrics against the collected traces.
Results are aggregated into an EvaluationResult containing per-trace and per-metric scores.

Key behaviors:

Lazy iteration -- Goldens are yielded one at a time, allowing the application to be invoked within the loop body. This gives the caller full control over how the application is called (e.g., passing additional context, handling errors).
Automatic trace binding -- The iterator automatically associates each golden with the trace produced during that iteration. No explicit trace management is required.
Configurable metrics -- Trace-level metrics can be passed via the metrics parameter. These are evaluated against each trace's root span data after all iterations complete.
Evaluation run identity -- The identifier parameter names the evaluation run, enabling comparison across multiple runs (e.g., before and after a prompt change).
Display, cache, error, and async configuration -- Optional configuration objects control progress display, result caching, error handling, and async evaluation behavior.

Usage

Import and use within a standard Python for loop:

from deepeval.dataset import EvaluationDataset

The @observe(type="agent")-decorated application function should be called inside the loop body for each yielded golden.

Code Reference

Source Location

Repository: confident-ai/deepeval
File: deepeval/dataset/dataset.py (lines 1300--1429)

Signature

def evals_iterator(
    self,
    metrics=None,
    identifier=None,
    display_config=None,
    cache_config=None,
    error_config=None,
    async_config=None,
    run_otel=False,
) -> Iterator[Golden]:
    ...

Import

from deepeval.dataset import EvaluationDataset, Golden

I/O Contract

Inputs

**Input Contract**
Name	Type	Description
`self`	EvaluationDataset	The dataset instance containing a list of `Golden` objects to iterate over.
`metrics`	Optional[List[BaseMetric]]	Trace-level metrics to evaluate against each collected trace after iteration completes.
`identifier`	Optional[str]	A name for the evaluation run, used to identify and compare runs in dashboards or result stores.
`display_config`	Optional[DisplayConfig]	Configuration for progress display during evaluation (e.g., progress bars, verbosity).
`cache_config`	Optional[CacheConfig]	Configuration for caching evaluation results to avoid redundant metric computation.
`error_config`	Optional[ErrorConfig]	Configuration for error handling behavior (e.g., fail-fast vs. continue-on-error).
`async_config`	Optional[AsyncConfig]	Configuration for asynchronous metric evaluation (e.g., concurrency limits).
`run_otel`	bool	When `True`, enables OpenTelemetry-based trace export in addition to DeepEval's native tracing. Defaults to `False`.

Outputs

**Output Contract**
Name	Type	Description
Iterator[Golden]	Iterator	Yields `Golden` objects one at a time from the dataset. Each golden contains at minimum an `input` field and optionally `expected_output`, `context`, and other reference data.
EvaluationResult	EvaluationResult (after iteration)	Not the return value of the call, which is the iterator itself. After the iteration completes, the evaluation results (per-trace metric scores, aggregated statistics) are available via the dataset or a result object.

Usage Examples

Example 1: Basic Dataset-Driven Trace Evaluation

Iterating over a dataset of goldens to evaluate a RAG pipeline with trace-level metrics.

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe

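# retrieve() and generate() stand for the application's own retrieval and
# generation steps; they are not part of DeepEval.
@observe(type="agent")
def rag_pipeline(query: str) -> str:
    docs = retrieve(query)
    return generate(query, docs)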

dataset = EvaluationDataset(goldens=[
    Golden(input="What is AI?"),
    Golden(input="Explain neural networks"),
    Golden(input="What is backpropagation?"),
])

metric = AnswerRelevancyMetric()

for golden in dataset.evals_iterator(metrics=[metric]):
    rag_pipeline(golden.input)

Each golden is yielded by the iterator, and the caller invokes the @observe(type="agent")-decorated pipeline.
After the loop completes, the AnswerRelevancyMetric is evaluated against each trace's root span.

Example 2: Named Evaluation Run with Expected Outputs

Using the identifier parameter to name the evaluation run for comparison purposes.

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

dataset = EvaluationDataset(goldens=[
    Golden(
        input="What is machine learning?",
        expected_output="Machine learning is a subset of AI..."
    ),
    Golden(
        input="Define deep learning",
        expected_output="Deep learning uses neural networks..."
    ),
])

# rag_pipeline is the @observe-decorated function defined in Example 1.
for golden in dataset.evals_iterator(
    metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
    identifier="v2_prompt_experiment"
):
    rag_pipeline(golden.input)

The identifier="v2_prompt_experiment" names this evaluation run, enabling comparison with other runs (e.g., "v1_baseline").
Each golden includes an expected_output, which is available to metrics that compare the answer against a reference. The two metrics used here do not require it.

Example 3: Evaluation with Error Handling

Continuing evaluation even if individual golden invocations fail.

from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="Query that might fail"),
    Golden(input="Normal query"),
])

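# metric and rag_pipeline are defined as in Example 1.
for golden in dataset.evals_iterator(
    metrics=[metric],
    identifier="robustness_test"
):
    try:
        rag_pipeline(golden.input)
    except Exception as e:
        # Catch here so one failing golden does not end the whole loop.
        print(f"Failed on: {golden.input}, error: {e}")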

Error handling is the caller's responsibility within the loop body.
Because the exception is caught inside the loop body, the for loop proceeds to the next golden. An uncaught exception would propagate out of the loop and end the iteration early.
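
The difference between catching an exception inside the loop body and letting it escape can be seen with a plain Python generator. This sketch is a stand-in that does not use DeepEval: goldens_iterator and flaky_app are hypothetical names, and the failure is simulated.

```python
from typing import Iterator, List


def goldens_iterator(inputs: List[str]) -> Iterator[str]:
    """Plain generator standing in for an iterator over goldens."""
    for item in inputs:
        yield item


def flaky_app(query: str) -> str:
    """Hypothetical application that fails on one input."""
    if "fail" in query:
        raise ValueError("simulated application failure")
    return f"answer to: {query}"


inputs = ["Query that might fail", "Normal query"]

# Caught inside the loop body: iteration continues to the next item.
handled: List[str] = []
for query in goldens_iterator(inputs):
    try:
        handled.append(flaky_app(query))
    except ValueError:
        handled.append("<failed>")

# Not caught in the body: the exception leaves the loop, so the
# remaining items are never yielded.
unhandled: List[str] = []
try:
    for query in goldens_iterator(inputs):
        unhandled.append(flaky_app(query))
except ValueError:
    pass

print(handled)
print(unhandled)
```

In the first loop both inputs are visited; in the second the loop stops at the first failure, which is why the try/except belongs inside the loop body.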

Related Pages

Principle:Confident_ai_Deepeval_Dataset_Driven_Evaluation
