Implementation:Confident ai Deepeval EvaluationDataset Evals Iterator
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-14 09:00 GMT |
Overview
The EvaluationDataset.evals_iterator method provides an iterator-based interface for evaluating traced LLM applications across a dataset of golden test cases. It yields each Golden object in sequence, allowing the caller to invoke the instrumented application for each golden, while the iterator manages trace collection, metric evaluation, and result aggregation behind the scenes.
Description
The evals_iterator method is the primary mechanism for combining dataset-driven testing with trace-based evaluation in DeepEval. When called, it returns an iterator over the dataset's goldens. For each golden yielded:
- The caller invokes the @observe(type="agent")-decorated application function with the golden's input.
- The application execution produces a complete trace (root span plus child spans).
- After the iteration completes (or the for loop exits), the iterator evaluates all trace-level and span-level metrics against the collected traces.
- Results are aggregated into an EvaluationResult containing per-trace and per-metric scores.
Key behaviors:
- Lazy iteration -- Goldens are yielded one at a time, allowing the application to be invoked within the loop body. This gives the caller full control over how the application is called (e.g., passing additional context, handling errors).
- Automatic trace binding -- The iterator automatically associates each golden with the trace produced during that iteration. No explicit trace management is required.
- Configurable metrics -- Trace-level metrics can be passed via the metrics parameter. These are evaluated against each trace's root span data after all iterations complete.
- Evaluation run identity -- The identifier parameter names the evaluation run, enabling comparison across multiple runs (e.g., before and after a prompt change).
- Display, cache, error, and async configuration -- Optional configuration objects control progress display, result caching, error handling, and async evaluation behavior.
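The yield-then-evaluate lifecycle described above can be illustrated with a plain Python generator. This is a conceptual sketch only, not DeepEval's implementation: ToyGolden, ToyDataset, record_trace, and toy_evals_iterator are hypothetical names, and the "metric" is just a function that scores a string. It shows that code placed after the generator's internal loop runs only once the caller's for loop has consumed every golden.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class ToyGolden:
    """Hypothetical stand-in for a golden: only an input string."""
    input: str


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset that collects traces and scores."""
    goldens: List[ToyGolden]
    traces: List[str] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)

    def record_trace(self, output: str) -> None:
        # In this toy, a "trace" is just the application's output string.
        self.traces.append(output)

    def toy_evals_iterator(
        self, metric: Callable[[str], float]
    ) -> Iterator[ToyGolden]:
        self.traces.clear()
        self.scores.clear()
        for golden in self.goldens:
            # Control returns to the caller's loop body here.
            yield golden
        # Reached only after the caller has consumed every golden.
        self.scores = [metric(trace) for trace in self.traces]


dataset = ToyDataset([ToyGolden("a"), ToyGolden("bb")])
for golden in dataset.toy_evals_iterator(metric=lambda t: float(len(t))):
    # No scoring has happened yet while the loop is still running.
    assert dataset.scores == []
    dataset.record_trace(golden.input.upper())

print(dataset.scores)  # [1.0, 2.0]
```

In this toy the caller records the trace explicitly; the page states that the real iterator binds traces automatically, so that step would not appear in caller code.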
Usage
Import and use within a standard Python for loop:
from deepeval.dataset import EvaluationDataset
The @observe(type="agent")-decorated application function should be called inside the loop body for each yielded golden.
Code Reference
Source Location
- Repository: confident-ai/deepeval
- File: deepeval/dataset/dataset.py (lines 1300--1429)
Signature
def evals_iterator(
    self,
    metrics=None,
    identifier=None,
    display_config=None,
    cache_config=None,
    error_config=None,
    async_config=None,
    run_otel=False,
) -> Iterator[Golden]:
    ...
Import
from deepeval.dataset import EvaluationDataset, Golden
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| self | EvaluationDataset | The dataset instance containing a list of Golden objects to iterate over. |
| metrics | Optional[List[BaseMetric]] | Trace-level metrics to evaluate against each collected trace after iteration completes. |
| identifier | Optional[str] | A name for the evaluation run, used to identify and compare runs in dashboards or result stores. |
| display_config | Optional[DisplayConfig] | Configuration for progress display during evaluation (e.g., progress bars, verbosity). |
| cache_config | Optional[CacheConfig] | Configuration for caching evaluation results to avoid redundant metric computation. |
| error_config | Optional[ErrorConfig] | Configuration for error handling behavior (e.g., fail-fast vs. continue-on-error). |
| async_config | Optional[AsyncConfig] | Configuration for asynchronous metric evaluation (e.g., concurrency limits). |
| run_otel | bool | When True, enables OpenTelemetry-based trace export in addition to DeepEval's native tracing. Defaults to False. |
Outputs
| Name | Type | Description |
|---|---|---|
| Iterator[Golden] | Iterator | Yields Golden objects one at a time from the dataset. Each golden contains at minimum an input field and optionally expected_output, context, and other reference data. |
| EvaluationResult | EvaluationResult (after iteration) | After the iteration completes, the evaluation results (per-trace metric scores, aggregated statistics) are available via the dataset or a returned result object. |
Usage Examples
Example 1: Basic Dataset-Driven Trace Evaluation
Iterating over a dataset of goldens to evaluate a RAG pipeline with trace-level metrics.
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe
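
# retrieve() and generate() are placeholders for the application's own
# retrieval and generation steps; they are not part of DeepEval.
@observe(type="agent")
def rag_pipeline(query: str) -> str:
    docs = retrieve(query)
    return generate(query, docs)

dataset = EvaluationDataset(goldens=[
    Golden(input="What is AI?"),
    Golden(input="Explain neural networks"),
    Golden(input="What is backpropagation?"),
])
metric = AnswerRelevancyMetric()

# Each iteration yields one golden; the traced pipeline is called in the loop body.
for golden in dataset.evals_iterator(metrics=[metric]):
    rag_pipeline(golden.input)
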
- Each golden is yielded by the iterator, and the caller invokes the @observe(type="agent")-decorated pipeline.
- After the loop completes, the AnswerRelevancyMetric is evaluated against each trace's root span.
Example 2: Named Evaluation Run with Expected Outputs
Using the identifier parameter to name the evaluation run for comparison purposes.
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

dataset = EvaluationDataset(goldens=[
    Golden(
        input="What is machine learning?",
        expected_output="Machine learning is a subset of AI..."
    ),
    Golden(
        input="Define deep learning",
        expected_output="Deep learning uses neural networks..."
    ),
])

# rag_pipeline is the @observe(type="agent")-decorated function from Example 1.
for golden in dataset.evals_iterator(
    metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
    identifier="v2_prompt_experiment"
):
    rag_pipeline(golden.input)

- The identifier="v2_prompt_experiment" names this evaluation run, enabling comparison with other runs (e.g., "v1_baseline").
- Each golden includes an expected_output, which is available to correctness-based metrics.
Example 3: Evaluation with Error Handling
Continuing evaluation even if individual golden invocations fail.
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="Query that might fail"),
    Golden(input="Normal query"),
])

# metric and rag_pipeline are the objects defined in Example 1.
for golden in dataset.evals_iterator(
    metrics=[metric],
    identifier="robustness_test"
):
    try:
        rag_pipeline(golden.input)
    except Exception as e:
        # Catching here lets the loop move on to the next golden.
        print(f"Failed on: {golden.input}, error: {e}")

- Error handling is the caller's responsibility within the loop body.
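- Because the exception is caught inside the loop body, the for loop proceeds to the next golden. An uncaught exception would instead propagate out of the loop and end the iteration early.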
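Printing failures, as in Example 3, loses them once the run ends. A minimal sketch of collecting them instead is shown below, using only the standard library. The names run_with_failure_log and flaky_app are hypothetical, and a plain list of strings stands in for the goldens so that the snippet runs without DeepEval installed. In real use the loop would iterate over dataset.evals_iterator(...) and pass golden.input to the traced application.

```python
from typing import Callable, Iterable, List, Tuple


def run_with_failure_log(
    inputs: Iterable[str], app: Callable[[str], str]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Call app on every input, recording failures instead of stopping."""
    outputs: List[str] = []
    failures: List[Tuple[str, str]] = []
    for text in inputs:
        try:
            outputs.append(app(text))
        except Exception as exc:
            # Keep the failing input and the exception type for later review.
            failures.append((text, type(exc).__name__))
    return outputs, failures


def flaky_app(query: str) -> str:
    """Hypothetical application that fails on some inputs."""
    if "fail" in query:
        raise ValueError("simulated failure")
    return query.upper()


outputs, failures = run_with_failure_log(
    ["Query that might fail", "Normal query"], flaky_app
)
print(outputs)   # ['NORMAL QUERY']
print(failures)  # [('Query that might fail', 'ValueError')]
```

Keeping the failures in a list makes it straightforward to report how many goldens produced no output, which matters when interpreting aggregate metric scores for the run.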