# Implementation: Confident AI DeepEval EvaluationDataset
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-14 09:00 GMT |
## Overview
Concrete dataset class that aggregates golden test cases into a reusable evaluation collection. The EvaluationDataset is a Python dataclass that holds a list of Golden (or ConversationalGolden) objects and provides methods for dataset management, including synchronization with the Confident AI platform.
## Description
The EvaluationDataset serves as the primary container for organizing evaluation test cases in DeepEval. It aggregates golden objects and provides the interface for batch evaluation, dataset persistence, and platform synchronization.
Key capabilities:

- Golden aggregation -- collects multiple `Golden` or `ConversationalGolden` objects into a single dataset.
- Batch evaluation support -- serves as the input to DeepEval's evaluation pipeline for running all test cases in a single operation.
- Platform synchronization -- supports pushing and pulling datasets to/from the Confident AI platform via the `confident_api_key` (see the sketch after this list).
- Flexible construction -- accepts goldens at construction time or allows them to be added programmatically.
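A minimal sketch of the push/pull round trip, using the dataset's `push` and `pull` methods; the alias string below is a placeholder:

```python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[Golden(input="task1")])

# Upload the dataset to Confident AI under an alias (placeholder name).
dataset.push(alias="my-eval-dataset")

# Later, or on another machine, retrieve the same dataset by alias.
restored = EvaluationDataset()
restored.pull(alias="my-eval-dataset")
```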
## Usage
Import and construct an evaluation dataset:

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()  # goldens defaults to an empty list
```
## Code Reference
### Source Location

- Repository: `confident-ai/deepeval`
- File: `deepeval/dataset/dataset.py` (lines 68--107)
### Signature

```python
@dataclass
class EvaluationDataset:
    goldens: Union[List[Golden], List[ConversationalGolden]] = field(
        default_factory=list
    )
    confident_api_key: Optional[str] = None
```
### Import

```python
from deepeval.dataset import EvaluationDataset
```
### Parent Class

- Python `dataclass` (not a class hierarchy; uses the `@dataclass` decorator)
## I/O Contract

### Inputs (Constructor Parameters)

| Name | Type | Default | Description |
|---|---|---|---|
| `goldens` | `Union[List[Golden], List[ConversationalGolden]]` | `[]` | List of golden test cases that make up the evaluation dataset. Can contain either standard `Golden` objects or `ConversationalGolden` objects for multi-turn evaluation. |
| `confident_api_key` | `Optional[str]` | `None` | API key for the Confident AI platform. Used for pushing and pulling datasets to/from the platform. Falls back to an environment variable if not provided. |
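The environment fallback can also be made explicit at construction time. A minimal sketch, assuming the variable is named `CONFIDENT_API_KEY` (verify the exact name against your DeepEval installation):

```python
import os

from deepeval.dataset import EvaluationDataset

# Passing the key explicitly takes precedence; CONFIDENT_API_KEY is the
# assumed environment variable name for the fallback.
dataset = EvaluationDataset(
    confident_api_key=os.environ.get("CONFIDENT_API_KEY")
)
```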
### Outputs

| Name | Type | Description |
|---|---|---|
| Dataset object | `EvaluationDataset` | A dataset instance that can be passed to DeepEval's evaluation functions for batch evaluation. |
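To illustrate the batch-evaluation path, the sketch below materializes each golden into an `LLMTestCase` and hands the results to DeepEval's `evaluate` function. `my_app` is a hypothetical application under test, and `AnswerRelevancyMetric` is just one example metric:

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def my_app(prompt: str) -> str:
    # Hypothetical LLM application under test.
    return "stub answer"

dataset = EvaluationDataset(goldens=[
    Golden(input="task1"),
    Golden(input="task2"),
])

# Run the app on each golden's input to produce concrete test cases.
test_cases = [
    LLMTestCase(
        input=golden.input,
        actual_output=my_app(golden.input),
        expected_output=golden.expected_output,
    )
    for golden in dataset.goldens
]

evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])
```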
## Usage Examples
### Example 1: Basic Dataset Construction
Create a dataset from a list of golden test cases.
```python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="task1"),
    Golden(input="task2"),
])
```
- The `goldens` parameter accepts a list of `Golden` objects.
- The resulting dataset contains two test cases ready for evaluation.
### Example 2: Dataset with Expected Outputs and Tools
Create a comprehensive evaluation dataset for agent testing.
```python
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import ToolCall

dataset = EvaluationDataset(goldens=[
    Golden(
        input="What's the weather in Tokyo?",
        expected_output="The current weather in Tokyo is...",
        expected_tools=[ToolCall(name="get_weather")],
    ),
    Golden(
        input="Calculate 15% tip on $85",
        expected_output="The tip amount is $12.75",
        expected_tools=[ToolCall(name="calculator")],
    ),
    Golden(
        input="Find recent news about AI",
        expected_output="Here are the latest AI news stories...",
        expected_tools=[ToolCall(name="search")],
    ),
])
```
- Each golden test case specifies input, expected output, and expected tool calls.
- This dataset can be used with `TaskCompletionMetric`, `ToolUseMetric`, and other agent metrics.
### Example 3: Programmatic Dataset Construction
Build a dataset programmatically from an external data source.
```python
import json

from deepeval.dataset import EvaluationDataset, Golden

with open("test_cases.json") as f:
    raw_cases = json.load(f)

goldens = [
    Golden(
        input=case["input"],
        expected_output=case.get("expected_output"),
        context=case.get("context"),
    )
    for case in raw_cases
]

dataset = EvaluationDataset(goldens=goldens)
```
- `Golden` objects are constructed from external JSON data.
- The dataset aggregates all test cases for batch evaluation.
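Because `goldens` is a plain dataclass list field, test cases can also be appended after construction, which supports the "added programmatically" path described above:

```python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset()

# goldens is a regular Python list, so standard list operations
# work for incremental dataset construction.
dataset.goldens.append(Golden(input="task added later"))
```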