# Implementation: Confident AI DeepEval EvaluationDataset
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-14 09:00 GMT |
## Overview
Concrete dataset class that aggregates golden test cases into a reusable evaluation collection. The EvaluationDataset is a Python dataclass that holds a list of Golden (or ConversationalGolden) objects and provides methods for dataset management, including synchronization with the Confident AI platform.
## Description
The EvaluationDataset serves as the primary container for organizing evaluation test cases in DeepEval. It aggregates golden objects and provides the interface for batch evaluation, dataset persistence, and platform synchronization.
Key capabilities:

- Golden aggregation -- collects multiple `Golden` or `ConversationalGolden` objects into a single dataset.
- Batch evaluation support -- serves as the input to DeepEval's evaluation pipeline for running all test cases in a single operation.
- Platform synchronization -- supports pushing and pulling datasets to/from the Confident AI platform via the `confident_api_key` (see the sketch after this list).
- Flexible construction -- accepts goldens at construction time or allows them to be added programmatically.
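A minimal sketch of the push/pull round trip, using the dataset's `push` and `pull` methods; the alias string below is a placeholder:

```python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[Golden(input="task1")])

# Upload the dataset to Confident AI under an alias (placeholder name).
dataset.push(alias="my-eval-dataset")

# Later, or on another machine, retrieve the same dataset by alias.
restored = EvaluationDataset()
restored.pull(alias="my-eval-dataset")
```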
## Usage
Import and construct an evaluation dataset:

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()  # goldens defaults to an empty list
```
## Code Reference
### Source Location

- Repository: `confident-ai/deepeval`
- File: `deepeval/dataset/dataset.py` (lines 68--107)
### Signature

```python
@dataclass
class EvaluationDataset:
    goldens: Union[List[Golden], List[ConversationalGolden]] = field(
        default_factory=list
    )
    confident_api_key: Optional[str] = None
```
### Import

```python
from deepeval.dataset import EvaluationDataset
```
### Parent Class

- Python `dataclass` (not a class hierarchy; uses the `@dataclass` decorator)
## I/O Contract

### Inputs (Constructor Parameters)

| Name | Type | Default | Description |
|---|---|---|---|
| `goldens` | `Union[List[Golden], List[ConversationalGolden]]` | `[]` | List of golden test cases that make up the evaluation dataset. Can contain either standard `Golden` objects or `ConversationalGolden` objects for multi-turn evaluation. |
| `confident_api_key` | `Optional[str]` | `None` | API key for the Confident AI platform. Used for pushing and pulling datasets to/from the platform. Falls back to an environment variable if not provided. |
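The environment fallback can also be made explicit at construction time. A minimal sketch, assuming the variable is named `CONFIDENT_API_KEY` (verify the exact name against your DeepEval installation):

```python
import os

from deepeval.dataset import EvaluationDataset

# Passing the key explicitly takes precedence; CONFIDENT_API_KEY is the
# assumed environment variable name for the fallback.
dataset = EvaluationDataset(
    confident_api_key=os.environ.get("CONFIDENT_API_KEY")
)
```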
### Outputs

| Name | Type | Description |
|---|---|---|
| Dataset object | `EvaluationDataset` | A dataset instance that can be passed to DeepEval's evaluation functions for batch evaluation. |
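To illustrate the batch-evaluation path, the sketch below materializes each golden into an `LLMTestCase` and hands the results to DeepEval's `evaluate` function. `my_app` is a hypothetical application under test, and `AnswerRelevancyMetric` is just one example metric:

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def my_app(prompt: str) -> str:
    # Hypothetical LLM application under test.
    return "stub answer"

dataset = EvaluationDataset(goldens=[
    Golden(input="task1"),
    Golden(input="task2"),
])

# Run the app on each golden's input to produce concrete test cases.
test_cases = [
    LLMTestCase(
        input=golden.input,
        actual_output=my_app(golden.input),
        expected_output=golden.expected_output,
    )
    for golden in dataset.goldens
]

evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])
```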
## Usage Examples
### Example 1: Basic Dataset Construction
Create a dataset from a list of golden test cases.
```python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="task1"),
    Golden(input="task2"),
])
```
- The `goldens` parameter accepts a list of `Golden` objects.
- The resulting dataset contains two test cases ready for evaluation.
### Example 2: Dataset with Expected Outputs and Tools
Create a comprehensive evaluation dataset for agent testing.
```python
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import ToolCall

dataset = EvaluationDataset(goldens=[
    Golden(
        input="What's the weather in Tokyo?",
        expected_output="The current weather in Tokyo is...",
        expected_tools=[ToolCall(name="get_weather")],
    ),
    Golden(
        input="Calculate 15% tip on $85",
        expected_output="The tip amount is $12.75",
        expected_tools=[ToolCall(name="calculator")],
    ),
    Golden(
        input="Find recent news about AI",
        expected_output="Here are the latest AI news stories...",
        expected_tools=[ToolCall(name="search")],
    ),
])
```
- Each golden test case specifies input, expected output, and expected tool calls.
- This dataset can be used with `TaskCompletionMetric`, `ToolUseMetric`, and other agent metrics.
### Example 3: Programmatic Dataset Construction
Build a dataset programmatically from an external data source.
```python
import json

from deepeval.dataset import EvaluationDataset, Golden

with open("test_cases.json") as f:
    raw_cases = json.load(f)

goldens = [
    Golden(
        input=case["input"],
        expected_output=case.get("expected_output"),
        context=case.get("context"),
    )
    for case in raw_cases
]

dataset = EvaluationDataset(goldens=goldens)
```
- `Golden` objects are constructed from external JSON data.
- The dataset aggregates all test cases for batch evaluation.
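Because `goldens` is a plain dataclass list field, test cases can also be appended after construction, which supports the "added programmatically" path described above:

```python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset()

# goldens is a regular Python list, so standard list operations
# work for incremental dataset construction.
dataset.goldens.append(Golden(input="task added later"))
```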