Implementation:Confident ai Deepeval EvaluationDataset

Last Updated 2026-02-14 09:00 GMT

Overview

EvaluationDataset is a concrete dataset class that aggregates golden test cases into a reusable evaluation collection. It is a Python dataclass that holds a list of Golden (or ConversationalGolden) objects and provides methods for dataset management, including synchronization with the Confident AI platform.

Description

The EvaluationDataset serves as the primary container for organizing evaluation test cases in DeepEval. It aggregates golden objects and provides the interface for batch evaluation, dataset persistence, and platform synchronization.

Key capabilities:

  • Golden aggregation -- collects multiple Golden or ConversationalGolden objects into a single dataset.
  • Batch evaluation support -- serves as the input to DeepEval's evaluation pipeline for running all test cases in a single operation.
  • Platform synchronization -- supports pushing and pulling datasets to/from the Confident AI platform via the confident_api_key (see the sketch after this list).
  • Flexible construction -- accepts goldens at construction time or allows them to be added programmatically.
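
As referenced above, the push/pull workflow is illustrated by the following minimal sketch. The alias value is an assumption for illustration, and both calls require a configured Confident AI account (via deepeval login or an API key):

from deepeval.dataset import EvaluationDataset, Golden

# Push local goldens to Confident AI under an alias.
dataset = EvaluationDataset(goldens=[Golden(input="task1")])
dataset.push(alias="My Evals")  # "My Evals" is an illustrative alias

# Later (e.g. in CI), rehydrate the same dataset by alias.
dataset = EvaluationDataset()
dataset.pull(alias="My Evals")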

Usage

Import and construct an evaluation dataset:

from deepeval.dataset import EvaluationDataset
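
Because goldens is an ordinary dataclass list field, a dataset can also start empty and be populated programmatically. A minimal sketch:

from deepeval.dataset import EvaluationDataset, Golden

# Construct empty, then append goldens one at a time.
dataset = EvaluationDataset()
dataset.goldens.append(Golden(input="task1"))
dataset.goldens.append(Golden(input="task2"))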

Code Reference

Source Location

  • Repository: confident-ai/deepeval
  • File: deepeval/dataset/dataset.py (lines 68--107)

Signature

@dataclass
class EvaluationDataset:
    goldens: Union[List[Golden], List[ConversationalGolden]] = field(default_factory=list)
    confident_api_key: Optional[str] = None

Import

from deepeval.dataset import EvaluationDataset

Parent Class

  • None -- EvaluationDataset is a plain Python dataclass (declared with the @dataclass decorator) and does not inherit from a base class.

I/O Contract

Inputs (Constructor Parameters)

Input Contract

  • goldens (Union[List[Golden], List[ConversationalGolden]], default: empty list) -- List of golden test cases that make up the evaluation dataset; can contain either standard Golden objects or ConversationalGolden objects for multi-turn evaluation.
  • confident_api_key (Optional[str], default: None) -- API key for the Confident AI platform, used for pushing and pulling datasets to/from the platform; falls back to an environment variable if not provided.
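
For multi-turn evaluation, the same constructor accepts ConversationalGolden objects in place of Golden. A minimal sketch, assuming ConversationalGolden's scenario and expected_outcome fields:

from deepeval.dataset import EvaluationDataset, ConversationalGolden

dataset = EvaluationDataset(goldens=[
    ConversationalGolden(
        scenario="User asks for a refund on a delayed order",
        expected_outcome="Agent apologizes and initiates the refund",
    ),
])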

Outputs

Output Contract

  • Dataset object (EvaluationDataset) -- A dataset instance that can be passed to DeepEval's evaluation functions for batch evaluation.
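
A typical downstream flow runs the application once per golden, wraps each result in an LLMTestCase, and hands the collected test cases to deepeval's evaluate function. A minimal sketch, where my_app is a hypothetical stand-in for the system under test:

from deepeval import evaluate
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset(goldens=[Golden(input="task1")])

for golden in dataset.goldens:
    actual = my_app(golden.input)  # my_app: placeholder for your LLM app
    dataset.add_test_case(
        LLMTestCase(input=golden.input, actual_output=actual)
    )

evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])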

Usage Examples

Example 1: Basic Dataset Construction

Create a dataset from a list of golden test cases.

from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="task1"),
    Golden(input="task2"),
])
  • The goldens parameter accepts a list of Golden objects.
  • The resulting dataset contains two test cases ready for evaluation.

Example 2: Dataset with Expected Outputs and Tools

Create a comprehensive evaluation dataset for agent testing.

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import ToolCall

dataset = EvaluationDataset(goldens=[
    Golden(
        input="What's the weather in Tokyo?",
        expected_output="The current weather in Tokyo is...",
        expected_tools=[ToolCall(name="get_weather")],
    ),
    Golden(
        input="Calculate 15% tip on $85",
        expected_output="The tip amount is $12.75",
        expected_tools=[ToolCall(name="calculator")],
    ),
    Golden(
        input="Find recent news about AI",
        expected_output="Here are the latest AI news stories...",
        expected_tools=[ToolCall(name="search")],
    ),
])
  • Each golden test case specifies input, expected output, and expected tool calls.
  • This dataset can be used with agent-oriented metrics such as TaskCompletionMetric and ToolCorrectnessMetric (see the sketch below).
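
One way to score such a dataset is to run the agent per golden, record the tools it actually called, and let a tool metric compare them against expected_tools. A minimal sketch, where run_agent is a hypothetical stand-in returning the final answer and the ToolCall objects the agent invoked:

from deepeval import evaluate
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase

test_cases = []
for golden in dataset.goldens:
    # run_agent: placeholder for your agent under test.
    answer, tools_called = run_agent(golden.input)
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=answer,
            expected_tools=golden.expected_tools,
            tools_called=tools_called,
        )
    )

evaluate(test_cases=test_cases, metrics=[ToolCorrectnessMetric()])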

Example 3: Programmatic Dataset Construction

Build a dataset programmatically from an external data source.

import json
from deepeval.dataset import EvaluationDataset, Golden

with open("test_cases.json") as f:
    raw_cases = json.load(f)

goldens = [
    Golden(
        input=case["input"],
        expected_output=case.get("expected_output"),
        context=case.get("context"),
    )
    for case in raw_cases
]

dataset = EvaluationDataset(goldens=goldens)
  • Golden objects are constructed from external JSON data.
  • The dataset aggregates all test cases for batch evaluation.
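
For reference, the loader above expects records shaped roughly like the following test_cases.json. This format is an illustrative assumption, not one mandated by DeepEval; note that Golden's context field takes a list of strings:

[
  {
    "input": "What's the weather in Tokyo?",
    "expected_output": "The current weather in Tokyo is...",
    "context": ["Weather API returns current conditions by city."]
  },
  {"input": "Calculate 15% tip on $85"}
]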
