Implementation: Arize AI Phoenix Datasets - Create Dataset
| Knowledge Sources | Details |
|---|---|
| Domains | AI Observability, Dataset Management, Evaluation Infrastructure |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete tools for dataset creation, retrieval, listing, and augmentation provided by the Phoenix Client library.
Description
The Phoenix Client Datasets resource provides a suite of methods for managing versioned datasets on a Phoenix server. The primary methods are:
- create_dataset(): Creates a new dataset by uploading examples from dictionaries, pandas DataFrames, or CSV files. Supports column-to-field mapping via key parameters and optional span ID linking for traceability.
- get_dataset(): Retrieves a specific dataset with all its examples, supporting version pinning and split filtering.
- list(): Lists all available datasets with automatic cursor-based pagination handling.
- add_examples_to_dataset(): Appends new examples to an existing dataset, creating a new version.
All methods return or operate on Dataset objects, which provide a rich interface including iteration, indexing, DataFrame conversion, and dictionary serialization.
Usage
Use these methods when you need to programmatically create evaluation datasets from application data, retrieve existing datasets for experiment execution, enumerate available datasets, or incrementally build datasets by appending new examples as they are discovered or generated.
Code Reference
Source Location
- Repository: Phoenix
- File: packages/phoenix-client/src/phoenix/client/resources/datasets/__init__.py
- Dataset class: Lines 75-303
- create_dataset: Lines 746-876
- get_dataset: Lines 483-567
- list: Lines 672-744
- add_examples_to_dataset: Lines 878-1002
Signature
create_dataset:
def create_dataset(
self,
*,
name: str,
examples: Optional[Union[Mapping[str, Any], Iterable[Mapping[str, Any]]]] = None,
dataframe: Optional["pd.DataFrame"] = None,
csv_file_path: Optional[Union[str, Path]] = None,
input_keys: Iterable[str] = (),
output_keys: Iterable[str] = (),
metadata_keys: Iterable[str] = (),
split_keys: Iterable[str] = (),
span_id_key: Optional[str] = None,
inputs: Iterable[Mapping[str, Any]] = (),
outputs: Iterable[Mapping[str, Any]] = (),
metadata: Iterable[Mapping[str, Any]] = (),
dataset_description: Optional[str] = None,
timeout: Optional[int] = 5,
) -> Dataset
get_dataset:
def get_dataset(
self,
*,
dataset: DatasetIdentifier,
version_id: Optional[str] = None,
splits: Optional[Sequence[str]] = None,
timeout: Optional[int] = 5,
) -> Dataset
list:
def list(
self,
*,
limit: Optional[int] = None,
timeout: Optional[float] = 5,
) -> list[v1.Dataset]
add_examples_to_dataset:
def add_examples_to_dataset(
self,
*,
dataset: DatasetIdentifier,
examples: Optional[Union[Mapping[str, Any], Iterable[Mapping[str, Any]]]] = None,
dataframe: Optional["pd.DataFrame"] = None,
csv_file_path: Optional[Union[str, Path]] = None,
input_keys: Iterable[str] = (),
output_keys: Iterable[str] = (),
metadata_keys: Iterable[str] = (),
split_keys: Iterable[str] = (),
span_id_key: Optional[str] = None,
inputs: Iterable[Mapping[str, Any]] = (),
outputs: Iterable[Mapping[str, Any]] = (),
metadata: Iterable[Mapping[str, Any]] = (),
timeout: Optional[int] = 5,
) -> Dataset
Import
from phoenix.client import Client
client = Client()
# Access dataset methods via client.datasets
client.datasets.create_dataset(...)
client.datasets.get_dataset(...)
client.datasets.list()
client.datasets.add_examples_to_dataset(...)
I/O Contract
Inputs (create_dataset)
| Name | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Name of the dataset to create. |
| examples | Optional[Union[Mapping[str, Any], Iterable[Mapping[str, Any]]]] | No | Single dict or iterable of dicts, each with required input and output keys and optional metadata key. |
| dataframe | Optional[pd.DataFrame] | No | Pandas DataFrame containing example data. Requires pandas to be installed. |
| csv_file_path | Optional[Union[str, Path]] | No | Path to a CSV file containing example data. |
| input_keys | Iterable[str] | No | Column names in the tabular data to map to the input field. Default: (). |
| output_keys | Iterable[str] | No | Column names in the tabular data to map to the output field. Default: (). |
| metadata_keys | Iterable[str] | No | Column names in the tabular data to map to the metadata field. Default: (). |
| split_keys | Iterable[str] | No | Column names used for automatic split assignment. Default: (). |
| span_id_key | Optional[str] | No | Column name containing OTEL span IDs to link examples to traces. |
| inputs | Iterable[Mapping[str, Any]] | No | List of input dicts, one per example. Alternative to examples parameter. |
| outputs | Iterable[Mapping[str, Any]] | No | List of output dicts, one per example. Alternative to examples parameter. |
| metadata | Iterable[Mapping[str, Any]] | No | List of metadata dicts, one per example. Alternative to examples parameter. |
| dataset_description | Optional[str] | No | Human-readable description for the dataset. |
| timeout | Optional[int] | No | Request timeout in seconds. Default: 5. |
Inputs (get_dataset)
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | DatasetIdentifier | Yes | Dataset ID string, name string, Dataset object, or dict with id/name fields. |
| version_id | Optional[str] | No | Specific version ID. If None, returns the latest version. |
| splits | Optional[Sequence[str]] | No | List of split names to filter by. If provided, only returns matching examples. |
| timeout | Optional[int] | No | Request timeout in seconds. Default: 5. |
Outputs
| Name | Type | Description |
|---|---|---|
| Dataset | Dataset | Dataset object containing complete metadata and all examples for the requested version. |
Dataset Object Properties
| Property | Type | Description |
|---|---|---|
| id | str | The unique dataset identifier. |
| name | str | The dataset name. |
| description | Optional[str] | The dataset description. |
| version_id | str | The current version identifier. |
| examples | list[DatasetExample] | List of all examples in this version. |
| metadata | dict[str, Any] | Additional dataset metadata. |
| created_at | Optional[datetime] | When the dataset was created. |
| updated_at | Optional[datetime] | When the dataset was last updated. |
| example_count | int | Number of examples in this version. |
Dataset Object Methods
| Method | Return Type | Description |
|---|---|---|
| to_dataframe() | pd.DataFrame | Converts examples to a DataFrame indexed by example_id with columns: input, output, metadata. |
| to_dict() | dict[str, Any] | Converts the dataset to a JSON-serializable dictionary. |
| from_dict(json_data) | Dataset | Class method that creates a Dataset from a dictionary (e.g., from to_dict() output). |
| __len__() | int | Returns the number of examples. |
| __iter__() | Iterator[DatasetExample] | Iterates over examples. |
| __getitem__(index) | DatasetExample | Returns example by index. |
Usage Examples
Create a Dataset from Dictionaries
from phoenix.client import Client
client = Client()
# Create a dataset using structured example dicts
dataset = client.datasets.create_dataset(
name="qa-benchmark",
examples=[
{
"input": {"question": "What is AI?"},
"output": {"answer": "Artificial Intelligence is..."},
"metadata": {"category": "definition"},
},
{
"input": {"question": "Explain ML."},
"output": {"answer": "Machine Learning is..."},
"metadata": {"category": "definition"},
},
],
dataset_description="Question-answering benchmark dataset",
)
print(f"Created dataset: {dataset.name} with {len(dataset)} examples")
Create a Dataset from a DataFrame with Span Links
import pandas as pd
from phoenix.client import Client
client = Client()
df = pd.DataFrame({
"question": ["What is AI?", "Explain ML"],
"answer": ["Artificial Intelligence is...", "Machine Learning is..."],
"context.span_id": ["abc123", "def456"],
})
dataset = client.datasets.create_dataset(
name="traced-dataset",
dataframe=df,
input_keys=["question"],
output_keys=["answer"],
span_id_key="context.span_id",
)
Retrieve and Iterate Over a Dataset
from phoenix.client import Client
client = Client()
# Get dataset by name (latest version)
dataset = client.datasets.get_dataset(dataset="qa-benchmark")
# Iterate over examples
for example in dataset:
print(f"Input: {example['input']}, Output: {example['output']}")
# Get a specific version
versioned = client.datasets.get_dataset(
dataset="qa-benchmark",
version_id="version-abc123",
)
# Filter by splits
train_data = client.datasets.get_dataset(
dataset="qa-benchmark",
splits=["train"],
)
List All Datasets
from phoenix.client import Client
client = Client()
# List all datasets (automatic pagination)
all_datasets = client.datasets.list()
for ds in all_datasets:
print(f"{ds['name']}: {ds['example_count']} examples")
# List with a limit
limited = client.datasets.list(limit=10)
Append Examples to an Existing Dataset
from phoenix.client import Client
client = Client()
# Append new examples to create a new version
updated = client.datasets.add_examples_to_dataset(
dataset="qa-benchmark",
examples=[
{
"input": {"question": "What is deep learning?"},
"output": {"answer": "Deep learning is a subset of ML..."},
},
],
)
print(f"Updated dataset version: {updated.version_id}")
print(f"Total examples: {len(updated)}")
Convert Dataset to DataFrame
from phoenix.client import Client
client = Client()
dataset = client.datasets.get_dataset(dataset="qa-benchmark")
df = dataset.to_dataframe()
print(df.columns) # Index(['input', 'output', 'metadata'], dtype='object')
print(df.index.name) # example_id