
Implementation:Arize ai Phoenix Datasets Create Dataset

From Leeroopedia
Knowledge Sources
Domains AI Observability, Dataset Management, Evaluation Infrastructure
Last Updated 2026-02-14 00:00 GMT

Overview

Concrete tools for dataset creation, retrieval, listing, and augmentation, provided by the Phoenix Client library.

Description

The Phoenix Client Datasets resource provides a suite of methods for managing versioned datasets on a Phoenix server. The primary methods are:

  • create_dataset(): Creates a new dataset by uploading examples from dictionaries, pandas DataFrames, or CSV files. Supports column-to-field mapping via key parameters and optional span ID linking for traceability.
  • get_dataset(): Retrieves a specific dataset with all its examples, supporting version pinning and split filtering.
  • list(): Lists all available datasets with automatic cursor-based pagination handling.
  • add_examples_to_dataset(): Appends new examples to an existing dataset, creating a new version.

All methods return or operate on Dataset objects, which provide a rich interface including iteration, indexing, DataFrame conversion, and dictionary serialization.

Usage

Use these methods to programmatically create evaluation datasets from application data, retrieve existing datasets for experiment execution, or enumerate the datasets available on a server. They also support incrementally building a dataset by appending new examples as they are discovered or generated.

Code Reference

Source Location

  • Repository: Phoenix
  • File: packages/phoenix-client/src/phoenix/client/resources/datasets/__init__.py
  • Dataset class: Lines 75-303
  • create_dataset: Lines 746-876
  • get_dataset: Lines 483-567
  • list: Lines 672-744
  • add_examples_to_dataset: Lines 878-1002

Signature

create_dataset:

def create_dataset(
    self,
    *,
    name: str,
    examples: Optional[Union[Mapping[str, Any], Iterable[Mapping[str, Any]]]] = None,
    dataframe: Optional["pd.DataFrame"] = None,
    csv_file_path: Optional[Union[str, Path]] = None,
    input_keys: Iterable[str] = (),
    output_keys: Iterable[str] = (),
    metadata_keys: Iterable[str] = (),
    split_keys: Iterable[str] = (),
    span_id_key: Optional[str] = None,
    inputs: Iterable[Mapping[str, Any]] = (),
    outputs: Iterable[Mapping[str, Any]] = (),
    metadata: Iterable[Mapping[str, Any]] = (),
    dataset_description: Optional[str] = None,
    timeout: Optional[int] = 5,
) -> Dataset

get_dataset:

def get_dataset(
    self,
    *,
    dataset: DatasetIdentifier,
    version_id: Optional[str] = None,
    splits: Optional[Sequence[str]] = None,
    timeout: Optional[int] = 5,
) -> Dataset

list:

def list(
    self,
    *,
    limit: Optional[int] = None,
    timeout: Optional[float] = 5,
) -> list[v1.Dataset]

add_examples_to_dataset:

def add_examples_to_dataset(
    self,
    *,
    dataset: DatasetIdentifier,
    examples: Optional[Union[Mapping[str, Any], Iterable[Mapping[str, Any]]]] = None,
    dataframe: Optional["pd.DataFrame"] = None,
    csv_file_path: Optional[Union[str, Path]] = None,
    input_keys: Iterable[str] = (),
    output_keys: Iterable[str] = (),
    metadata_keys: Iterable[str] = (),
    split_keys: Iterable[str] = (),
    span_id_key: Optional[str] = None,
    inputs: Iterable[Mapping[str, Any]] = (),
    outputs: Iterable[Mapping[str, Any]] = (),
    metadata: Iterable[Mapping[str, Any]] = (),
    timeout: Optional[int] = 5,
) -> Dataset

Import

from phoenix.client import Client

client = Client()
# Access dataset methods via client.datasets
client.datasets.create_dataset(...)
client.datasets.get_dataset(...)
client.datasets.list()
client.datasets.add_examples_to_dataset(...)

I/O Contract

Inputs (create_dataset)

Name Type Required Description
name str Yes Name of the dataset to create.
examples Optional[Union[Mapping[str, Any], Iterable[Mapping[str, Any]]]] No Single dict or iterable of dicts, each with required input and output keys and optional metadata key.
dataframe Optional[pd.DataFrame] No Pandas DataFrame containing example data. Requires pandas to be installed.
csv_file_path Optional[Union[str, Path]] No Path to a CSV file containing example data.
input_keys Iterable[str] No Column names in the tabular data to map to the input field. Default: ().
output_keys Iterable[str] No Column names in the tabular data to map to the output field. Default: ().
metadata_keys Iterable[str] No Column names in the tabular data to map to the metadata field. Default: ().
split_keys Iterable[str] No Column names used for automatic split assignment. Default: ().
span_id_key Optional[str] No Column name containing OTEL span IDs to link examples to traces.
inputs Iterable[Mapping[str, Any]] No List of input dicts, one per example. Alternative to examples parameter.
outputs Iterable[Mapping[str, Any]] No List of output dicts, one per example. Alternative to examples parameter.
metadata Iterable[Mapping[str, Any]] No List of metadata dicts, one per example. Alternative to examples parameter.
dataset_description Optional[str] No Human-readable description for the dataset.
timeout Optional[int] No Request timeout in seconds. Default: 5.
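The parallel `inputs`/`outputs`/`metadata` iterables offer an alternative to the `examples` parameter when your data is already column-oriented: each iterable supplies one dict per example, aligned by position. A minimal sketch of this pattern (the dataset name is hypothetical, and the upload itself requires the phoenix-client package plus a running Phoenix server, so it is left commented):

```python
def upload_columns(inputs, outputs, metadata):
    # Requires the phoenix-client package and a running Phoenix server.
    from phoenix.client import Client

    client = Client()
    return client.datasets.create_dataset(
        name="qa-benchmark-columns",  # hypothetical dataset name
        inputs=inputs,
        outputs=outputs,
        metadata=metadata,
    )

# Column-oriented source data.
questions = ["What is AI?", "Explain ML."]
answers = ["Artificial Intelligence is...", "Machine Learning is..."]

# One dict per example, aligned by position across the three iterables.
inputs = [{"question": q} for q in questions]
outputs = [{"answer": a} for a in answers]
metadata = [{"category": "definition"} for _ in questions]

# dataset = upload_columns(inputs, outputs, metadata)  # uncomment with a live server
```

The three iterables must be the same length; mismatched lengths would leave examples without a corresponding output or metadata entry.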

Inputs (get_dataset)

Name Type Required Description
dataset DatasetIdentifier Yes Dataset ID string, name string, Dataset object, or dict with id/name fields.
version_id Optional[str] No Specific version ID. If None, returns the latest version.
splits Optional[Sequence[str]] No List of split names to filter by. If provided, only returns matching examples.
timeout Optional[int] No Request timeout in seconds. Default: 5.

Outputs

Name Type Description
Dataset Dataset Dataset object containing complete metadata and all examples for the requested version.

Dataset Object Properties

Property Type Description
id str The unique dataset identifier.
name str The dataset name.
description Optional[str] The dataset description.
version_id str The current version identifier.
examples list[DatasetExample] List of all examples in this version.
metadata dict[str, Any] Additional dataset metadata.
created_at Optional[datetime] When the dataset was created.
updated_at Optional[datetime] When the dataset was last updated.
example_count int Number of examples in this version.

Dataset Object Methods

Method Return Type Description
to_dataframe() pd.DataFrame Converts examples to a DataFrame indexed by example_id with columns: input, output, metadata.
to_dict() dict[str, Any] Converts the dataset to a JSON-serializable dictionary.
from_dict(json_data) Dataset Class method that creates a Dataset from a dictionary (e.g., from to_dict() output).
__len__() int Returns the number of examples.
__iter__() Iterator[DatasetExample] Iterates over examples.
__getitem__(index) DatasetExample Returns example by index.
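Together, `to_dict()` and `from_dict()` allow a dataset version to be snapshotted to disk and rebuilt later without a server round trip. A minimal sketch (the `snapshot`/`restore` helper names are hypothetical, and the `Dataset` import path is an assumption based on the Source Location above):

```python
import json
from pathlib import Path

def snapshot(dataset, path):
    # Persist a JSON-serializable snapshot of a dataset version to disk.
    Path(path).write_text(json.dumps(dataset.to_dict()))

def restore(path):
    # Rebuild a Dataset object from a saved snapshot, offline.
    # Assumption: Dataset is importable from the module listed under
    # Source Location above.
    from phoenix.client.resources.datasets import Dataset

    return Dataset.from_dict(json.loads(Path(path).read_text()))
```

Because datasets are versioned, caching a pinned version this way can make experiment runs reproducible even if the server-side dataset later gains new examples.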

Usage Examples

Create a Dataset from Dictionaries

from phoenix.client import Client

client = Client()

# Create a dataset using structured example dicts
dataset = client.datasets.create_dataset(
    name="qa-benchmark",
    examples=[
        {
            "input": {"question": "What is AI?"},
            "output": {"answer": "Artificial Intelligence is..."},
            "metadata": {"category": "definition"},
        },
        {
            "input": {"question": "Explain ML."},
            "output": {"answer": "Machine Learning is..."},
            "metadata": {"category": "definition"},
        },
    ],
    dataset_description="Question-answering benchmark dataset",
)

print(f"Created dataset: {dataset.name} with {len(dataset)} examples")

Create a Dataset from a DataFrame with Span Links

import pandas as pd
from phoenix.client import Client

client = Client()

df = pd.DataFrame({
    "question": ["What is AI?", "Explain ML"],
    "answer": ["Artificial Intelligence is...", "Machine Learning is..."],
    "context.span_id": ["abc123", "def456"],
})

dataset = client.datasets.create_dataset(
    name="traced-dataset",
    dataframe=df,
    input_keys=["question"],
    output_keys=["answer"],
    span_id_key="context.span_id",
)
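A CSV file works the same way as a DataFrame: the same column-to-field mapping parameters apply, including `split_keys` for automatic split assignment. A sketch that writes a small CSV and prepares the upload (the file name, dataset name, and `upload_csv` helper are assumptions; the upload itself needs a live Phoenix server, so it is left commented):

```python
import csv

def write_examples_csv(path):
    # Write a small CSV with question/answer columns plus a split column.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["question", "answer", "split"])
        writer.writeheader()
        writer.writerow({"question": "What is AI?",
                         "answer": "Artificial Intelligence is...",
                         "split": "train"})
        writer.writerow({"question": "Explain ML",
                         "answer": "Machine Learning is...",
                         "split": "test"})

def upload_csv(path):
    # Requires the phoenix-client package and a running Phoenix server.
    from phoenix.client import Client

    client = Client()
    return client.datasets.create_dataset(
        name="csv-dataset",       # hypothetical dataset name
        csv_file_path=path,
        input_keys=["question"],
        output_keys=["answer"],
        split_keys=["split"],     # assigns each row to its named split
    )

write_examples_csv("examples.csv")
# dataset = upload_csv("examples.csv")  # uncomment with a live server
```

Split assignment at creation time pairs naturally with the `splits` filter on `get_dataset()` shown below.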

Retrieve and Iterate Over a Dataset

from phoenix.client import Client

client = Client()

# Get dataset by name (latest version)
dataset = client.datasets.get_dataset(dataset="qa-benchmark")

# Iterate over examples
for example in dataset:
    print(f"Input: {example['input']}, Output: {example['output']}")

# Get a specific version
versioned = client.datasets.get_dataset(
    dataset="qa-benchmark",
    version_id="version-abc123",
)

# Filter by splits
train_data = client.datasets.get_dataset(
    dataset="qa-benchmark",
    splits=["train"],
)

List All Datasets

from phoenix.client import Client

client = Client()

# List all datasets (automatic pagination)
all_datasets = client.datasets.list()
for ds in all_datasets:
    print(f"{ds['name']}: {ds['example_count']} examples")

# List with a limit
limited = client.datasets.list(limit=10)

Append Examples to an Existing Dataset

from phoenix.client import Client

client = Client()

# Append new examples to create a new version
updated = client.datasets.add_examples_to_dataset(
    dataset="qa-benchmark",
    examples=[
        {
            "input": {"question": "What is deep learning?"},
            "output": {"answer": "Deep learning is a subset of ML..."},
        },
    ],
)

print(f"Updated dataset version: {updated.version_id}")
print(f"Total examples: {len(updated)}")

Convert Dataset to DataFrame

from phoenix.client import Client

client = Client()

dataset = client.datasets.get_dataset(dataset="qa-benchmark")
df = dataset.to_dataframe()
print(df.columns)   # Index(['input', 'output', 'metadata'], dtype='object')
print(df.index.name) # example_id
