Implementation:Huggingface Datasets Dataset From Pandas

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for creating a Dataset from a pandas DataFrame.

Description

Dataset.from_pandas is a class method that converts a pandas.DataFrame into a PyArrow Table and wraps it in a Dataset. Column types are inferred from the dtypes of the DataFrame's Series through PyArrow's pandas integration. For object-dtype Series, the Python objects themselves are inspected to guess the Arrow type, and such a column does not always carry enough information for a meaningful guess. When no type can be inferred, for example because the DataFrame has zero rows or a Series contains only None values, the column type is set to null; passing explicit features avoids this. The optional preserve_index parameter controls whether the DataFrame index is stored as an additional column.

Usage

Use Dataset.from_pandas when you have tabular data in a Pandas DataFrame and need to convert it for use with the HuggingFace Datasets ecosystem, including training, evaluation, or Hub uploads.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: 859-926

Signature

@classmethod
def from_pandas(
    cls,
    df: pd.DataFrame,
    features: Optional[Features] = None,
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    preserve_index: Optional[bool] = None,
) -> "Dataset":

Import

from datasets import Dataset

I/O Contract

Inputs

Name | Type | Required | Description
df | pd.DataFrame | Yes | The pandas DataFrame containing the dataset.
features | Features | No | Explicit dataset features schema, used to cast the converted table.
info | DatasetInfo | No | Dataset metadata (description, citation, etc.).
split | NamedSplit | No | Name of the dataset split.
preserve_index | bool | No | Whether to store the index as an additional column. The default, None, stores the index as a column unless it is a RangeIndex, which is kept as metadata only.

Outputs

Name | Type | Description
return | Dataset | A new in-memory Dataset backed by an Arrow table converted from the DataFrame.

Usage Examples

Basic Usage

import pandas as pd
from datasets import Dataset

df = pd.DataFrame({
    "text": ["Hello world", "Goodbye world"],
    "label": [1, 0],
})
ds = Dataset.from_pandas(df)
print(ds)
# Dataset({
#     features: ['text', 'label'],
#     num_rows: 2
# })

Related Pages

Implements Principle
