Implementation:Huggingface Datasets Dataset From Pandas
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for creating a Dataset from a pandas DataFrame.
Description
Dataset.from_pandas is a class method that converts a pandas.DataFrame into a PyArrow Table and wraps it in a Dataset. Column types are inferred from the DataFrame's dtypes through PyArrow's pandas integration. For object-typed Series, the Python objects themselves are inspected to guess the Arrow type. When no type can be inferred, for example because the DataFrame is empty or a Series contains only None/NaN values, the column type is set to null unless explicit features are provided. The optional preserve_index parameter controls whether the DataFrame index is stored as a column.
Usage
Use Dataset.from_pandas when you have tabular data in a Pandas DataFrame and need to convert it for use with the HuggingFace Datasets ecosystem, including training, evaluation, or Hub uploads.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: 859-926
Signature
@classmethod
def from_pandas(
    cls,
    df: pd.DataFrame,
    features: Optional[Features] = None,
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    preserve_index: Optional[bool] = None,
) -> "Dataset":
Import
from datasets import Dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| df | pd.DataFrame | Yes | The Pandas DataFrame containing the dataset. |
| features | Features | No | Explicit dataset features schema for type casting. |
| info | DatasetInfo | No | Dataset metadata (description, citation, etc.). |
| split | NamedSplit | No | Name of the dataset split. |
| preserve_index | bool | No | Whether to store the index as a column. The default None stores every index as a column except a RangeIndex, which is kept as metadata only. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Dataset | A new in-memory Dataset backed by an Arrow table converted from the DataFrame. |
Usage Examples
Basic Usage
import pandas as pd
from datasets import Dataset

df = pd.DataFrame({
    "text": ["Hello world", "Goodbye world"],
    "label": [1, 0],
})

ds = Dataset.from_pandas(df)
print(ds)
# Dataset({
#     features: ['text', 'label'],
#     num_rows: 2
# })