Implementation:Huggingface Datasets Dataset To Pandas

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for converting datasets to Pandas DataFrames for interactive exploration and analysis.

Description

Dataset.to_pandas converts the dataset, or successive slices of it, into a pandas.DataFrame. In non-batched mode (the default), it queries the entire underlying Arrow table, respecting any indices mapping, then calls Arrow's native to_pandas(types_mapper=pandas_types_mapper) to produce a single DataFrame. In batched mode, it returns a generator that yields DataFrames of up to batch_size rows each (the final batch may be shorter), so a dataset that does not fit in memory as a single DataFrame can be processed piece by piece. The pandas_types_mapper callback maps Arrow types to suitable Pandas equivalents.
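
The batched mode described above can be pictured as slicing the rows into fixed-size ranges. The following is a minimal, library-free sketch of that idea, not the library's code: `iter_row_batches` is a hypothetical helper, and plain Python lists stand in for the Arrow table and the resulting DataFrames.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_row_batches(rows: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `batch_size` rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(rows), batch_size):
        # The final slice may hold fewer than batch_size rows.
        yield list(rows[start:start + batch_size])


rows = list(range(10))
batches = list(iter_row_batches(rows, batch_size=4))
batch_lengths = [len(batch) for batch in batches]
```

Because the helper is a generator, only one slice is materialized at a time, which is the property that makes batched conversion useful for large datasets.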

Usage

Call dataset.to_pandas() when you want to work with the data using Pandas' API for exploration, analysis, or visualization. Use batched=True when the dataset is too large to fit in memory as a single DataFrame.

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_dataset.py
Lines: L5159-L5196

Signature

def to_pandas(
    self, batch_size: Optional[int] = None, batched: bool = False
) -> Union[pd.DataFrame, Iterator[pd.DataFrame]]:

Import

from datasets import load_dataset
ds = load_dataset("dataset_name", split="train")
# Access as a method:
df = ds.to_pandas()

I/O Contract

Inputs

Name	Type	Required	Description
batch_size	`int`	No	The number of rows per batch when `batched=True`. Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`.
batched	`bool`	No	If `True`, returns a generator yielding DataFrames of up to `batch_size` rows each. Defaults to `False` (returns the whole dataset at once).

Outputs

Name	Type	Description
(return value, non-batched)	`pandas.DataFrame`	A single DataFrame containing all rows and columns of the dataset.
(return value, batched)	`Iterator[pandas.DataFrame]`	A generator yielding DataFrames of up to `batch_size` rows each; the final batch may be shorter.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

# Convert entire dataset to a Pandas DataFrame
df = ds.to_pandas()
print(df.head())
#                                                 text  label
# 0  the rock is destined to be the 21st century's ...      1
# 1  the gorgeously elaborate continuation of " the...      1
# 2  effective but too-tepid biopic                         1
# 3  if you sometimes like to go to the movies to h...      1
# 4  emerges as something rare , an issue movie tha...      1

# Use Pandas API for analysis
print(df["label"].value_counts())
print(df.describe())

Batched Conversion for Large Datasets

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

# Process in batches for memory-efficient analysis
for batch_df in ds.to_pandas(batch_size=1000, batched=True):
    print(f"Batch shape: {batch_df.shape}")
    # Process each batch independently

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Pandas_Conversion
