Implementation:Huggingface Datasets Dataset To Pandas

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for converting datasets to Pandas DataFrames for interactive exploration and analysis.

Description

Dataset.to_pandas converts the dataset, or successive portions of it, into a pandas.DataFrame. In non-batched mode (the default), it queries the entire underlying Arrow table, respecting any indices mapping, and then calls Arrow's native to_pandas(types_mapper=pandas_types_mapper) to produce a single DataFrame. In batched mode, it returns a generator that yields DataFrames of up to batch_size rows each (the final batch may be smaller), so a large dataset can be processed without materializing it as one DataFrame in memory. The pandas_types_mapper argument controls how Arrow types are mapped to their Pandas equivalents.
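The batched mode described above can be pictured as cutting the rows into consecutive windows. The following is a minimal pure-Python sketch of that idea, not the library's code: iter_batches is a hypothetical helper introduced only for illustration, and it works on a plain list rather than an Arrow table.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(rows: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `batch_size` rows.

    Hypothetical helper for illustration only; it is not part of `datasets`.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Step through the rows in strides of batch_size; the last slice is
    # shorter whenever len(rows) is not a multiple of batch_size.
    for offset in range(0, len(rows), batch_size):
        yield list(rows[offset : offset + batch_size])


batch_lengths = [len(batch) for batch in iter_batches(list(range(10)), batch_size=4)]
print(batch_lengths)  # [4, 4, 2]
```

Because the helper is a generator, only one window is materialized at a time, which is the property that makes batched conversion useful for large datasets.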

Usage

Call dataset.to_pandas() when you want to work with the data using Pandas' API for exploration, analysis, or visualization. Use batched=True when the dataset is too large to fit in memory as a single DataFrame.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L5159-L5196

Signature

def to_pandas(
    self, batch_size: Optional[int] = None, batched: bool = False
) -> Union[pd.DataFrame, Iterator[pd.DataFrame]]:

Import

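from datasets import load_dataset
# "dataset_name" is a placeholder for a real dataset identifier
ds = load_dataset("dataset_name", split="train")
# to_pandas is a method of the loaded Dataset, not a separate import:
df = ds.to_pandas()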

I/O Contract

Inputs

Name | Type | Required | Description
batch_size | Optional[int] | No | Maximum number of rows per batch when batched=True. If omitted, falls back to datasets.config.DEFAULT_MAX_BATCH_SIZE.
batched | bool | No | If True, returns a generator yielding DataFrames of up to batch_size rows. Defaults to False (returns the whole dataset at once).

Outputs

Name | Type | Description
(return value, non-batched) | pandas.DataFrame | A single DataFrame containing all rows and columns of the dataset.
(return value, batched) | Iterator[pandas.DataFrame] | A generator yielding DataFrames of up to batch_size rows each; the final batch may be smaller.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

# Convert entire dataset to a Pandas DataFrame
df = ds.to_pandas()
print(df.head())
#                                                 text  label
# 0  the rock is destined to be the 21st century's ...      1
# 1  the gorgeously elaborate continuation of " the...      1
# 2  effective but too-tepid biopic                          1
# 3  if you sometimes like to go to the movies to h...      1
# 4  emerges as something rare , an issue movie tha...      1

# Use Pandas API for analysis
print(df["label"].value_counts())
print(df.describe())

Batched Conversion for Large Datasets

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

# Process in batches for memory-efficient analysis
for batch_df in ds.to_pandas(batch_size=1000, batched=True):
    print(f"Batch shape: {batch_df.shape}")
    # Process each batch independently
