Implementation:Huggingface Datasets Dataset To Pandas
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the HuggingFace Datasets library, for converting datasets to Pandas DataFrames for interactive exploration and analysis.
Description
Dataset.to_pandas converts the dataset, or successive batches of it, into a pandas.DataFrame. In non-batched mode (the default), it queries the entire underlying Arrow table, respecting any indices mapping, and then calls Arrow's native to_pandas(types_mapper=pandas_types_mapper) to produce a single DataFrame with appropriate type conversions. In batched mode, it returns a generator that yields DataFrames of batch_size rows each, so a dataset too large to hold in memory as a single DataFrame can be processed one batch at a time. The pandas_types_mapper function, passed as types_mapper, maps Arrow types to suitable Pandas equivalents.
Usage
Call dataset.to_pandas() when you want to work with the data using Pandas' API for exploration, analysis, or visualization. Use batched=True when the dataset is too large to fit in memory as a single DataFrame.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L5159-L5196
Signature
def to_pandas(
    self, batch_size: Optional[int] = None, batched: bool = False
) -> Union[pd.DataFrame, Iterator[pd.DataFrame]]:
Import
from datasets import load_dataset
ds = load_dataset("dataset_name", split="train")
# Access as a method:
df = ds.to_pandas()
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| batch_size | int | No | The number of rows per batch when batched=True. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE. |
| batched | bool | No | If True, returns a generator yielding DataFrames of batch_size rows. Defaults to False (returns the whole dataset at once). |
Outputs
| Name | Type | Description |
|---|---|---|
| (return value, non-batched) | pandas.DataFrame | A single DataFrame containing all rows and columns of the dataset. |
| (return value, batched) | Iterator[pandas.DataFrame] | A generator yielding DataFrames of batch_size rows each. |
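The batched return value can be pictured as fixed-size slicing over the rows. The sketch below is a pure-Python illustration of that idea, not the library's implementation: iter_batches, Row, and ASSUMED_DEFAULT_BATCH_SIZE are hypothetical names, the default of 1000 is an assumed stand-in for datasets.config.DEFAULT_MAX_BATCH_SIZE, and it assumes the final batch simply holds whatever rows remain.

```python
from typing import Dict, Iterator, List, Optional

Row = Dict[str, object]

# Hypothetical stand-in for datasets.config.DEFAULT_MAX_BATCH_SIZE (value assumed).
ASSUMED_DEFAULT_BATCH_SIZE = 1000


def iter_batches(rows: List[Row], batch_size: Optional[int] = None) -> Iterator[List[Row]]:
    """Yield consecutive slices of at most batch_size rows."""
    size = ASSUMED_DEFAULT_BATCH_SIZE if batch_size is None else batch_size
    if size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(rows), size):
        # The last slice may be shorter than size.
        yield rows[start:start + size]


# Toy rows, invented for illustration only.
rows = [{"text": f"review {i}", "label": i % 2} for i in range(2500)]
batch_lengths = [len(batch) for batch in iter_batches(rows, batch_size=1000)]
print(batch_lengths)  # [1000, 1000, 500]
```

Because the function is a generator, only one slice is produced at a time, which is what makes the batched mode suitable for data that should not be materialized all at once.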
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
# Convert entire dataset to a Pandas DataFrame
df = ds.to_pandas()
print(df.head())
#                                                 text  label
# 0  the rock is destined to be the 21st century's ...      1
# 1  the gorgeously elaborate continuation of " the...      1
# 2                     effective but too-tepid biopic      1
# 3  if you sometimes like to go to the movies to h...      1
# 4  emerges as something rare , an issue movie tha...      1
# Use Pandas API for analysis
print(df["label"].value_counts())
print(df.describe())
Batched Conversion for Large Datasets
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
# Process in batches for memory-efficient analysis
for batch_df in ds.to_pandas(batch_size=1000, batched=True):
    print(f"Batch shape: {batch_df.shape}")
    # Process each batch independently