Implementation:Huggingface Datasets Dataset Set Format

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, ML_Preprocessing
Last Updated	2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for setting a dataset's output format in place.

Description

The set_format method configures the dataset's __getitem__ return format in-place. After calling this method, accessing dataset elements returns data in the specified framework format (e.g., NumPy arrays, PyTorch tensors, TensorFlow tensors, JAX arrays, Arrow tables, pandas DataFrames, or Polars DataFrames). The formatting is applied on-the-fly during data access and does not modify the underlying Arrow data. You can optionally restrict which columns are included in the formatted output. The format can be reset to Python objects by calling set_format() with no arguments or using reset_format().

Usage

Use Dataset.set_format when you want to configure a dataset for a training loop that expects a specific tensor format, and you want to set the format once on the existing dataset object without creating a copy.

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_dataset.py
Lines: L2574-L2651

Signature

@fingerprint_transform(inplace=True)
def set_format(
    self,
    type: Optional[str] = None,
    columns: Optional[list] = None,
    output_all_columns: bool = False,
    **format_kwargs,
):

Import

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
ds.set_format(type="numpy", columns=["text", "label"])

I/O Contract

Inputs

Name	Type	Required	Description
type	`Optional[str]`	No	Output format type: `None` (Python objects), `"numpy"`, `"torch"`, `"tensorflow"`, `"jax"`, `"arrow"`, `"pandas"`, or `"polars"`. Defaults to `None`.
columns	`Optional[list]`	No	Columns to include in formatted output. `None` means all columns. Defaults to `None`.
output_all_columns	`bool`	No	If `True`, columns not listed in `columns` are also returned, as unformatted Python objects. Defaults to `False`.
**format_kwargs	keyword arguments	No	Additional arguments passed to the conversion function (e.g., `np.array`, `torch.tensor`).

Outputs

Name	Type	Description
return	`None`	This method modifies the dataset in-place and returns nothing.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")

# Set format to NumPy for specific columns
ds.set_format(type="numpy", columns=["text", "label"])
print(ds.format)
# {'type': 'numpy', 'format_kwargs': {}, 'columns': ['text', 'label'], 'output_all_columns': False}

# Reset to default Python objects
ds.reset_format()
print(ds.format["type"])
# None

Related Pages

Implements Principle

Principle:Huggingface_Datasets_In_Place_Format_Setting
