

Implementation:Huggingface Datasets Dataset Set Format

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, ML_Preprocessing
Last Updated 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for setting the output format of a dataset in-place.

Description

The set_format method configures the dataset's __getitem__ return format in-place. After calling this method, accessing dataset elements returns data in the specified framework format (e.g., NumPy arrays, PyTorch tensors, TensorFlow tensors, JAX arrays, Arrow tables, pandas DataFrames, or Polars DataFrames). The formatting is applied on-the-fly during data access and does not modify the underlying Arrow data. You can optionally restrict which columns are included in the formatted output. The format can be reset to Python objects by calling set_format() with no arguments or using reset_format().

Usage

Use Dataset.set_format when you want to configure a dataset for a training loop that expects a specific tensor format, and you want to set the format once on the existing dataset object without creating a copy.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L2574-L2651

Signature

@fingerprint_transform(inplace=True)
def set_format(
    self,
    type: Optional[str] = None,
    columns: Optional[list] = None,
    output_all_columns: bool = False,
    **format_kwargs,
):

Import

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
ds.set_format(type="numpy", columns=["text", "label"])

I/O Contract

Inputs

Name | Type | Required | Description
type | Optional[str] | No | Output format type: None (Python objects), "numpy", "torch", "tensorflow", "jax", "arrow", "pandas", or "polars". Defaults to None.
columns | Optional[list] | No | Columns to include in formatted output. None means all columns. Defaults to None.
output_all_columns | bool | No | Keep un-formatted columns in output as Python objects. Defaults to False.
**format_kwargs | keyword arguments | No | Additional arguments passed to the conversion function (e.g., np.array, torch.tensor).

Outputs

Name | Type | Description
return | None | This method modifies the dataset in-place and returns nothing.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")

# Set format to NumPy for specific columns
ds.set_format(type="numpy", columns=["text", "label"])
print(ds.format)
# {'type': 'numpy', 'format_kwargs': {}, 'columns': ['text', 'label'], 'output_all_columns': False}

# Reset to default Python objects
ds.reset_format()
print(ds.format["type"])
# None

Related Pages

Implements Principle
