Implementation:Huggingface Datasets Dataset With Format

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, ML_Preprocessing
Last Updated 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for creating a new dataset view with a different output format without modifying the original dataset.

Description

The with_format method returns a new Dataset object with the specified output format, leaving the original dataset unchanged. Like set_format, it configures the type that __getitem__ returns: Python objects by default, or a framework-specific type (NumPy, PyTorch, TensorFlow, JAX, Arrow, pandas, or Polars). Unlike set_format, which changes the format of the dataset in place, with_format creates a deep copy of the dataset object and applies the new format settings to the copy, so it is safe to use when the original dataset reference must remain unmodified. The copy shares the underlying Arrow data, so memory overhead is minimal.

Usage

Use Dataset.with_format when you need multiple format views of the same dataset, when writing library code that should not mutate its inputs, or when you prefer a functional style where operations return new objects rather than modifying existing ones.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L2725-L2794

Signature

def with_format(
    self,
    type: Optional[str] = None,
    columns: Optional[list] = None,
    output_all_columns: bool = False,
    **format_kwargs,
):

Import

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
torch_ds = ds.with_format("torch")

I/O Contract

Inputs

Name | Type | Required | Description
type | Optional[str] | No | Output format type: None (Python objects), "numpy", "torch", "tensorflow", "jax", "arrow", "pandas", or "polars". Defaults to None.
columns | Optional[list] | No | Columns to include in formatted output. None means all columns. Defaults to None.
output_all_columns | bool | No | Keep un-formatted columns in output as Python objects. Defaults to False.
**format_kwargs | keyword arguments | No | Additional arguments passed to the conversion function.

Outputs

Name | Type | Description
return | Dataset | A new dataset object with the specified format. The original dataset is not modified.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")

# Create a PyTorch-formatted view
torch_ds = ds.with_format("torch")
print(torch_ds.format["type"])
# 'torch'

# Original dataset is unchanged
print(ds.format["type"])
# None

# Access returns tensors
example = torch_ds[0]
print(type(example["label"]))
# <class 'torch.Tensor'>

Related Pages

Implements Principle
