Implementation:Huggingface Datasets Dataset With Format
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ML_Preprocessing |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for creating a new dataset view with a different output format without modifying the original dataset.
Description
The with_format method returns a new Dataset object with the specified output format, leaving the original dataset unchanged. Like set_format, it configures the __getitem__ return format to a framework-specific type (NumPy, PyTorch, TensorFlow, JAX, Arrow, pandas, or Polars). Unlike set_format, which changes the format of the dataset it is called on, with_format creates a deep copy of the dataset object and applies the new format settings to that copy, so it is safe to use when the original dataset reference must remain unmodified. The copy shares the underlying Arrow data, so memory overhead is minimal.
Usage
Use Dataset.with_format when you need multiple format views of the same dataset, when writing library code that should not mutate its inputs, or when you prefer a functional style where operations return new objects rather than modifying existing ones.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L2725-L2794
Signature
def with_format(
    self,
    type: Optional[str] = None,
    columns: Optional[list] = None,
    output_all_columns: bool = False,
    **format_kwargs,
):
Import
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
torch_ds = ds.with_format("torch")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| type | Optional[str] | No | Output format type: None (Python objects), "numpy", "torch", "tensorflow", "jax", "arrow", "pandas", or "polars". Defaults to None. |
| columns | Optional[list] | No | Columns to include in formatted output. None means all columns. Defaults to None. |
| output_all_columns | bool | No | Keep un-formatted columns in output as Python objects. Defaults to False. |
| **format_kwargs | keyword arguments | No | Additional arguments passed to the conversion function. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Dataset | A new dataset object with the specified format. The original dataset is not modified. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
# Create a PyTorch-formatted view
torch_ds = ds.with_format("torch")
print(torch_ds.format["type"])
# 'torch'
# Original dataset is unchanged
print(ds.format["type"])
# None
# Access returns tensors
example = torch_ds[0]
print(type(example["label"]))
# <class 'torch.Tensor'>