Implementation:Huggingface Datasets Dataset With Format For Tensors

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for configuring a Dataset to return ML framework tensors whenever its data is accessed.

Description

Dataset.with_format returns a new Dataset object with an on-the-fly formatting layer applied. When the type parameter is set to a framework name ("torch", "tensorflow", "jax", or "numpy"), every subsequent __getitem__ call on the returned object converts the underlying Arrow data to that framework's tensors or arrays. Internally, the method deep-copies the dataset object and calls set_format on the copy, so the original keeps its own format; the underlying Arrow data is shared rather than duplicated. The columns parameter selects which columns are converted, and output_all_columns controls whether the remaining, unconverted columns are still returned as Python objects.

Usage

Use Dataset.with_format when you need to produce framework-specific tensors for model training or evaluation. This is the primary entry point for tensor conversion and is commonly combined with PyTorch DataLoaders or TensorFlow tf.data pipelines.

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_dataset.py
Lines: L2725-L2794

Signature

def with_format(
    self,
    type: Optional[str] = None,
    columns: Optional[list] = None,
    output_all_columns: bool = False,
    **format_kwargs,
):

Import

from datasets import Dataset
# with_format is a method on Dataset instances

I/O Contract

Inputs

Name	Type	Required	Description
type	`Optional[str]`	No	Output format type. One of: None, "numpy", "torch", "tensorflow", "jax", "arrow", "pandas", "polars". None returns Python objects (default).
columns	`Optional[list]`	No	List of column names to include in the formatted output. None includes all columns (default).
output_all_columns	`bool`	No	If True, keep unconverted columns as Python objects alongside the formatted columns. Defaults to False.
**format_kwargs		No	Additional keyword arguments passed to the underlying tensor conversion function (e.g., dtype for np.array or torch.tensor).

Outputs

Name	Type	Description
dataset	`Dataset`	A new Dataset object with the formatting layer applied. The underlying Arrow data is shared.

Usage Examples

Basic Usage

from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True)

# Convert to PyTorch tensors
ds_torch = ds.with_format("torch")
print(ds_torch[0]["input_ids"])  # torch.Tensor

# Convert to NumPy arrays
ds_numpy = ds.with_format("numpy")
print(ds_numpy[0]["input_ids"])  # np.ndarray

# Convert specific columns to TensorFlow tensors
ds_tf = ds.with_format("tensorflow", columns=["input_ids", "attention_mask"])
print(ds_tf[0]["input_ids"])  # tf.Tensor

# Convert to JAX arrays
ds_jax = ds.with_format("jax")
print(ds_jax[0]["input_ids"])  # jax.Array

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Framework_Tensor_Conversion

Requires Environment
