Implementation:Huggingface Datasets Dataset With Format For Tensors

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for configuring a Dataset to return ML framework tensors on data access.

Description

Dataset.with_format is a method that returns a new Dataset object with an on-the-fly formatting layer applied. When the type parameter is set to a framework name ("torch", "tensorflow", "jax", or "numpy"), every subsequent __getitem__ call converts the requested Arrow data to that framework's tensors or arrays at access time; nothing is converted up front. The method deep-copies the Dataset object, so the original is left unmodified, and then calls set_format on the copy; the underlying Arrow data is shared rather than duplicated. The columns option selects which columns are converted, and output_all_columns controls whether the remaining, unconverted columns are also returned.

Usage

Use Dataset.with_format when you need to produce framework-specific tensors for model training or evaluation. This is the primary entry point for tensor conversion and is commonly combined with PyTorch DataLoaders or TensorFlow tf.data pipelines.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L2725-L2794

Signature

def with_format(
    self,
    type: Optional[str] = None,
    columns: Optional[list] = None,
    output_all_columns: bool = False,
    **format_kwargs,
):

Import

from datasets import Dataset
# with_format is a method on Dataset instances

I/O Contract

Inputs

Name | Type | Required | Description
type | Optional[str] | No | Output format type. One of: None, "numpy", "torch", "tensorflow", "jax", "arrow", "pandas", "polars". None returns Python objects (default).
columns | Optional[list] | No | List of column names to include in the formatted output. None includes all columns (default).
output_all_columns | bool | No | If True, keep unconverted columns as Python objects alongside the formatted columns. Defaults to False.
**format_kwargs | - | No | Additional keyword arguments passed to the underlying tensor conversion function (e.g., dtype for np.array or torch.tensor).

Outputs

Name | Type | Description
dataset | Dataset | A new Dataset object with the formatting layer applied. The underlying Arrow data is shared.

Usage Examples

Basic Usage

from datasets import load_dataset
from transformers import AutoTokenizer

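# Load the validation split and tokenize it in batches.
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# padding=True pads each map batch to the longest sequence in that batch.
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True)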

# Return PyTorch tensors on access (conversion happens lazily, per row)
ds_torch = ds.with_format("torch")
print(ds_torch[0]["input_ids"])  # torch.Tensor

# Convert to NumPy arrays
ds_numpy = ds.with_format("numpy")
print(ds_numpy[0]["input_ids"])  # np.ndarray

# Convert specific columns to TensorFlow tensors
ds_tf = ds.with_format("tensorflow", columns=["input_ids", "attention_mask"])
print(ds_tf[0]["input_ids"])  # tf.Tensor

# Convert to JAX arrays
ds_jax = ds.with_format("jax")
print(ds_jax[0]["input_ids"])  # jax.Array
