Implementation:Huggingface Datasets Dataset With Format For Tensors
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the HuggingFace Datasets library, for configuring a Dataset to return ML framework tensors on data access.
Description
Dataset.with_format is a method that returns a new Dataset object with an on-the-fly formatting layer applied. When the type parameter is set to a framework name ("torch", "tensorflow", "jax", or "numpy"), every subsequent __getitem__ call converts the underlying Arrow data to tensors of that framework. The method creates a deep copy of the Dataset object, so the original is left unmodified, and then calls set_format on the copy; the underlying Arrow data is shared rather than duplicated. The columns parameter selects which columns are converted, and output_all_columns controls whether the remaining, unconverted columns are also included in the output.
Usage
Use Dataset.with_format when you need to produce framework-specific tensors for model training or evaluation. This is the primary entry point for tensor conversion and is commonly combined with PyTorch DataLoaders or TensorFlow tf.data pipelines.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L2725-L2794
Signature
def with_format(
    self,
    type: Optional[str] = None,
    columns: Optional[list] = None,
    output_all_columns: bool = False,
    **format_kwargs,
):
Import
from datasets import Dataset
# with_format is a method on Dataset instances
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| type | Optional[str] | No | Output format type. One of: None, "numpy", "torch", "tensorflow", "jax", "arrow", "pandas", "polars". None returns Python objects (default). |
| columns | Optional[list] | No | List of column names to include in the formatted output. None includes all columns (default). |
| output_all_columns | bool | No | If True, keep unconverted columns as Python objects alongside the formatted columns. Defaults to False. |
| **format_kwargs | keyword arguments | No | Additional keyword arguments passed to the underlying tensor conversion function (e.g., dtype for np.array or torch.tensor). |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | A new Dataset object with the formatting layer applied. The underlying Arrow data is shared. |
Usage Examples
Basic Usage
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True)

# Convert to PyTorch tensors
ds_torch = ds.with_format("torch")
print(ds_torch[0]["input_ids"])  # torch.Tensor

# Convert to NumPy arrays
ds_numpy = ds.with_format("numpy")
print(ds_numpy[0]["input_ids"])  # np.ndarray

# Convert specific columns to TensorFlow tensors
ds_tf = ds.with_format("tensorflow", columns=["input_ids", "attention_mask"])
print(ds_tf[0]["input_ids"])  # tf.Tensor

# Convert to JAX arrays
ds_jax = ds.with_format("jax")
print(ds_jax[0]["input_ids"])  # jax.Array