Implementation:Huggingface Datasets IterableDataset With Format
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the Hugging Face Datasets library, for configuring the output format of streamed dataset elements.
Description
IterableDataset.with_format returns a new IterableDataset that is configured to convert each yielded element to the specified format type. The method does not modify the underlying data or iterable pipeline; it only sets a FormattingConfig object on the new dataset instance.
Internally, the method:
- Resolves the format type string via get_format_type_from_alias(type), which normalizes aliases (e.g., "pt" to "torch").
- Creates a new FormattingConfig(format_type=type).
- Returns a new IterableDataset with the same _ex_iterable, info, and split, but with the new formatting configuration.
- The formatting is applied during iteration: in __iter__, if a formatting config is present, a format-specific Formatter is instantiated, and each element is passed through formatter.format_row(pa_table).
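The steps above can be illustrated with a small standard-library sketch. It is not the library's code: ToyIterableDataset, ToyFormattingConfig, toy_resolve_alias, and the "arr"/"array" format are hypothetical stand-ins, with array.array playing the role of a tensor type. It only shows the pattern: resolve the alias, store a config on a new instance that shares the same underlying rows, and apply the formatter lazily during iteration.

```python
from array import array
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Row = Dict[str, Any]

# Hypothetical alias table standing in for get_format_type_from_alias.
TOY_ALIASES: Dict[str, str] = {"arr": "array"}


def toy_resolve_alias(format_type: Optional[str]) -> Optional[str]:
    """Map an alias to its canonical name; unknown names pass through."""
    return TOY_ALIASES.get(format_type, format_type)


def _format_python(row: Row) -> Row:
    # Default format: rows are returned unchanged as Python objects.
    return row


def _format_array(row: Row) -> Row:
    # Toy "tensor" format: list columns become array.array of 64-bit ints.
    return {k: array("q", v) if isinstance(v, list) else v for k, v in row.items()}


# Hypothetical formatter registry keyed by canonical format name.
TOY_FORMATTERS: Dict[Optional[str], Callable[[Row], Row]] = {
    None: _format_python,
    "array": _format_array,
}


@dataclass(frozen=True)
class ToyFormattingConfig:
    format_type: Optional[str] = None


class ToyIterableDataset:
    def __init__(self, rows: Iterable[Row], formatting: Optional[ToyFormattingConfig] = None):
        self._rows = rows
        self._formatting = formatting

    def with_format(self, type: Optional[str] = None) -> "ToyIterableDataset":
        # New instance, same underlying rows; only the config differs.
        resolved = toy_resolve_alias(type)
        return ToyIterableDataset(self._rows, ToyFormattingConfig(format_type=resolved))

    def __iter__(self) -> Iterator[Row]:
        # Formatting happens here, one row at a time, not in with_format.
        format_type = self._formatting.format_type if self._formatting else None
        format_row = TOY_FORMATTERS[format_type]
        for row in self._rows:
            yield format_row(row)


rows = [{"ids": [1, 2, 3], "text": "a"}, {"ids": [4, 5], "text": "b"}]
raw = ToyIterableDataset(rows)
formatted = raw.with_format("arr")  # alias resolves to "array"

first_raw = next(iter(raw))
first_formatted = next(iter(formatted))
```

In this sketch, raw is left untouched by the call to with_format, which mirrors the description above: the original dataset and the formatted one share the same source of rows and differ only in their formatting configuration.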
The formatting configuration propagates through subsequent lazy operations (map, filter, etc.), ensuring consistent output types.
Usage
Use IterableDataset.with_format when you need to convert streaming dataset elements to framework-specific tensor types (PyTorch, TensorFlow, NumPy, JAX) or to tabular formats (Arrow, Pandas, Polars).
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/iterable_dataset.py
- Lines: L2722-L2777
Signature
def with_format(
self,
type: Optional[str] = None,
) -> "IterableDataset":
Import
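from datasets import load_dataset
# "my_dataset" is a placeholder; substitute a real dataset name or path
ds = load_dataset("my_dataset", split="train", streaming=True)
# with_format is a method on the returned IterableDataset, so it needs no separate import
ds = ds.with_format("torch")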
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| type | Optional[str] | No | Output format type. One of None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars'. None returns Python objects (default). |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | IterableDataset | A new streaming dataset configured to yield elements in the specified format. |
Usage Examples
Basic Usage
from datasets import load_dataset
from transformers import AutoTokenizer
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Tokenize and convert to PyTorch tensors
ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)
ds = ds.with_format("torch")
example = next(iter(ds))
# example['input_ids'] is a torch.Tensor
# example['label'] is a torch.Tensor