Implementation:Huggingface Datasets IterableDataset With Format

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

IterableDataset.with_format is the concrete tool in the HuggingFace Datasets library for configuring the output format of streamed dataset elements.

Description

IterableDataset.with_format returns a new IterableDataset that is configured to convert each yielded element to the specified format type. The method does not modify the underlying data or iterable pipeline; it only sets a FormattingConfig object on the new dataset instance.

Internally, the method:

  1. Resolves the format type string via get_format_type_from_alias(type), which normalizes aliases (e.g., "pt" to "torch").
  2. Creates a new FormattingConfig(format_type=type) from the resolved type.
  3. Returns a new IterableDataset with the same _ex_iterable, info, and split, but with the new formatting configuration.

No conversion happens at call time. The formatting is applied during iteration: in __iter__, if a formatting config is present, a format-specific Formatter is instantiated, and each element is passed through formatter.format_row(pa_table), where pa_table is the Arrow table holding that element.

The formatting configuration propagates through subsequent lazy operations (map, filter, etc.), ensuring consistent output types.
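The behavior described above can be modeled in a few lines of plain Python. This is a simplified sketch, not the library's code: ToyIterableDataset, ToyFormattingConfig, resolve_alias, and the tagging "formatters" are hypothetical stand-ins, and only the "pt" to "torch" alias is taken from this page. It illustrates three points: with_format returns a new object and leaves the original untouched, conversion happens only during iteration, and the config is carried through a lazy map.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, Optional

# Illustrative alias table; only "pt" -> "torch" comes from the page text.
_ALIASES = {"pt": "torch"}


def resolve_alias(format_type: Optional[str]) -> Optional[str]:
    """Toy stand-in for get_format_type_from_alias."""
    return _ALIASES.get(format_type, format_type)


# Toy formatters: the "torch" one only tags values so the conversion is visible.
# Real formatters would build tensors or tables instead.
_FORMATTERS: Dict[Optional[str], Callable[[dict], Any]] = {
    None: lambda row: row,
    "torch": lambda row: {k: ("tensor", v) for k, v in row.items()},
}


@dataclass(frozen=True)
class ToyFormattingConfig:
    format_type: Optional[str] = None


class ToyIterableDataset:
    def __init__(self, source: Callable[[], Iterator[dict]],
                 formatting: Optional[ToyFormattingConfig] = None):
        self._source = source          # zero-argument callable returning a fresh iterator
        self._formatting = formatting  # None means plain Python objects

    def with_format(self, type: Optional[str] = None) -> "ToyIterableDataset":
        # Same source, new config; self is not modified.
        resolved = resolve_alias(type)
        return ToyIterableDataset(self._source, ToyFormattingConfig(format_type=resolved))

    def map(self, fn: Callable[[dict], dict]) -> "ToyIterableDataset":
        # Lazy: nothing runs until iteration. The formatting config is carried forward.
        # This model does not try to reproduce how the library feeds rows to fn.
        source = self._source
        return ToyIterableDataset(lambda: (fn(row) for row in source()), self._formatting)

    def __iter__(self) -> Iterator[Any]:
        fmt = self._formatting.format_type if self._formatting else None
        formatter = _FORMATTERS[fmt]
        for row in self._source():
            yield formatter(row)  # conversion happens here, per element


rows = [{"x": 1}, {"x": 2}]
plain = ToyIterableDataset(lambda: iter(rows))
formatted = plain.with_format("pt")                     # alias resolved to "torch"
mapped = formatted.map(lambda r: {"x": r["x"] * 10})    # config survives the map
```

Iterating plain still yields the original dictionaries, while formatted and mapped yield tagged values, which mirrors the claim that the method only attaches configuration to a new instance.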

Usage

Use IterableDataset.with_format when you need to convert streaming dataset elements to framework-specific tensor types (PyTorch, TensorFlow, NumPy, JAX) or to tabular formats (Arrow, Pandas, Polars).

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/iterable_dataset.py
  • Lines: L2722-L2777

Signature

def with_format(
    self,
    type: Optional[str] = None,
) -> "IterableDataset":

Import

from datasets import load_dataset

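# "my_dataset" is a placeholder; substitute a real dataset repository ID
ds = load_dataset("my_dataset", split="train", streaming=True)
# with_format is a method on the returned IterableDataset, so no separate import is needed
ds = ds.with_format("torch")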

I/O Contract

Inputs

Name | Type | Required | Description
type | Optional[str] | No | Output format type. One of None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars'. None returns Python objects (default).
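When the format name comes from a configuration file or command-line flag, it can help to check it against the documented values before building the pipeline. The helper below is a hypothetical pre-check, not part of the Datasets API: check_format_name and both constants are names invented for this sketch, and the allow-list contains only the values listed in the table above plus the "pt" alias mentioned in the Description. The library may accept further aliases that this strict check would reject.

```python
from typing import Optional

# Canonical names copied from the Inputs table on this page.
DOCUMENTED_FORMATS = frozenset(
    {None, "numpy", "torch", "tensorflow", "jax", "arrow", "pandas", "polars"}
)
# Only alias named on this page; the library may define others.
DOCUMENTED_ALIASES = {"pt": "torch"}


def check_format_name(name: Optional[str]) -> Optional[str]:
    """Return the canonical format name, or raise ValueError if it is not documented here."""
    canonical = DOCUMENTED_ALIASES.get(name, name)
    if canonical not in DOCUMENTED_FORMATS:
        known = sorted(f for f in DOCUMENTED_FORMATS if f is not None)
        raise ValueError(f"Unknown format {name!r}; expected None or one of {known}")
    return canonical
```

The returned value can then be passed as the type argument, for example ds.with_format(check_format_name(user_choice)), where user_choice is whatever string the caller supplied.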

Outputs

Name | Type | Description
dataset | IterableDataset | A new streaming dataset configured to yield elements in the specified format.

Usage Examples

Basic Usage

from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Tokenize in batches; padding=True pads each batch to its own longest sequence
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, padding=True), batched=True)
# Request PyTorch tensors; the conversion happens lazily during iteration
ds = ds.with_format("torch")

example = next(iter(ds))
# example['input_ids'] is a torch.Tensor
# example['label'] is a torch.Tensor

Related Pages

Implements Principle
