Implementation:Huggingface Datasets IterableDataset Map

Knowledge Sources	Huggingface Datasets, HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the Hugging Face Datasets library, for lazily applying transformation functions to the elements of a streaming dataset.

Description

IterableDataset.map wraps the dataset's internal example iterable with a MappedExamplesIterable. The transformation function is stored but not executed until the dataset is iterated. The method supports both per-example and batched application modes, optional index passing, column selection, column removal, and feature type overrides.

Internally, the method:

Normalizes input_columns and remove_columns from strings to lists.
Defaults to an identity function if no function is provided.
Handles formatting for Arrow-backed iterables by wrapping with FormattedExamplesIterable and RebatchedArrowExamplesIterable as needed.
Wraps the iterable in MappedExamplesIterable with all configuration parameters.
Returns a new IterableDataset with the wrapped iterable, preserving split, formatting, and distributed settings.
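
The deferred execution described above can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not the library's code: `LazyMapped` is a hypothetical stand-in for the wrapping iterable, and merging the function's output into a copy of the input example is an assumption made for illustration.

```python
from typing import Callable, Iterable, Iterator


class LazyMapped:
    """Hypothetical stand-in for a mapped iterable (not the library class)."""

    def __init__(self, source: Iterable[dict], function: Callable[[dict], dict]):
        self.source = source
        self.function = function  # stored here, not called

    def __iter__(self) -> Iterator[dict]:
        for example in self.source:
            # Assumption: the function's output is merged into a copy of the input.
            yield {**example, **self.function(example)}


calls = []


def add_length(example: dict) -> dict:
    calls.append(example["text"])  # record every invocation
    return {"length": len(example["text"])}


source = [{"text": "a"}, {"text": "bb"}]
mapped = LazyMapped(source, add_length)
calls_before_iteration = len(calls)  # the function has not run yet
rows = list(mapped)                  # iterating is what triggers the function
```

Constructing the wrapper costs nothing regardless of dataset size; the work happens one example at a time as the consumer pulls items.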

If the function is asynchronous (defined with async def), map runs it concurrently, with up to 1,000 calls in flight at the same time.
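
One way to picture bounded concurrency for an asynchronous function is an asyncio semaphore. The sketch below is an assumed model, not the library's scheduling code: `map_async` and its `max_concurrency` parameter are hypothetical names, and the limit is set to 3 in the demo so the cap is easy to observe.

```python
import asyncio
from typing import Awaitable, Callable


async def map_async(
    examples: list,
    function: Callable[[dict], Awaitable[dict]],
    max_concurrency: int = 1000,
) -> list:
    """Apply `function` to every example, with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(example: dict) -> dict:
        async with semaphore:  # waits here while the limit is reached
            return {**example, **await function(example)}

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(e) for e in examples))


in_flight = 0
peak_in_flight = 0


async def slow_upper(example: dict) -> dict:
    """Simulated slow call that tracks how many calls overlap."""
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return {"upper": example["text"].upper()}


examples = [{"text": f"item {i}"} for i in range(10)]
results = asyncio.run(map_async(examples, slow_upper, max_concurrency=3))
```

A cap like this keeps I/O-bound functions (for example, remote API calls) from opening an unbounded number of simultaneous requests.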

Usage

Use IterableDataset.map when you need to transform elements of a streaming dataset on-the-fly, such as tokenizing text, adding computed fields, or reformatting data structures.

Code Reference

Source Location

Repository: datasets
File: src/datasets/iterable_dataset.py
Lines: L2779-L2928

Signature

def map(
    self,
    function: Optional[Callable] = None,
    with_indices: bool = False,
    input_columns: Optional[Union[str, list[str]]] = None,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    drop_last_batch: bool = False,
    remove_columns: Optional[Union[str, list[str]]] = None,
    features: Optional[Features] = None,
    fn_kwargs: Optional[dict] = None,
) -> "IterableDataset":

Import

from datasets import load_dataset

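# "my_dataset" and my_function are placeholders for your own dataset name and transformation
ds = load_dataset("my_dataset", split="train", streaming=True)
# map is a method on the returned IterableDataset, so it needs no import of its own
ds = ds.map(my_function)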

I/O Contract

Inputs

Name	Type	Required	Description
function	`Optional[Callable]`	No	Function to apply to each example or batch. Defaults to identity. Can be async.
with_indices	`bool`	No	If True, passes element indices to the function. Defaults to False.
input_columns	`Optional[Union[str, list[str]]]`	No	Columns to pass as positional arguments. If None, passes entire example dict.
batched	`bool`	No	If True, provides batches of examples to the function. Defaults to False.
batch_size	`Optional[int]`	No	Number of examples per batch when `batched=True`. Defaults to 1000.
drop_last_batch	`bool`	No	Whether to drop the last incomplete batch. Defaults to False.
remove_columns	`Optional[Union[str, list[str]]]`	No	Columns to remove from the output.
features	`Optional[Features]`	No	Feature types for the resulting dataset.
fn_kwargs	`Optional[dict]`	No	Additional keyword arguments passed to the function.
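
To make the interaction between these parameters concrete, here is a simplified pure-Python model of the table above. It is a sketch, not the library's implementation: `model_map` is a hypothetical name, it ignores `features`, Arrow data, and formatting, it does not model `batch_size=None`, and it assumes the function's output is merged into the input columns before `remove_columns` is applied.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Optional


def model_map(
    examples: Iterable[dict],
    function: Callable,
    with_indices: bool = False,
    input_columns: Optional[list] = None,
    batched: bool = False,
    batch_size: int = 1000,
    drop_last_batch: bool = False,
    remove_columns: Optional[list] = None,
    fn_kwargs: Optional[dict] = None,
) -> Iterator[dict]:
    """Simplified, hypothetical model of the Inputs table; not the library's code."""
    fn_kwargs = fn_kwargs or {}
    remove_columns = remove_columns or []
    iterator = iter(examples)
    index = 0
    while True:
        # Per-example mode is modeled as chunks of size one.
        chunk = list(islice(iterator, batch_size if batched else 1))
        if not chunk or (batched and drop_last_batch and len(chunk) < batch_size):
            return
        indices = list(range(index, index + len(chunk)))
        index += len(chunk)
        if batched:
            # A batch is a dict mapping each column name to a list of values.
            inputs = {col: [ex[col] for ex in chunk] for col in chunk[0]}
        else:
            inputs = dict(chunk[0])
        # input_columns switches from "whole dict" to positional column values.
        args = [inputs] if input_columns is None else [inputs[c] for c in input_columns]
        if with_indices:
            args.append(indices if batched else indices[0])
        output = {**inputs, **function(*args, **fn_kwargs)}
        for col in remove_columns:
            output.pop(col, None)
        if batched:
            # Unbatch: turn the dict of lists back into one dict per example.
            n_rows = len(next(iter(output.values()), []))
            for i in range(n_rows):
                yield {col: values[i] for col, values in output.items()}
        else:
            yield output


rows = [{"text": "a", "label": 0}, {"text": "bb", "label": 1}, {"text": "ccc", "label": 0}]


def add_length(texts, indices, scale=1):
    # Receives the "text" column and the batch indices as positional arguments.
    return {"length": [len(t) * scale for t in texts], "idx": indices}


options = dict(with_indices=True, input_columns=["text"], batched=True,
               batch_size=2, remove_columns=["label"], fn_kwargs={"scale": 10})
result = list(model_map(rows, add_length, **options))
result_dropped = list(model_map(rows, add_length, drop_last_batch=True, **options))
```

With three rows and `batch_size=2`, the function is called twice: once with two texts and once with the remaining one. Setting `drop_last_batch=True` discards that final short batch, so `result_dropped` holds only the first two rows.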

Outputs

Name	Type	Description
dataset	`IterableDataset`	A new streaming dataset with the map transformation registered in its pipeline.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)

def add_prefix(example):
    example["text"] = "Review: " + example["text"]
    return example

ds = ds.map(add_prefix)
list(ds.take(3))
# [{'label': 1, 'text': 'Review: the rock is destined ...'}, ...]

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Streaming_Map_Transform
