Implementation:Huggingface Datasets IterableDataset Map
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the HuggingFace Datasets library, for lazily applying transformation functions to the elements of a streaming dataset.
Description
IterableDataset.map wraps the dataset's internal example iterable with a MappedExamplesIterable. The transformation function is stored but not executed until the dataset is iterated. The method supports both per-example and batched application modes, optional index passing, column selection, column removal, and feature type overrides.
Internally, the method:
- Normalizes input_columns and remove_columns from strings to lists.
- Defaults to an identity function if no function is provided.
- Handles formatting for Arrow-backed iterables by wrapping with FormattedExamplesIterable and RebatchedArrowExamplesIterable as needed.
- Wraps the iterable in MappedExamplesIterable with all configuration parameters.
- Returns a new IterableDataset with the wrapped iterable, preserving split, formatting, and distributed settings.
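The deferred execution described above can be illustrated with a small, library-free sketch. The class LazyMapped and the function shout are hypothetical names invented for this example; this is not the code of MappedExamplesIterable, only a toy model of the same idea, assuming a plain list of dicts as the source.

```python
from typing import Callable, Iterable, Iterator, Optional


class LazyMapped:
    """Toy stand-in for a mapped iterable (hypothetical, not the library class)."""

    def __init__(
        self,
        source: Iterable[dict],
        function: Optional[Callable[[dict], dict]] = None,
    ):
        self.source = source
        # Fall back to an identity function when none is provided.
        self.function = function if function is not None else (lambda example: example)

    def __iter__(self) -> Iterator[dict]:
        for example in self.source:
            # The stored function is only called here, during iteration.
            yield self.function(example)


calls = []


def shout(example: dict) -> dict:
    # Record each call so we can observe when the function actually runs.
    calls.append(example["text"])
    return {**example, "text": example["text"].upper()}


rows = [{"text": "a"}, {"text": "b"}]
mapped = LazyMapped(rows, shout)
calls_before_iteration = list(calls)  # snapshot taken before any iteration
result = list(mapped)                 # iterating triggers the function
```

Constructing the wrapper costs nothing; the work happens only when the result is consumed, which is what makes this pattern suitable for streams that do not fit in memory.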
If the function is asynchronous, the map operation runs the function in parallel with up to one thousand simultaneous calls.
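One common way to cap the number of simultaneous asynchronous calls is a semaphore. The sketch below shows that general technique with the standard asyncio module; it is an assumption-laden illustration, not the library's implementation. The names bounded_map, add_length, and MAX_CONCURRENT are hypothetical, and unlike a streaming map this toy version collects all results in memory.

```python
import asyncio

MAX_CONCURRENT = 1000  # hypothetical constant mirroring the limit described above


async def bounded_map(function, examples, limit=MAX_CONCURRENT):
    """Apply an async function to every example, at most `limit` calls at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(example):
        # Each call must acquire a slot before it starts.
        async with semaphore:
            return await function(example)

    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(run_one(example) for example in examples))


async def add_length(example):
    await asyncio.sleep(0)  # stand-in for real asynchronous I/O
    return {**example, "length": len(example["text"])}


results = asyncio.run(
    bounded_map(add_length, [{"text": "ab"}, {"text": "abc"}], limit=2)
)
```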
Usage
Use IterableDataset.map when you need to transform elements of a streaming dataset on the fly, for example to tokenize text, add computed fields, or reformat data structures.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/iterable_dataset.py
- Lines: L2779-L2928
Signature
def map(
    self,
    function: Optional[Callable] = None,
    with_indices: bool = False,
    input_columns: Optional[Union[str, list[str]]] = None,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    drop_last_batch: bool = False,
    remove_columns: Optional[Union[str, list[str]]] = None,
    features: Optional[Features] = None,
    fn_kwargs: Optional[dict] = None,
) -> "IterableDataset":
Import
from datasets import load_dataset
ds = load_dataset("my_dataset", split="train", streaming=True)
# map is a method on the returned IterableDataset
ds = ds.map(my_function)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| function | Optional[Callable] | No | Function to apply to each example or batch. Defaults to identity. Can be async. |
| with_indices | bool | No | If True, passes element indices to the function. Defaults to False. |
| input_columns | Optional[Union[str, list[str]]] | No | Columns to pass as positional arguments. If None, passes entire example dict. |
| batched | bool | No | If True, provides batches of examples to the function. Defaults to False. |
| batch_size | Optional[int] | No | Number of examples per batch when batched=True. Defaults to 1000. |
| drop_last_batch | bool | No | Whether to drop the last incomplete batch. Defaults to False. |
| remove_columns | Optional[Union[str, list[str]]] | No | Columns to remove from the output. |
| features | Optional[Features] | No | Feature types for the resulting dataset. |
| fn_kwargs | Optional[dict] | No | Additional keyword arguments passed to the function. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | IterableDataset | A new streaming dataset with the map transformation registered in its pipeline. |
Usage Examples
Basic Usage
from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)

def add_prefix(example):
    # Prepend a fixed label to the review text.
    example["text"] = "Review: " + example["text"]
    return example

# Nothing runs yet; the function is applied as examples are consumed.
ds = ds.map(add_prefix)
list(ds.take(3))
# [{'label': 1, 'text': 'Review: the rock is destined ...'}, ...]