Implementation:Huggingface Datasets IterableDataset Map

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the Hugging Face Datasets library, for lazily applying transformation functions to the elements of a streaming dataset.

Description

IterableDataset.map wraps the dataset's internal example iterable with a MappedExamplesIterable. The transformation function is stored but not executed until the dataset is iterated. The method supports both per-example and batched application modes, optional index passing, column selection, column removal, and feature type overrides.

Internally, the method:

  1. Normalizes input_columns and remove_columns from strings to lists.
  2. Defaults to an identity function if no function is provided.
  3. Handles formatting for Arrow-backed iterables by wrapping with FormattedExamplesIterable and RebatchedArrowExamplesIterable as needed.
  4. Wraps the iterable in MappedExamplesIterable with all configuration parameters.
  5. Returns a new IterableDataset with the wrapped iterable, preserving split, formatting, and distributed settings.
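
The deferred execution described above can be pictured with a small pure-Python sketch. This is not the library's code: LazyMapped is a hypothetical stand-in for MappedExamplesIterable, and merging the function's output into the input example is an assumption about how per-example results are combined. It only shows that building the wrapper runs nothing and that work happens during iteration.

```python
from typing import Callable, Iterable, Iterator


class LazyMapped:
    """Toy stand-in for a mapped iterable: stores the function, applies it on iteration."""

    def __init__(self, source: Iterable[dict], function: Callable[[dict], dict]):
        self.source = source
        self.function = function  # stored only; nothing runs here

    def __iter__(self) -> Iterator[dict]:
        for example in self.source:
            # Assumption: the function's output is merged into a copy of the input
            yield {**example, **self.function(example)}


calls = []


def add_length(example: dict) -> dict:
    calls.append(example["text"])  # record each real invocation
    return {"length": len(example["text"])}


source = [{"text": "hello"}, {"text": "hi"}]
mapped = LazyMapped(source, add_length)
calls_before = len(calls)  # wrapper is built, function has not been called yet
rows = list(mapped)        # iterating is what triggers the calls
```

Because the function is only invoked inside __iter__, chaining several such wrappers costs nothing until the first element is requested.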

If the function is asynchronous, the map operation runs the function in parallel with up to one thousand simultaneous calls.
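
One common way to cap the number of simultaneous async calls is a semaphore. The sketch below illustrates that general technique with asyncio only; bounded_map and shout are hypothetical names, and nothing here is claimed to match how the library enforces its limit internally.

```python
import asyncio

MAX_CONCURRENT_CALLS = 1000  # the cap mentioned in the text; enforcement here is illustrative


async def bounded_map(function, examples, limit=MAX_CONCURRENT_CALLS):
    """Apply an async function to every example with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(example):
        async with semaphore:  # waits here once `limit` calls are already running
            return await function(example)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(e) for e in examples))


async def shout(example):
    await asyncio.sleep(0)  # stand-in for real async I/O such as an HTTP request
    return {"text": example["text"].upper()}


results = asyncio.run(bounded_map(shout, [{"text": "a"}, {"text": "b"}], limit=1))
```

This kind of bound matters mostly for I/O-heavy functions, where an unbounded fan-out could exhaust connections or trip rate limits.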

Usage

Use IterableDataset.map when you need to transform elements of a streaming dataset on-the-fly, such as tokenizing text, adding computed fields, or reformatting data structures.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/iterable_dataset.py
  • Lines: L2779-L2928

Signature

def map(
    self,
    function: Optional[Callable] = None,
    with_indices: bool = False,
    input_columns: Optional[Union[str, list[str]]] = None,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    drop_last_batch: bool = False,
    remove_columns: Optional[Union[str, list[str]]] = None,
    features: Optional[Features] = None,
    fn_kwargs: Optional[dict] = None,
) -> "IterableDataset":

Import

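from datasets import load_dataset

# "my_dataset" and my_function are placeholders for a real dataset ID and your own function
ds = load_dataset("my_dataset", split="train", streaming=True)
# map is a method on the returned IterableDataset, so no separate import is needed
ds = ds.map(my_function)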

I/O Contract

Inputs

Name | Type | Required | Description
function | Optional[Callable] | No | Function to apply to each example or batch. Defaults to identity. Can be async.
with_indices | bool | No | If True, passes element indices to the function. Defaults to False.
input_columns | Optional[Union[str, list[str]]] | No | Columns to pass as positional arguments. If None, passes entire example dict.
batched | bool | No | If True, provides batches of examples to the function. Defaults to False.
batch_size | Optional[int] | No | Number of examples per batch when batched=True. Defaults to 1000.
drop_last_batch | bool | No | Whether to drop the last incomplete batch. Defaults to False.
remove_columns | Optional[Union[str, list[str]]] | No | Columns to remove from the output.
features | Optional[Features] | No | Feature types for the resulting dataset.
fn_kwargs | Optional[dict] | No | Additional keyword arguments passed to the function.
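
To make the interaction of input_columns, with_indices, and fn_kwargs concrete, here is a pure-Python sketch of how a single call could be assembled from those options. call_on_example and tag are hypothetical helpers written for illustration, not library functions; the sketch assumes the index is passed after the data arguments.

```python
from typing import Any, Callable, Optional, Union


def call_on_example(
    function: Callable,
    example: dict,
    index: int,
    input_columns: Optional[Union[str, list]] = None,
    with_indices: bool = False,
    fn_kwargs: Optional[dict] = None,
) -> Any:
    """Hypothetical helper: build the call for one example from map-style options."""
    if isinstance(input_columns, str):
        input_columns = [input_columns]  # a single column name is normalized to a list
    # Whole example dict by default, otherwise the selected columns as positional args
    args = [example] if input_columns is None else [example[c] for c in input_columns]
    if with_indices:
        args.append(index)  # assumption: the index follows the data arguments
    return function(*args, **(fn_kwargs or {}))


def tag(text, idx, prefix=""):
    return {"tagged": f"{prefix}{idx}:{text}"}


out = call_on_example(
    tag,
    {"text": "hi", "label": 1},
    7,
    input_columns="text",
    with_indices=True,
    fn_kwargs={"prefix": "#"},
)
```

With input_columns set, the function receives column values rather than the example dict, so its signature has to match the selected columns.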

Outputs

Name | Type | Description
dataset | IterableDataset | A new streaming dataset with the map transformation registered in its pipeline.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)

def add_prefix(example):
    example["text"] = "Review: " + example["text"]
    return example

ds = ds.map(add_prefix)
list(ds.take(3))
# [{'label': 1, 'text': 'Review: the rock is destined ...'}, ...]

Related Pages

Implements Principle
