Implementation:Huggingface Datasets Dataset Map

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, ML_Preprocessing
Last Updated	2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for applying a transformation function to every example in a dataset.

Description

The map method applies a user-defined function to every example (or batch of examples) in the dataset, returning a new dataset with the transformed data. It supports element-wise and batched processing, multiprocessing via num_proc, caching of results, column removal during mapping, and both synchronous and asynchronous functions. If the function returns a column that already exists, it overwrites that column. If the function returns None, the dataset is returned unchanged. The method also supports providing example indices and process rank to the function via with_indices and with_rank.

Usage

Use Dataset.map for all element-level transformations including tokenization, feature engineering, data cleaning, text normalization, and any operation that modifies, adds, or restructures columns based on per-example computation.

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_dataset.py
Lines: L2932-L3375

Signature

@transmit_format
def map(
    self,
    function: Optional[Callable] = None,
    with_indices: bool = False,
    with_rank: bool = False,
    input_columns: Optional[Union[str, list[str]]] = None,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    drop_last_batch: bool = False,
    remove_columns: Optional[Union[str, list[str]]] = None,
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    features: Optional[Features] = None,
    disable_nullable: bool = False,
    fn_kwargs: Optional[dict] = None,
    num_proc: Optional[int] = None,
    suffix_template: str = "_{rank:05d}_of_{num_proc:05d}",
    new_fingerprint: Optional[str] = None,
    desc: Optional[str] = None,
    try_original_type: Optional[bool] = True,
) -> "Dataset":

Import

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
ds = ds.map(lambda example: {"text": "Review: " + example["text"]})

I/O Contract

Inputs

Name	Type	Required	Description
function	`Optional[Callable]`	No	Function to apply. Signature depends on `batched`, `with_indices`, `with_rank`. Defaults to identity.
with_indices	`bool`	No	Provide example indices to `function`. Defaults to `False`.
with_rank	`bool`	No	Provide process rank to `function`. Defaults to `False`.
input_columns	`Optional[Union[str, list[str]]]`	No	Columns to pass as positional arguments. If `None`, all formatted columns are passed as a dict.
batched	`bool`	No	Whether to provide batches of examples to `function`. Defaults to `False`.
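batch_size	`Optional[int]`	No	Number of examples per batch if `batched=True`. Defaults to 1000. If `batch_size <= 0` or `batch_size` is `None`, the full dataset is passed to `function` as a single batch.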
drop_last_batch	`bool`	No	Whether to drop the last incomplete batch. Defaults to `False`.
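remove_columns	`Optional[Union[str, list[str]]]`	No	Columns to remove while mapping. They are removed before the examples are updated with the output of `function`, so `function` can still read them, and columns it returns under the same names are kept.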
keep_in_memory	`bool`	No	Keep result in memory instead of caching to disk. Defaults to `False`.
load_from_cache_file	`Optional[bool]`	No	Use cached result if available. Defaults to `True` if caching is enabled.
cache_file_name	`Optional[str]`	No	Path for the cache file. Auto-generated if `None`.
writer_batch_size	`Optional[int]`	No	Rows per write operation for cache writer. Defaults to 1000.
features	`Optional[Features]`	No	Specific Features for the output cache file.
disable_nullable	`bool`	No	Disallow null values. Defaults to `False`.
fn_kwargs	`Optional[dict]`	No	Keyword arguments passed to `function`.
num_proc	`Optional[int]`	No	Number of processes for multiprocessing. `None` or 0 means no multiprocessing.
suffix_template	`str`	No	Suffix template for shard cache files. Defaults to `"_{rank:05d}_of_{num_proc:05d}"`.
new_fingerprint	`Optional[str]`	No	The new fingerprint after transform. Auto-computed if `None`.
desc	`Optional[str]`	No	Description displayed alongside the progress bar.
try_original_type	`Optional[bool]`	No	Try to keep original column types. Defaults to `True`.

Outputs

Name	Type	Description
return	`Dataset`	A new dataset with the function applied to all examples.

Usage Examples

Basic Usage

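from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")

# Element-wise mapping: the function receives one example as a dict
def add_prefix(example):
    example["text"] = "Review: " + example["text"]
    return example

prefixed = ds.map(add_prefix)

# Batched mapping (e.g., tokenization): the function receives a dict of lists.
# The checkpoint name is an assumption for illustration; any tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized = ds.map(lambda batch: tokenizer(batch["text"]), batched=True)

# Multiprocessing: split the work across 4 worker processes.
# Mapping from ds again avoids adding the prefix twice.
prefixed_mp = ds.map(add_prefix, num_proc=4)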

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Dataset_Mapping

Requires Environment

Environment:Huggingface_Datasets_Python_PyArrow_Core

Uses Heuristic
