Implementation:Huggingface Datasets Dataset Map

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, ML_Preprocessing
Last Updated: 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for applying a transformation function to every example in a dataset.

Description

The map method applies a user-defined function to every example (or batch of examples) in the dataset, returning a new dataset with the transformed data. It supports element-wise and batched processing, multiprocessing via num_proc, caching of results, column removal during mapping, and both synchronous and asynchronous functions. If the function returns a column that already exists, it overwrites that column. If the function returns None, the dataset is returned unchanged. The method also supports providing example indices and process rank to the function via with_indices and with_rank.

Usage

Use Dataset.map for per-example transformations such as tokenization, feature engineering, data cleaning, and text normalization, and more generally for any operation that modifies, adds, or restructures columns based on per-example computation.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L2932-L3375

Signature

@transmit_format
def map(
    self,
    function: Optional[Callable] = None,
    with_indices: bool = False,
    with_rank: bool = False,
    input_columns: Optional[Union[str, list[str]]] = None,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    drop_last_batch: bool = False,
    remove_columns: Optional[Union[str, list[str]]] = None,
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    features: Optional[Features] = None,
    disable_nullable: bool = False,
    fn_kwargs: Optional[dict] = None,
    num_proc: Optional[int] = None,
    suffix_template: str = "_{rank:05d}_of_{num_proc:05d}",
    new_fingerprint: Optional[str] = None,
    desc: Optional[str] = None,
    try_original_type: Optional[bool] = True,
) -> "Dataset":

Import

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
ds = ds.map(lambda example: {"text": "Review: " + example["text"]})

I/O Contract

Inputs

Name Type Required Description
function Optional[Callable] No Function to apply. Signature depends on batched, with_indices, with_rank. Defaults to identity.
with_indices bool No Provide example indices to function. Defaults to False.
with_rank bool No Provide process rank to function. Defaults to False.
input_columns Optional[Union[str, list[str]]] No Columns to pass as positional arguments. If None, all formatted columns are passed as a dict.
batched bool No Whether to provide batches of examples to function. Defaults to False.
batch_size Optional[int] No Number of examples per batch if batched=True. Defaults to 1000.
drop_last_batch bool No Whether to drop the last incomplete batch. Defaults to False.
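remove_columns Optional[Union[str, list[str]]] No Columns to remove while mapping. They are removed before the examples are updated with the function's output, so the function still sees them, and columns it returns under a name listed here are kept.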
keep_in_memory bool No Keep result in memory instead of caching to disk. Defaults to False.
load_from_cache_file Optional[bool] No Use cached result if available. Defaults to True if caching is enabled.
cache_file_name Optional[str] No Path for the cache file. Auto-generated if None.
writer_batch_size Optional[int] No Rows per write operation for cache writer. Defaults to 1000.
features Optional[Features] No Specific Features for the output cache file.
disable_nullable bool No Disallow null values. Defaults to False.
fn_kwargs Optional[dict] No Keyword arguments passed to function.
num_proc Optional[int] No Number of processes for multiprocessing. None or 0 means no multiprocessing.
suffix_template str No Suffix template for shard cache files. Defaults to "_{rank:05d}_of_{num_proc:05d}".
new_fingerprint Optional[str] No The new fingerprint after transform. Auto-computed if None.
desc Optional[str] No Description displayed alongside the progress bar.
try_original_type Optional[bool] No Try to keep original column types. Defaults to True.

Outputs

Name Type Description
return Dataset A new dataset with the function applied to all examples.

Usage Examples

Basic Usage

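from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")

# Element-wise mapping: the function receives one example as a dict
def add_prefix(example):
    example["text"] = "Review: " + example["text"]
    return example

prefixed = ds.map(add_prefix)

# Batched mapping (e.g., tokenization); the checkpoint is an illustrative choice
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized = ds.map(lambda batch: tokenizer(batch["text"]), batched=True)

# Multiprocessing: the same element-wise mapping spread over 4 worker processes
prefixed_parallel = ds.map(add_prefix, num_proc=4)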
