Implementation:Huggingface Datasets Dataset Map
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ML_Preprocessing |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for applying a transformation function to every example in a dataset.
Description
The map method applies a user-defined function to every example (or batch of examples) in the dataset, returning a new dataset with the transformed data. It supports element-wise and batched processing, multiprocessing via num_proc, caching of results, column removal during mapping, and both synchronous and asynchronous functions. If the function returns a column that already exists, it overwrites that column. If the function returns None, the dataset is returned unchanged. The method also supports providing example indices and process rank to the function via with_indices and with_rank.
Usage
Use Dataset.map for all element-level transformations including tokenization, feature engineering, data cleaning, text normalization, and any operation that modifies, adds, or restructures columns based on per-example computation.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L2932-L3375
Signature
@transmit_format
def map(
    self,
    function: Optional[Callable] = None,
    with_indices: bool = False,
    with_rank: bool = False,
    input_columns: Optional[Union[str, list[str]]] = None,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    drop_last_batch: bool = False,
    remove_columns: Optional[Union[str, list[str]]] = None,
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    features: Optional[Features] = None,
    disable_nullable: bool = False,
    fn_kwargs: Optional[dict] = None,
    num_proc: Optional[int] = None,
    suffix_template: str = "_{rank:05d}_of_{num_proc:05d}",
    new_fingerprint: Optional[str] = None,
    desc: Optional[str] = None,
    try_original_type: Optional[bool] = True,
) -> "Dataset":
Import
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
ds = ds.map(lambda example: {"text": "Review: " + example["text"]})
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| function | Optional[Callable] | No | Function to apply. Its signature depends on batched, with_indices, and with_rank. Defaults to the identity function. |
| with_indices | bool | No | Provide example indices to function. Defaults to False. |
| with_rank | bool | No | Provide the process rank to function. Defaults to False. |
| input_columns | Optional[Union[str, list[str]]] | No | Columns to pass to function as positional arguments. If None, all formatted columns are passed as a dict. |
| batched | bool | No | Whether to provide batches of examples to function. Defaults to False. |
| batch_size | Optional[int] | No | Number of examples per batch if batched=True. Defaults to 1000. If None or <= 0, the full dataset is passed as a single batch. |
| drop_last_batch | bool | No | Whether to drop the last batch if it is smaller than batch_size. Defaults to False. |
| remove_columns | Optional[Union[str, list[str]]] | No | Columns to remove while mapping. They are removed before the example is updated with the output of function, so a column that function returns under one of these names is kept. |
| keep_in_memory | bool | No | Keep the result in memory instead of writing it to a cache file on disk. Defaults to False. |
| load_from_cache_file | Optional[bool] | No | Use the cached result if available. Defaults to True if caching is enabled. |
| cache_file_name | Optional[str] | No | Path for the cache file. Auto-generated if None. |
| writer_batch_size | Optional[int] | No | Rows per write operation for the cache writer. Defaults to 1000. |
| features | Optional[Features] | No | Specific Features for the output cache file. |
| disable_nullable | bool | No | Disallow null values. Defaults to False. |
| fn_kwargs | Optional[dict] | No | Keyword arguments passed to function. |
| num_proc | Optional[int] | No | Number of processes for multiprocessing. None or 0 means no multiprocessing. |
| suffix_template | str | No | Suffix template for shard cache files. Defaults to "_{rank:05d}_of_{num_proc:05d}". |
| new_fingerprint | Optional[str] | No | The new fingerprint after the transform. Auto-computed if None. |
| desc | Optional[str] | No | Description displayed alongside the progress bar. |
| try_original_type | Optional[bool] | No | Try to keep the original column types. Defaults to True. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Dataset | A new dataset with the function applied to all examples. |
Usage Examples
Basic Usage
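from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")

# Element-wise mapping: the function receives one example as a dict
def add_prefix(example):
    example["text"] = "Review: " + example["text"]
    return example

ds = ds.map(add_prefix)

# Batched mapping (e.g., tokenization); the checkpoint name is an illustrative assumption
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
ds = ds.map(lambda batch: tokenizer(batch["text"]), batched=True)

# Multiprocessing: split the work across 4 worker processes
ds = ds.map(add_prefix, num_proc=4)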