Heuristic:Huggingface Datasets Num Proc Guidelines
Overview
Parallelization guidelines for num_proc in Dataset.map(), DataLoader workers, and distributed training. The Hugging Face Datasets library enforces several automatic capping rules to prevent resource waste when the requested parallelism exceeds the actual data granularity (number of examples or shards).
Description
The library applies three distinct capping rules depending on the context:
- Dataset operations (map, filter): When num_proc exceeds the number of rows in the dataset, it is reduced to len(dataset) and a warning is logged. Each process needs at least one example to work on; any additional processes would be idle.
- Dataset building (download/prepare): When num_proc exceeds the number of shards in a given split, it is reduced to the number of available shards. If only one shard exists, multiprocessing is disabled entirely by resetting num_proc to 1.
- DataLoader workers and distributed training: When num_workers exceeds dataset.num_shards, surplus workers are stopped. In distributed settings, shard-level distribution is preferred over example-level skipping, but this requires that num_shards be a multiple of world_size (num_shards % world_size == 0).
Usage
These guidelines apply whenever you configure:
- num_proc in Dataset.map(), Dataset.filter(), or other batch-processing operations
- num_workers in a PyTorch DataLoader wrapping an IterableDataset
- world_size and shard counts in distributed training with split_dataset_by_node()
The Insight (Rule of Thumb)
- Action 1: For Dataset.map(num_proc=N), set N <= len(dataset); when building a dataset (download/prepare), also keep N <= the number of shards in the split. The library will auto-cap, but setting it correctly avoids warning messages and the upfront overhead of spawning unnecessary processes.
- Action 2: For DataLoader num_workers, set num_workers <= dataset.num_shards. Use dataset.reshard() to increase the number of shards if you need more parallelism.
- Action 3: For distributed training, ensure num_shards % world_size == 0 for optimal shard-level distribution. When this condition is not met, the library falls back to example-level skipping, which is significantly less efficient.
- Trade-off: Too many workers waste resources (process spawn overhead, memory duplication, idle workers); too few leave CPUs idle and underutilize available hardware.
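The three actions can be turned into a small pre-flight check. The sketch below is plain Python and does not call the Datasets library; check_parallel_config and its parameters are hypothetical names introduced here for illustration, and the caller is assumed to supply the row count and shard count.

```python
def check_parallel_config(num_rows, num_shards, num_proc=None, num_workers=0, world_size=1):
    """Return notes for settings that the rules above say would be capped or fall back.

    Hypothetical helper: it only mirrors the three rules of thumb and
    does not inspect a real dataset.
    """
    notes = []
    # Action 1: more processes than rows means some processes get no examples.
    if num_proc is not None and num_proc > num_rows:
        notes.append(f"num_proc={num_proc} exceeds dataset size {num_rows}")
    # Action 2: more DataLoader workers than shards means surplus workers are stopped.
    if num_workers > num_shards:
        notes.append(f"num_workers={num_workers} exceeds num_shards={num_shards}")
    # Action 3: uneven shard counts trigger the example-level fallback.
    if world_size > 1 and num_shards % world_size != 0:
        notes.append(f"num_shards={num_shards} is not a multiple of world_size={world_size}")
    return notes


# A configuration that satisfies all three rules produces no notes.
print(check_parallel_config(1000, 8, num_proc=4, num_workers=8, world_size=4))
# A configuration that violates all three produces one note per rule.
for note in check_parallel_config(3, 2, num_proc=8, num_workers=4, world_size=3):
    print(note)
```

Running a check like this before launching a job surfaces the same conditions the library would otherwise report through warnings at runtime.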
Reasoning
The fundamental constraint is that each worker is assigned one or more shards of data to process. A shard is the smallest independently-processable unit: a single Arrow file, a single data source file, or a single partition of in-memory rows.
When the number of workers exceeds the number of shards:
- Extra workers have no data to process and sit idle
- Process spawning overhead is incurred for no benefit
- In distributed training, falling back from shard-level to example-level distribution means every worker must iterate through the entire dataset and skip non-assigned examples, wasting I/O bandwidth
The optimal configuration is num_proc == num_shards (or a divisor thereof), ensuring each worker gets an equal share of the data with no wasted resources. For Dataset.map() on in-memory datasets, the data is partitioned into num_proc chunks, so num_proc <= len(dataset) is the hard upper bound.
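To make the "num_shards or a divisor thereof" guidance concrete, the following sketch lists the worker counts that split a given number of shards evenly. balanced_worker_counts is a hypothetical helper written for this page, not part of the library, and max_workers stands in for whatever CPU budget is available.

```python
def balanced_worker_counts(num_shards, max_workers):
    """List worker counts (up to max_workers) that divide num_shards evenly.

    Hypothetical helper for illustration only.
    """
    upper = min(num_shards, max_workers)
    return [n for n in range(1, upper + 1) if num_shards % n == 0]


# With 12 shards and at most 8 workers, these counts give every worker
# the same number of shards.
print(balanced_worker_counts(12, 8))  # [1, 2, 3, 4, 6]
```

A prime shard count such as 7 only balances with 1 or 7 workers, which is one reason resharding to a more divisible count can help.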
Code Evidence
Capping num_proc to dataset size (arrow_dataset.py:3122-3126)
if num_proc is not None and num_proc > len(self):
    num_proc = len(self)
    logger.warning(
        f"num_proc must be <= {len(self)}. Reducing num_proc to {num_proc} for dataset of size {len(self)}."
    )
This ensures no process is spawned without at least one example to handle during Dataset.map() and similar operations.
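The effect of the cap is easiest to see by splitting rows into per-process chunks. The sketch below assumes a contiguous split in which the first chunks absorb any remainder; that split scheme and the chunk_sizes name are assumptions made for illustration, not a statement of how the library partitions rows.

```python
def chunk_sizes(num_rows, num_proc):
    """Rows per process for a contiguous split (assumed scheme, for illustration)."""
    base, extra = divmod(num_rows, num_proc)
    # The first `extra` processes take one additional row each.
    return [base + 1 if rank < extra else base for rank in range(num_proc)]


# Without the cap, 5 processes over 3 rows leaves two processes empty.
print(chunk_sizes(3, 5))  # [1, 1, 1, 0, 0]

# With num_proc capped to the dataset size, every process has work.
capped = min(5, 3)
print(chunk_sizes(3, capped))  # [1, 1, 1]
```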
Capping num_proc to shard count during build (builder.py:1407-1418)
if num_proc and num_proc > 1:
    num_original_shards = _number_of_shards_in_gen_kwargs(split_generator.gen_kwargs)
    if num_original_shards <= 1:
        logger.warning(
            f"Setting num_proc from {num_proc} back to 1 for the {split_info.name} split to disable multiprocessing as it only contains one shard."
        )
        num_proc = 1
    elif num_original_shards < num_proc:
        logger.warning(
            f"Setting num_proc from {num_proc} to {num_original_shards} for the {split_info.name} split as it only contains {num_original_shards} shards."
        )
        num_proc = num_original_shards
During dataset preparation, this two-tier check either disables multiprocessing entirely (single shard) or reduces to the shard count.
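The two-tier check can be exercised in isolation. The function below is a standalone restatement of the excerpt's branching with the logging removed; cap_num_proc_for_build is a hypothetical name, and the shard count is passed in directly rather than derived from gen_kwargs.

```python
def cap_num_proc_for_build(num_proc, num_original_shards):
    """Standalone restatement of the excerpt's branching (hypothetical helper)."""
    if num_proc and num_proc > 1:
        if num_original_shards <= 1:
            # Single shard: multiprocessing is disabled.
            return 1
        if num_original_shards < num_proc:
            # Fewer shards than processes: reduce to the shard count.
            return num_original_shards
    # None, 0, 1, or an already valid value passes through unchanged.
    return num_proc


print(cap_num_proc_for_build(8, 1))     # 1
print(cap_num_proc_for_build(8, 3))     # 3
print(cap_num_proc_for_build(2, 5))     # 2
print(cap_num_proc_for_build(None, 5))  # None
```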
DataLoader worker capping (iterable_dataset.py:2428-2437)
if self._is_main_process() and ex_iterable.num_shards < worker_info.num_workers:
    logger.warning(
        f"Too many dataloader workers: {worker_info.num_workers} (max is dataset.num_shards={ex_iterable.num_shards}). "
        f"Stopping {worker_info.num_workers - ex_iterable.num_shards} dataloader workers."
    )
    logger.info(
        f"To parallelize data loading, we give each process some shards (or data sources) to process. "
        f"Therefore it's unnecessary to have a number of workers greater than dataset.num_shards={ex_iterable.num_shards}. "
        f"To enable more parallelism, please split the dataset in more files than {ex_iterable.num_shards} or try `dataset = dataset.reshard()` ..."
    )
PyTorch DataLoader workers beyond the shard count are stopped. The info message explicitly recommends dataset.reshard() as the solution for increasing parallelism.
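A small simulation shows why the surplus workers have nothing to do. The round-robin assignment used here is an assumption chosen for simplicity, and assign_shards is a hypothetical helper; the library's actual assignment order may differ, but the count of idle workers is the same for any scheme that hands out whole shards.

```python
def assign_shards(num_shards, num_workers):
    """Give whole shards to workers round-robin (assumed scheme, for illustration)."""
    return [list(range(worker, num_shards, num_workers)) for worker in range(num_workers)]


assignment = assign_shards(num_shards=3, num_workers=5)
print(assignment)  # [[0], [1], [2], [], []]

# Workers with an empty shard list correspond to the workers that get stopped.
idle = [worker for worker, shards in enumerate(assignment) if not shards]
print(idle)  # [3, 4]
```

The number of idle workers equals num_workers - num_shards, which matches the count reported in the warning message above.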
Distributed training shard distribution (iterable_dataset.py:2510-2525)
Shard-level distribution is the preferred strategy for distributed training. Each node receives a disjoint subset of shards, eliminating redundant I/O. However, when num_shards is not evenly divisible by world_size, the library falls back to example-level skipping where every node reads all shards but only processes every Nth example. This fallback is significantly less efficient due to wasted I/O.
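The two strategies can be contrasted with a short sketch. It assumes that, in shard mode, rank r takes shards r, r + world_size, and so on, and that, in the fallback, rank r keeps examples whose index is congruent to r modulo world_size. Both assignment details and the function names are assumptions for illustration; only the divisibility condition comes from the text above.

```python
def plan_for_rank(num_shards, world_size, rank):
    """Pick the distribution mode for one rank (hypothetical helper)."""
    if num_shards % world_size == 0:
        # Shard-level: each rank reads a disjoint subset of shards.
        return {"mode": "shards", "shards": list(range(rank, num_shards, world_size))}
    # Fallback: every rank reads everything and keeps 1 example out of world_size.
    return {"mode": "examples", "keep_every": world_size, "offset": rank}


def examples_for_rank(examples, world_size, rank):
    """Example-level skipping: iterate over all examples, keep only this rank's."""
    return [ex for index, ex in enumerate(examples) if index % world_size == rank]


print(plan_for_rank(num_shards=8, world_size=4, rank=1))
# {'mode': 'shards', 'shards': [1, 5]}
print(plan_for_rank(num_shards=6, world_size=4, rank=1))
# {'mode': 'examples', 'keep_every': 4, 'offset': 1}
print(examples_for_rank(list(range(10)), world_size=4, rank=1))
# [1, 5, 9]
```

In the fallback, examples_for_rank touches all 10 examples to keep 3, which illustrates the wasted I/O described above.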
Process pool spawning (parallel/parallel.py:62-72)
The Pool is spawned with a tqdm lock initializer to ensure progress bars render correctly across multiple processes. This is the underlying mechanism that executes the parallel work once num_proc has been validated and capped.