Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Heuristic:Huggingface Datasets Num Proc Guidelines

From Leeroopedia

Overview

Parallelization guidelines for num_proc in Dataset.map(), DataLoader workers, and distributed training. The Hugging Face Datasets library enforces several automatic capping rules to prevent resource waste when the requested parallelism exceeds the actual data granularity (number of examples or shards).

Description

The library applies three distinct capping rules depending on the context:

  1. Dataset operations (map, filter): When num_proc exceeds the number of rows in the dataset, it is silently reduced to len(dataset). Each process needs at least one example to work on; any additional processes would be idle.
  2. Dataset building (download/prepare): When num_proc exceeds the number of shards in a given split, it is reduced to the number of available shards. If only one shard exists, multiprocessing is disabled entirely by resetting num_proc to 1.
  3. DataLoader workers and distributed training: When num_workers exceeds dataset.num_shards, surplus workers are stopped. In distributed settings, shard-level distribution is preferred over example-level skipping, but this requires that num_shards be a factor of world_size.

Usage

These guidelines apply whenever you configure:

  • num_proc in Dataset.map(), Dataset.filter(), or other batch-processing operations
  • num_workers in a PyTorch DataLoader wrapping an IterableDataset
  • world_size and shard counts in distributed training with split_dataset_by_node()

The Insight (Rule of Thumb)

  • Action 1: For Dataset.map(num_proc=N), set N <= dataset size AND N <= number of shards. The library will auto-cap, but setting it correctly avoids warning messages and upfront overhead from spawning unnecessary processes.
  • Action 2: For DataLoader num_workers, set <= dataset.num_shards. Use dataset.reshard() to increase the number of shards if you need more parallelism.
  • Action 3: For distributed training, ensure num_shards % world_size == 0 for optimal shard-level distribution. When this condition is not met, the library falls back to example-level skipping which is significantly less efficient.
  • Trade-off: Too many workers wastes resources (process spawn overhead, memory duplication, idle workers); too few leaves CPUs idle and underutilizes available hardware.

Reasoning

The fundamental constraint is that each worker is assigned one or more shards of data to process. A shard is the smallest independently-processable unit: a single Arrow file, a single data source file, or a single partition of in-memory rows.

When the number of workers exceeds the number of shards:

  • Extra workers have no data to process and sit idle
  • Process spawning overhead is incurred for no benefit
  • In distributed training, falling back from shard-level to example-level distribution means every worker must iterate through the entire dataset and skip non-assigned examples, wasting I/O bandwidth

The optimal configuration is num_proc == num_shards (or a divisor thereof), ensuring each worker gets an equal share of the data with no wasted resources. For Dataset.map() on in-memory datasets, the data is partitioned into num_proc chunks, so num_proc <= len(dataset) is the hard upper bound.

Code Evidence

Capping num_proc to dataset size (arrow_dataset.py:3122-3126)

if num_proc is not None and num_proc > len(self):
    num_proc = len(self)
    logger.warning(
        f"num_proc must be <= {len(self)}. Reducing num_proc to {num_proc} for dataset of size {len(self)}."
    )

This ensures no process is spawned without at least one example to handle during Dataset.map() and similar operations.

Capping num_proc to shard count during build (builder.py:1407-1418)

if num_proc and num_proc > 1:
    num_original_shards = _number_of_shards_in_gen_kwargs(split_generator.gen_kwargs)
    if num_original_shards <= 1:
        logger.warning(
            f"Setting num_proc from {num_proc} back to 1 for the {split_info.name} split to disable multiprocessing as it only contains one shard."
        )
        num_proc = 1
    elif num_original_shards < num_proc:
        logger.warning(
            f"Setting num_proc from {num_proc} to {num_original_shards} for the {split_info.name} split as it only contains {num_original_shards} shards."
        )
        num_proc = num_original_shards

During dataset preparation, this two-tier check either disables multiprocessing entirely (single shard) or reduces to the shard count.

DataLoader worker capping (iterable_dataset.py:2428-2437)

if self._is_main_process() and ex_iterable.num_shards < worker_info.num_workers:
    logger.warning(
        f"Too many dataloader workers: {worker_info.num_workers} (max is dataset.num_shards={ex_iterable.num_shards}). "
        f"Stopping {worker_info.num_workers - ex_iterable.num_shards} dataloader workers."
    )
    logger.info(
        f"To parallelize data loading, we give each process some shards (or data sources) to process. "
        f"Therefore it's unnecessary to have a number of workers greater than dataset.num_shards={ex_iterable.num_shards}. "
        f"To enable more parallelism, please split the dataset in more files than {ex_iterable.num_shards} or try `dataset = dataset.reshard()` ..."
    )

PyTorch DataLoader workers beyond the shard count are stopped. The info message explicitly recommends dataset.reshard() as the solution for increasing parallelism.

Distributed training shard distribution (iterable_dataset.py:2510-2525)

Shard-level distribution is the preferred strategy for distributed training. Each node receives a disjoint subset of shards, eliminating redundant I/O. However, when num_shards is not evenly divisible by world_size, the library falls back to example-level skipping where every node reads all shards but only processes every Nth example. This fallback is significantly less efficient due to wasted I/O.

Process pool spawning (parallel/parallel.py:62-72)

The Pool is spawned with a tqdm lock initializer to ensure progress bars render correctly across multiple processes. This is the underlying mechanism that executes the parallel work once num_proc has been validated and capped.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment