Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Huggingface Datasets Distributed Dataset Splitting

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

Partitioning datasets across multiple nodes for distributed training ensures each node processes a disjoint subset of the data without redundancy or gaps.

Description

In distributed training, multiple processes (nodes or GPUs) train a model in parallel, each consuming a different portion of the dataset. Distributed dataset splitting is the mechanism that assigns a disjoint, non-overlapping partition of the data to each node based on its rank (unique identifier) and the world size (total number of nodes).

The splitting strategy differs depending on the dataset type:

  • Map-style datasets (Dataset): Each node is assigned a contiguous chunk of rows. The dataset is divided into world_size chunks, and node at rank r receives chunk r. Contiguous allocation maximizes I/O throughput by preserving data locality on disk.
  • Iterable (streaming) datasets (IterableDataset): Two strategies are used depending on shard count:
    • Shard-aligned: If the number of shards is divisible by world_size, shards are evenly assigned to nodes. This is the most efficient approach because each node reads complete files without any wasted I/O.
    • Interleaved: If shard count is not divisible by world_size, each node keeps every world_size-th example (i.e., node r keeps examples at indices r, r + world_size, r + 2*world_size, ...). This is less efficient because every node must read the full stream but discard most elements.

Key considerations:

  • Determinism: All nodes must agree on the same data ordering for the interleaved strategy to produce disjoint subsets. If the dataset is shuffled, the same seed must be used on all nodes.
  • No communication: Splitting is computed locally by each node based on rank and world_size; no inter-node communication is required.
  • Composability: Distributed splitting can be combined with all other streaming operations (map, filter, shuffle, take, skip).

Usage

Use distributed dataset splitting when:

  • You are training a model with PyTorch DistributedDataParallel (DDP), DeepSpeed, or similar multi-process frameworks.
  • You need each GPU/node to process a unique subset of data per epoch.
  • You are using streaming datasets in a multi-node setup and want to avoid duplicate data processing.
  • You need to scale data loading across a cluster without a centralized data server.

Theoretical Basis

Distributed dataset splitting implements a partition function over the data: given a set D of N elements and K nodes, the function produces K disjoint subsets D_0, D_1, ..., D_{K-1} such that D_0 U D_1 U ... U D_{K-1} = D and D_i ∩ D_j = {} for i != j.

For map-style datasets, this is a block partition: D_r = D[r*N/K : (r+1)*N/K]. For iterable datasets with aligned shards, it is a shard-level partition. For the interleaved case, it is a strided partition (also called round-robin assignment): D_r = {D[i] : i mod K == r}.

The strided partition is related to systematic sampling in statistics, where every K-th element is selected starting from a given offset. While not a random sample, it ensures uniform coverage of the dataset across nodes when the data is well-ordered.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment