Principle:Huggingface Datasets Distributed Dataset Splitting
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Partitioning datasets across multiple nodes for distributed training ensures each node processes a disjoint subset of the data without redundancy or gaps.
Description
In distributed training, multiple processes (nodes or GPUs) train a model in parallel, each consuming a different portion of the dataset. Distributed dataset splitting is the mechanism that assigns a disjoint, non-overlapping partition of the data to each node based on its rank (unique identifier) and the world size (total number of nodes).
The splitting strategy differs depending on the dataset type:
- Map-style datasets (
Dataset): Each node is assigned a contiguous chunk of rows. The dataset is divided intoworld_sizechunks, and node at rankrreceives chunkr. Contiguous allocation maximizes I/O throughput by preserving data locality on disk.
- Iterable (streaming) datasets (
IterableDataset): Two strategies are used depending on shard count:- Shard-aligned: If the number of shards is divisible by
world_size, shards are evenly assigned to nodes. This is the most efficient approach because each node reads complete files without any wasted I/O. - Interleaved: If shard count is not divisible by
world_size, each node keeps everyworld_size-th example (i.e., noderkeeps examples at indicesr, r + world_size, r + 2*world_size, ...). This is less efficient because every node must read the full stream but discard most elements.
- Shard-aligned: If the number of shards is divisible by
Key considerations:
- Determinism: All nodes must agree on the same data ordering for the interleaved strategy to produce disjoint subsets. If the dataset is shuffled, the same seed must be used on all nodes.
- No communication: Splitting is computed locally by each node based on rank and world_size; no inter-node communication is required.
- Composability: Distributed splitting can be combined with all other streaming operations (map, filter, shuffle, take, skip).
Usage
Use distributed dataset splitting when:
- You are training a model with PyTorch DistributedDataParallel (DDP), DeepSpeed, or similar multi-process frameworks.
- You need each GPU/node to process a unique subset of data per epoch.
- You are using streaming datasets in a multi-node setup and want to avoid duplicate data processing.
- You need to scale data loading across a cluster without a centralized data server.
Theoretical Basis
Distributed dataset splitting implements a partition function over the data: given a set D of N elements and K nodes, the function produces K disjoint subsets D_0, D_1, ..., D_{K-1} such that D_0 U D_1 U ... U D_{K-1} = D and D_i ∩ D_j = {} for i != j.
For map-style datasets, this is a block partition: D_r = D[r*N/K : (r+1)*N/K]. For iterable datasets with aligned shards, it is a shard-level partition. For the interleaved case, it is a strided partition (also called round-robin assignment): D_r = {D[i] : i mod K == r}.
The strided partition is related to systematic sampling in statistics, where every K-th element is selected starting from a given offset. While not a random sample, it ensures uniform coverage of the dataset across nodes when the data is well-ordered.