Principle:Deepspeedai DeepSpeed SP Data Preparation
Overview
Adapting data loading to distribute long sequences across sequence-parallel ranks by gathering and re-sharding along the sequence dimension.
Detailed Description
Sequence-parallel training requires each GPU to receive only its portion of the full sequence. The UlyssesSPDataLoaderAdapter wraps a standard PyTorch DataLoader and handles:
- Gathering samples from all SP ranks to construct the full batch
- Sharding each sequence along the sequence dimension (splitting tokens into
sp_sizechunks) - Distributing the correct shard to each SP rank
This ensures each GPU processes only S/sp_size tokens while maintaining correct sequence ordering.
The adapter implements a round-robin scheme: all SP ranks participate in processing a single data loader sample. When sp_world_size iterations are completed, it is equivalent to performing a single iteration of normal data-parallel training. The key invariant is:
- Rank 0 gets shard 0 of batch 0
- Rank 1 gets shard 1 of batch 0
- Rank k gets shard k of batch 0
Then on the next iteration:
- Rank 0 gets shard 0 of batch 1
- Rank 1 gets shard 1 of batch 1
- And so on...
The adapter also handles special processing for training:
- Labels are shifted (for causal language modeling) and padded with
-100(ignore index) - Attention masks are not used by Ulysses (it relies on
position_idsinstead, which are much smaller in memory) - Non-tensor entries in the batch dictionary are copied to all ranks unchanged
- Variable-length sequences are supported via all-gather of sequence lengths before redistribution
Theoretical Basis
For sequence length S across P SP ranks, rank i receives tokens [i*S/P : (i+1)*S/P]. The data loader must handle:
- Padding: Sequences not evenly divisible by P raise a
ValueError, requiring the upstream data loader to produce sequences with lengths divisible bysp_size. - Communication pattern: All-gather is used to collect batches from all SP ranks, then each rank takes its local shard. This ensures all ranks see the same full sequences before sharding.
- Memory optimization: After sharding, non-local data is kept on CPU to minimize GPU memory usage, which is critical for very long sequences (>10M tokens at 32+ GPUs can consume GBs of memory for prefill buffers).
| Step | Operation | Data Shape Per Rank |
|---|---|---|
| 1. DataLoader yields | Original batch | [B, S_local]
|
| 2. All-gather lengths | Collect sequence lengths | [sp_world_size]
|
| 3. All-gather tensors | Collect full batches | [B, S_i] per rank i
|
| 4. Shard on seq dim | Split each full batch | [B, S_full / sp_size]
|
| 5. Return local shard | Yield to training loop | [B, S_full / sp_size]
|
Related Pages
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/
- https://arxiv.org/abs/2506.13996
Last updated: 2026-02-09 00:00 GMT