Principle:Deepspeedai DeepSpeed SP Data Preparation

Overview

Adapting data loading to distribute long sequences across sequence-parallel ranks by gathering and re-sharding along the sequence dimension.

Detailed Description

Sequence-parallel training requires each GPU to receive only its portion of the full sequence. The UlyssesSPDataLoaderAdapter wraps a standard PyTorch DataLoader and handles:

Gathering samples from all SP ranks to construct the full batch
Sharding each sequence along the sequence dimension (splitting tokens into sp_size chunks)
Distributing the correct shard to each SP rank

This ensures each GPU processes only S/sp_size tokens while maintaining correct sequence ordering.
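The three steps above can be sketched without any distributed runtime. This is a minimal illustration, not the DeepSpeed implementation: plain Python lists stand in for tensors, a list of per-rank sequences stands in for the result of the all-gather, and the helper names `shard` and `gather_and_shard` are hypothetical.

```python
def shard(seq, sp_rank, sp_size):
    """Return the contiguous chunk of `seq` owned by `sp_rank`."""
    chunk_len = len(seq) // sp_size
    return seq[chunk_len * sp_rank : chunk_len * (sp_rank + 1)]


def gather_and_shard(per_rank_batches, sp_rank):
    """Simulate one rank's view after gathering and sharding.

    per_rank_batches[r] is the sequence produced by rank r's DataLoader.
    The list itself plays the role of the all-gather result (assumption).
    """
    sp_size = len(per_rank_batches)
    return [shard(seq, sp_rank, sp_size) for seq in per_rank_batches]


# Two SP ranks, each holding a different-length sequence from its own DataLoader.
batches = [list(range(8)), list(range(10, 14))]
rank0_shards = gather_and_shard(batches, sp_rank=0)
rank1_shards = gather_and_shard(batches, sp_rank=1)
print(rank0_shards)  # [[0, 1, 2, 3], [10, 11]]
print(rank1_shards)  # [[4, 5, 6, 7], [12, 13]]
```

Concatenating the rank 0 and rank 1 shards of the same batch restores the original token order, which is the ordering guarantee described above.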

The adapter implements a round-robin scheme: on each iteration, all SP ranks work together on a single DataLoader batch. After sp_world_size iterations, one batch from every rank has been processed, which is equivalent to a single iteration of normal data-parallel training. The key invariant is:

Rank 0 gets shard 0 of batch 0
Rank 1 gets shard 1 of batch 0
Rank k gets shard k of batch 0

Then on the next iteration:

Rank 0 gets shard 0 of batch 1
Rank 1 gets shard 1 of batch 1
And so on...
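The round-robin order can be simulated with a toy adapter. This sketch assumes that "batch j" means the batch contributed by SP rank j's DataLoader, and that every rank's loader yields the same number of batches. `ToySPAdapter` is a hypothetical stand-in, strings stand in for token sequences, and reading the next item from every loader stands in for the all-gather.

```python
from collections import deque


class ToySPAdapter:
    """Hypothetical single-process model of one SP rank's adapter."""

    def __init__(self, all_rank_loaders, sp_rank):
        # Each simulated rank keeps its own iterators over all loaders.
        self.loaders = [iter(dl) for dl in all_rank_loaders]
        self.sp_rank = sp_rank
        self.sp_size = len(all_rank_loaders)
        self.queue = deque()

    def _refill(self):
        # Stand-in for all-gather: fetch the next batch of every rank,
        # then keep only this rank's shard of each one.
        for dl in self.loaders:
            seq = next(dl)  # StopIteration here ends the iteration
            chunk = len(seq) // self.sp_size
            self.queue.append(seq[chunk * self.sp_rank : chunk * (self.sp_rank + 1)])

    def __iter__(self):
        return self

    def __next__(self):
        if not self.queue:
            self._refill()
        return self.queue.popleft()


loaders = [["abcd", "efgh"], ["ijkl", "mnop"]]  # loaders[r] belongs to rank r
rank0_view = list(ToySPAdapter(loaders, sp_rank=0))
rank1_view = list(ToySPAdapter(loaders, sp_rank=1))
print(rank0_view)  # ['ab', 'ij', 'ef', 'mn']
print(rank1_view)  # ['cd', 'kl', 'gh', 'op']
```

At every position both ranks hold shards of the same batch, and each group of sp_size consecutive items consumes exactly one batch per rank.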

The adapter also handles special processing for training:

Labels are shifted (for causal language modeling) and padded with -100 (ignore index)
Attention masks are not used by Ulysses; it relies on position_ids instead, which take far less memory
Non-tensor entries in the batch dictionary are copied to all ranks unchanged
Variable-length sequences are supported via all-gather of sequence lengths before redistribution
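The label handling can be illustrated on plain lists. This sketch assumes the shift is applied to the full sequence before it is sharded, so that the last position of one shard can still see the first token of the next shard as its target. The helper names are hypothetical.

```python
IGNORE_INDEX = -100  # ignore index used for positions without a target


def shift_labels(labels):
    """Pad one ignore index on the right, then drop the first label.

    Position t then holds the token that follows position t.
    """
    padded = labels + [IGNORE_INDEX]
    return padded[1:]


def shard(seq, sp_rank, sp_size):
    """Return the contiguous chunk of `seq` owned by `sp_rank`."""
    chunk_len = len(seq) // sp_size
    return seq[chunk_len * sp_rank : chunk_len * (sp_rank + 1)]


input_ids = [11, 12, 13, 14, 15, 16, 17, 18]
shifted = shift_labels(list(input_ids))
print(shifted)  # [12, 13, 14, 15, 16, 17, 18, -100]

rank0_labels = shard(shifted, sp_rank=0, sp_size=2)
rank1_labels = shard(shifted, sp_rank=1, sp_size=2)
print(rank0_labels)  # [12, 13, 14, 15]
print(rank1_labels)  # [16, 17, 18, -100]
```

Rank 0's final label, 15, is the first input token of rank 1's shard. Sharding first and shifting afterwards would lose that target at every shard boundary.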

Theoretical Basis

For sequence length S across P SP ranks, rank i receives tokens [i*S/P : (i+1)*S/P]. The data loader must handle:

Padding: The adapter does not pad. A sequence whose length is not evenly divisible by P raises a ValueError, so the upstream data loader must produce sequences whose lengths are divisible by sp_size.
Communication pattern: All-gather is used to collect batches from all SP ranks, then each rank takes its local shard. This ensures all ranks see the same full sequences before sharding.
Memory optimization: After sharding, the buffered shards are kept on CPU to minimize GPU memory usage. This is critical for very long sequences: at more than 10M tokens on 32 or more GPUs, the prefill buffers can consume gigabytes of memory.
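The slice bounds and the divisibility requirement can be written down directly. In this sketch `shard_bounds` mirrors the formula [i*S/P : (i+1)*S/P], and `pad_to_multiple` is a hypothetical upstream helper showing one way to satisfy the requirement. Neither name comes from DeepSpeed, and the pad id of 0 is an assumption.

```python
def shard_bounds(seq_len, sp_rank, sp_size):
    """Return (start, end) of the token range owned by `sp_rank`."""
    if seq_len % sp_size != 0:
        raise ValueError(
            f"seqlen={seq_len} is not divisible by sp_size={sp_size}"
        )
    chunk_len = seq_len // sp_size
    return chunk_len * sp_rank, chunk_len * (sp_rank + 1)


def pad_to_multiple(tokens, multiple, pad_id=0):
    """Hypothetical upstream step: right-pad to a multiple of `multiple`."""
    remainder = len(tokens) % multiple
    if remainder:
        tokens = tokens + [pad_id] * (multiple - remainder)
    return tokens


bounds = [shard_bounds(12, rank, 4) for rank in range(4)]
print(bounds)  # [(0, 3), (3, 6), (6, 9), (9, 12)]

padded = pad_to_multiple(list(range(10)), multiple=4)
print(len(padded))  # 12
```

A length of 10 with sp_size 4 would raise the ValueError, while the padded length of 12 shards cleanly into four chunks of 3 tokens.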

Step	Operation	Data Shape Per Rank
1. DataLoader yields	Original batch	`[B, S_local]`
2. All-gather lengths	Collect sequence lengths	`[sp_world_size]`
3. All-gather tensors	Collect full batches	`[B, S_i]` per rank i
4. Shard on seq dim	Split each gathered batch	`[B, S_i / sp_size]` for each rank i's batch
5. Return local shard	Yield to training loop	`[B, S_i / sp_size]`
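The shapes in the table can be traced with a small helper. This sketch assumes that sequence lengths are gathered first so that each rank can size its receive buffers for variable-length batches. `trace_shapes` is a hypothetical function that works on shape tuples only and moves no data.

```python
def trace_shapes(batch_size, local_seq_lens):
    """Return (gathered_shapes, shard_shapes) for the steps in the table.

    local_seq_lens[r] is the sequence length of rank r's own batch.
    """
    sp_size = len(local_seq_lens)

    # Step 2: after gathering lengths, every rank knows all of them.
    gathered_lens = list(local_seq_lens)

    # Step 3: receive buffers are sized from the gathered lengths.
    gathered_shapes = [(batch_size, s) for s in gathered_lens]

    # Steps 4 and 5: each gathered batch is split along the sequence dim.
    shard_shapes = []
    for s in gathered_lens:
        if s % sp_size != 0:
            raise ValueError(f"seqlen={s} is not divisible by sp_size={sp_size}")
        shard_shapes.append((batch_size, s // sp_size))
    return gathered_shapes, shard_shapes


gathered, shards = trace_shapes(batch_size=1, local_seq_lens=[8, 4])
print(gathered)  # [(1, 8), (1, 4)]
print(shards)    # [(1, 4), (1, 2)]
```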

Related Pages

Implementation:Deepspeedai_DeepSpeed_UlyssesSPDataLoaderAdapter_Init

Last updated: 2026-02-09 00:00 GMT
