Principle:Deepspeedai DeepSpeed SP Data Preparation

From Leeroopedia


Overview

Adapting data loading to distribute long sequences across sequence-parallel ranks by gathering and re-sharding along the sequence dimension.

Detailed Description

Sequence-parallel (SP) training requires each GPU to receive only its own portion of the full sequence. The UlyssesSPDataLoaderAdapter wraps a standard PyTorch DataLoader and handles:

  1. Gathering samples from all SP ranks to construct the full batch
  2. Sharding each sequence along the sequence dimension (splitting tokens into sp_size chunks)
  3. Distributing the correct shard to each SP rank

This ensures each GPU processes only S/sp_size tokens while maintaining correct sequence ordering.
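
The three steps above can be sketched in plain Python, with lists standing in for tensors and an ordinary list copy standing in for the collective all-gather. This is a minimal single-process illustration, not the adapter's actual code; the function names shard_sequence and simulate_gather_and_shard are hypothetical.

```python
def shard_sequence(tokens, sp_size):
    """Split one full sequence into sp_size contiguous, equal-length shards."""
    if len(tokens) % sp_size != 0:
        raise ValueError(
            f"sequence length {len(tokens)} is not divisible by sp_size={sp_size}"
        )
    chunk = len(tokens) // sp_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(sp_size)]


def simulate_gather_and_shard(per_rank_samples):
    """Return, for every rank, its shard of every gathered sample.

    per_rank_samples[r] is the sample that rank r loaded on its own.
    """
    sp_size = len(per_rank_samples)
    # Step 1: stand-in for all-gather; every rank now sees all samples.
    gathered = list(per_rank_samples)
    # Steps 2 and 3: shard each sample and hand shard `rank` to rank `rank`.
    return [
        [shard_sequence(sample, sp_size)[rank] for sample in gathered]
        for rank in range(sp_size)
    ]


# Two SP ranks, each of which loaded one 8-token sample.
samples = [list(range(0, 8)), list(range(100, 108))]
per_rank = simulate_gather_and_shard(samples)
print(per_rank[0])  # shards held by rank 0
print(per_rank[1])  # shards held by rank 1
```

Concatenating a sample's shards in rank order reproduces the original sequence, which is the ordering guarantee described above.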

The adapter implements a round-robin scheme in which all SP ranks cooperate on a single data loader batch at a time. After sp_world_size such iterations, the SP group has consumed as much data as one iteration of ordinary data-parallel training would. The key invariant is:

  • Rank 0 gets shard 0 of batch 0
  • Rank 1 gets shard 1 of batch 0
  • Rank k gets shard k of batch 0

Then on the next iteration:

  • Rank 0 gets shard 0 of batch 1
  • Rank 1 gets shard 1 of batch 1
  • And so on...
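
The schedule above can be written as a small generator. This sketch assumes the batches have already been gathered and that every length is divisible by the number of ranks; iterate_shards is a hypothetical name, not part of DeepSpeed.

```python
def iterate_shards(gathered_batches):
    """Yield one {rank: shard} mapping per training iteration.

    gathered_batches holds one full sequence per SP rank, so its length
    doubles as sp_world_size in this simplified model.
    """
    sp_world_size = len(gathered_batches)
    for batch in gathered_batches:
        # Assumes len(batch) is divisible by sp_world_size.
        chunk = len(batch) // sp_world_size
        yield {
            rank: batch[rank * chunk:(rank + 1) * chunk]
            for rank in range(sp_world_size)
        }


batches = [[0, 1, 2, 3], [10, 11, 12, 13]]  # batch 0 and batch 1
schedule = list(iterate_shards(batches))
for iteration, assignment in enumerate(schedule):
    print(iteration, assignment)
```

Each loop iteration corresponds to one training step in which every rank works on a different shard of the same batch.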

The adapter also handles special processing for training:

  • Labels are shifted (for causal language modeling) and padded with -100 (ignore index)
  • Attention masks are not used by Ulysses (it relies on position_ids instead, which are much smaller in memory)
  • Non-tensor entries in the batch dictionary are copied to all ranks unchanged
  • Variable-length sequences are supported via all-gather of sequence lengths before redistribution
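
The label handling can be illustrated as follows. The sketch assumes the shift is applied to the full sequence before it is sharded, so that the last position of one shard still gets the first token of the next shard as its target; that ordering, and the helper names shift_labels and shard, are assumptions made for illustration.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def shift_labels(labels):
    """Shift left by one so position t holds token t+1; pad the final slot."""
    return labels[1:] + [IGNORE_INDEX]


def shard(seq, sp_size, rank):
    """Return the contiguous shard of seq that belongs to `rank`."""
    chunk = len(seq) // sp_size  # assumes len(seq) is divisible by sp_size
    return seq[rank * chunk:(rank + 1) * chunk]


input_ids = [11, 12, 13, 14, 15, 16, 17, 18]
position_ids = list(range(len(input_ids)))
shifted = shift_labels(input_ids)

sp_size = 2
for rank in range(sp_size):
    print(
        rank,
        shard(input_ids, sp_size, rank),
        shard(position_ids, sp_size, rank),
        shard(shifted, sp_size, rank),
    )
```

In this example rank 0 holds input tokens 11 to 14, and its final label is 15, a token that lives on rank 1. Only the last position of the whole sequence receives the ignore index.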

Theoretical Basis

For a sequence of length S split across P SP ranks (P = sp_size), rank i receives tokens [i*S/P : (i+1)*S/P]. The data loader must handle:

  • Padding: A sequence whose length is not evenly divisible by P raises a ValueError, so the upstream data loader must produce (for example, by padding) sequences whose lengths are divisible by sp_size.
  • Communication pattern: All-gather is used to collect batches from all SP ranks, then each rank takes its local shard. This ensures all ranks see the same full sequences before sharding.
  • Memory optimization: After sharding, non-local data is kept on CPU to minimize GPU memory usage. This is critical for very long sequences: at more than 10M tokens on 32 or more GPUs, prefill buffers alone can consume gigabytes of memory.

| Step | Operation | Data Shape Per Rank |
|---|---|---|
| 1. DataLoader yields | Original batch | [B, S_local] |
| 2. All-gather lengths | Collect sequence lengths | [sp_world_size] |
| 3. All-gather tensors | Collect full batches | [B, S_i] per rank i |
| 4. Shard on seq dim | Split each full batch | [B, S_full / sp_size] |
| 5. Return local shard | Yield to training loop | [B, S_full / sp_size] |
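
The index arithmetic and the divisibility check can be sketched as below. The code follows the slice formula given at the start of this section and models variable-length batches as a list of gathered lengths, one per source rank. Both function names are hypothetical, and the error message is illustrative rather than the one DeepSpeed emits.

```python
def shard_bounds(seq_len, sp_size, rank):
    """Return the [start, end) token range that `rank` receives."""
    if seq_len % sp_size != 0:
        raise ValueError(
            f"sequence length {seq_len} must be divisible by sp_size={sp_size}"
        )
    chunk = seq_len // sp_size
    return rank * chunk, (rank + 1) * chunk


def local_shard_shapes(batch_size, gathered_seq_lens, sp_size):
    """Per-rank shard shape for each gathered batch (lengths may differ)."""
    shapes = []
    for seq_len in gathered_seq_lens:
        start, end = shard_bounds(seq_len, sp_size, rank=0)
        shapes.append((batch_size, end - start))
    return shapes


# Two SP ranks whose loaders produced sequences of length 8 and 12.
print(shard_bounds(8, 2, 1))
print(shard_bounds(12, 2, 1))
print(local_shard_shapes(1, [8, 12], 2))
```

A length such as 10 with sp_size=4 fails the check, which is why the upstream data loader is expected to produce divisible lengths.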

Last updated: 2026-02-09 00:00 GMT