Implementation:Deepspeedai DeepSpeed UlyssesSPDataLoaderAdapter Init
Overview
A concrete tool, provided by the DeepSpeed library, for adapting data loaders to sequence-parallel (SP) training.
Description
UlyssesSPDataLoaderAdapter wraps a standard PyTorch DataLoader and redistributes sequences across SP ranks. When no micro-batches are queued, a call to __next__:
- Calls next() on the wrapped data loader to get a batch from the current rank
- Gathers all batches across the SP group via all_gather
- For each gathered batch, shards every tensor on the sequence dimension (dim=1) into sp_world_size chunks
- Shifts labels for causal language modeling (appends -100 padding and shifts by one position)
- Stores the local rank's shard for each batch as a micro-batch
- Returns micro-batches one at a time on subsequent __next__ calls
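The label shift and the per-rank slicing described above can be illustrated with a minimal sketch. It is not the library's code: it uses plain Python lists in place of tensors so that it runs without a GPU or a process group, and the names shift_labels, shard_rows, and shard_batch are hypothetical. The sketch shifts labels on the full-length sequence before slicing, so that the last position of each shard keeps the first label of the next shard; both that ordering and the divisibility check are assumptions of the sketch.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def shift_labels(labels):
    """Append one IGNORE_INDEX to each row, then drop the first position."""
    return [(row + [IGNORE_INDEX])[1:] for row in labels]


def shard_rows(rows, sp_rank, sp_world_size):
    """Return this rank's contiguous slice of every row along dim=1."""
    seqlen = len(rows[0])
    if seqlen % sp_world_size != 0:
        # Assumption of this sketch: seqlen splits evenly across SP ranks.
        raise ValueError("seqlen must be divisible by sp_world_size")
    chunk = seqlen // sp_world_size
    return [row[chunk * sp_rank:chunk * (sp_rank + 1)] for row in rows]


def shard_batch(batch, sp_rank, sp_world_size):
    """Hypothetical helper: replace labels with shift_labels, then shard."""
    out = dict(batch)  # shallow copy so the caller's batch is not modified
    out["shift_labels"] = shift_labels(out.pop("labels"))
    for key, value in out.items():
        if isinstance(value, list):  # stand-in for torch.is_tensor(value)
            out[key] = shard_rows(value, sp_rank, sp_world_size)
    return out


batch = {
    "input_ids": [[10, 11, 12, 13, 14, 15, 16, 17]],
    "position_ids": [[0, 1, 2, 3, 4, 5, 6, 7]],
    "labels": [[10, 11, 12, 13, 14, 15, 16, 17]],
    "source": "doc-0",  # non-tensor entry, copied unchanged
}
shards = [shard_batch(batch, rank, 4) for rank in range(4)]
```

With a sequence of 8 tokens and 4 SP ranks, each shard holds 2 tokens per key, and only the final rank's shift_labels ends in the -100 padding value.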
The adapter expects batch dictionaries containing at minimum input_ids, position_ids, and labels keys. Tensors must have shape [batch_size, seqlen, ...] with sharding on the seqlen (dim=1) dimension. Non-tensor entries are copied unchanged.
After sharding, tensors are moved to CPU to reduce GPU memory pressure, which is critical at very long sequence lengths. The __len__ method returns len(dl) * sp_world_size, since each batch drawn from the wrapped loader yields sp_world_size micro-batches on every rank (one per batch gathered from the SP group).
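The relationship between the refill step, the micro-batch queue, and the reported length can be sketched as follows. This is a simplified, single-process stand-in rather than the adapter itself: MicroBatchQueue is a hypothetical name, the all_gather is faked by repeating the local batch once per SP rank, and no sharding is performed.

```python
from collections import deque


class MicroBatchQueue:
    """Hypothetical stand-in for the adapter's refill-then-drain iteration."""

    def __init__(self, dl, sp_world_size):
        self.dl = dl
        self.sp_world_size = sp_world_size
        self.iter = iter(dl)
        self.micro_batches = deque()

    def __len__(self):
        # One wrapped batch turns into sp_world_size micro-batches.
        return len(self.dl) * self.sp_world_size

    def __iter__(self):
        return self

    def __next__(self):
        if not self.micro_batches:
            self.refill()  # propagates StopIteration once dl is exhausted
        return self.micro_batches.popleft()

    def refill(self):
        local = next(self.iter)
        # Stand-in for all_gather: pretend each SP rank contributed a batch.
        for src_rank in range(self.sp_world_size):
            self.micro_batches.append((src_rank, local))


queue = MicroBatchQueue(dl=["b0", "b1", "b2"], sp_world_size=2)
drained = list(queue)
```

Draining the queue yields exactly len(queue) items, which is why a training loop that sizes its schedule from len(train_loader) should count micro-batches rather than wrapped batches.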
Code Reference
- Repository: https://github.com/deepspeedai/DeepSpeed
- File: deepspeed/runtime/sequence_parallel/ulysses_sp.py
- Lines: L487-628
Signature
class UlyssesSPDataLoaderAdapter:
    def __init__(
        self,
        dl: DataLoader,
        sp_rank: int,
        sp_group: ProcessGroup,
        sp_world_size: int,
        device: torch.device,
    )
Import
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPDataLoaderAdapter
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| dl | DataLoader | Yes | Base data loader yielding full-length sequence batches with input_ids, position_ids, and labels keys |
| sp_rank | int | Yes | This rank's position within the SP group |
| sp_group | ProcessGroup | Yes | The sequence-parallel process group |
| sp_world_size | int | Yes | Number of ranks in the SP group |
| device | torch.device | Yes | Target CUDA device for communication |
Outputs
| Output | Type | Description |
|---|---|---|
| Iterator | dict | Yields sequence-sharded batch dictionaries; each GPU gets S/sp_size tokens. Labels are shifted and stored as shift_labels. The original labels key is removed. |
Usage Example
import torch
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPDataLoaderAdapter
import deepspeed.runtime.sequence_parallel.parallel_state_sp as mpu

# Assumes the SP process groups are already initialized and that
# base_dataloader and engine were created earlier.
sp_group = mpu.get_sequence_parallel_group()
sp_rank = mpu.get_sequence_parallel_rank()
sp_world_size = mpu.get_sequence_parallel_world_size()
device = torch.device("cuda", torch.cuda.current_device())

train_loader = UlyssesSPDataLoaderAdapter(
    dl=base_dataloader,
    sp_rank=sp_rank,
    sp_group=sp_group,
    sp_world_size=sp_world_size,
    device=device,
)

for batch in train_loader:
    # batch contains S/sp_size tokens per sample
    # batch keys: input_ids, position_ids, shift_labels, ...
    # Note: 'labels' has been removed and replaced with 'shift_labels'
    # Shards arrive on CPU; move tensors to the GPU (non-tensor entries are dropped here)
    batch = {k: v.to(device) for k, v in batch.items() if torch.is_tensor(v)}
    outputs = engine(**batch, use_cache=False)
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/
- https://arxiv.org/abs/2506.13996
Last updated: 2026-02-09 00:00 GMT