Implementation:Deepspeedai DeepSpeed UlyssesSPDataLoaderAdapter Init

Overview

A concrete tool, provided by the DeepSpeed library, for adapting data loaders to sequence-parallel (SP) training.

Description

UlyssesSPDataLoaderAdapter wraps a standard PyTorch DataLoader and redistributes sequences across SP ranks. Whenever its queue of micro-batches is empty, __next__ refills it as follows:

1. Calls next() on the wrapped data loader to get the current rank's batch
2. Gathers the batches of all ranks in the SP group via all_gather
3. For each gathered batch, shifts the labels for causal language modeling (pads one -100 on the right and shifts by one position), working on the full-length sequence
4. Shards every tensor on the sequence dimension (dim=1) into sp_world_size chunks
5. Stores the local rank's shard of each batch as a micro-batch

Each __next__ call then returns one micro-batch from the queue.
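
The label shift and the sequence sharding can be illustrated without a GPU or a process group. The following is a minimal sketch, not the library's code: nested Python lists stand in for [batch_size, seqlen] tensors, and the helper names (shift_labels, shard_rows, make_micro_batch) are hypothetical. It assumes the shift is applied to the full-length labels before slicing, so that the last position of each shard still receives the first label of the next shard.

```python
def shift_labels(labels, ignore_index=-100):
    """Pad each row with ignore_index on the right, then drop the first column."""
    return [row[1:] + [ignore_index] for row in labels]


def shard_rows(rows, sp_rank, sp_world_size):
    """Keep this rank's contiguous slice of the sequence dimension (dim=1)."""
    chunk_len = len(rows[0]) // sp_world_size
    start = chunk_len * sp_rank
    return [row[start:start + chunk_len] for row in rows]


def make_micro_batch(batch, sp_rank, sp_world_size):
    """Turn one gathered full-length batch into this rank's micro-batch."""
    out = dict(batch)  # shallow copy so the caller's batch is not modified
    # Replace 'labels' with 'shift_labels', computed on the full sequence.
    out["shift_labels"] = shift_labels(out.pop("labels"))
    # Shard list-valued entries (the tensor stand-ins); pass others through.
    return {
        k: shard_rows(v, sp_rank, sp_world_size) if isinstance(v, list) else v
        for k, v in out.items()
    }


batch = {
    "input_ids": [[10, 11, 12, 13, 14, 15, 16, 17]],
    "position_ids": [[0, 1, 2, 3, 4, 5, 6, 7]],
    "labels": [[10, 11, 12, 13, 14, 15, 16, 17]],
    "source": "doc-0",  # non-tensor entry, copied unchanged
}

# One micro-batch per SP rank for sp_world_size = 4 (2 tokens per rank).
shards = [make_micro_batch(batch, rank, 4) for rank in range(4)]
```

With 8 tokens and 4 ranks, each shard holds 2 tokens; the final rank's shift_labels ends in -100 because no next token exists for the last position.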

The adapter expects batch dictionaries containing at minimum input_ids, position_ids, and labels keys. Tensors must have shape [batch_size, seqlen, ...] with sharding on the seqlen (dim=1) dimension. Non-tensor entries are copied unchanged.
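
A pre-flight check of this batch contract might look like the sketch below. It is an illustration only: check_batch and REQUIRED_KEYS are hypothetical names, nested lists stand in for tensors, and the requirement that seqlen be divisible by sp_world_size is an assumption inferred from the even split into S/sp_size tokens per rank.

```python
REQUIRED_KEYS = ("input_ids", "position_ids", "labels")


def check_batch(batch, sp_world_size):
    """Return the per-rank chunk length, or raise ValueError if the batch
    cannot be sharded evenly on the sequence dimension (dim=1)."""
    missing = [k for k in REQUIRED_KEYS if k not in batch]
    if missing:
        raise ValueError(f"batch is missing required keys: {missing}")

    seqlen = len(batch["input_ids"][0])
    for key in REQUIRED_KEYS:
        for row in batch[key]:
            if len(row) != seqlen:
                raise ValueError(
                    f"{key} has a row of length {len(row)}, expected {seqlen}"
                )

    # Assumption: an even split needs seqlen % sp_world_size == 0.
    if seqlen % sp_world_size != 0:
        raise ValueError(
            f"seqlen {seqlen} is not divisible by sp_world_size {sp_world_size}"
        )
    return seqlen // sp_world_size


good = {
    "input_ids": [[1, 2, 3, 4, 5, 6, 7, 8]],
    "position_ids": [[0, 1, 2, 3, 4, 5, 6, 7]],
    "labels": [[1, 2, 3, 4, 5, 6, 7, 8]],
}
chunk_len = check_batch(good, sp_world_size=4)
```

In practice, padding sequences to a multiple of sp_world_size in the collate function is one way to satisfy such a check.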

After sharding, tensors are moved to CPU to minimize GPU memory pressure (critical at very long sequence lengths). The __len__ method returns len(dl) * sp_world_size, since each batch drawn from the wrapped data loader results in sp_world_size micro-batches on every rank (one shard of each gathered batch).
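
The queueing behaviour and the __len__ arithmetic can be seen in a single-process toy. This is a sketch under stated assumptions, not the adapter itself: ToyMicroBatchQueue and fake_all_gather are hypothetical, a flat token list stands in for a batch, and the collective all_gather is faked by a function that pretends the other rank loaded a different batch.

```python
class ToyMicroBatchQueue:
    """Yields sp_world_size micro-batches for every batch pulled from dl."""

    def __init__(self, dl, sp_rank, sp_world_size, fake_all_gather):
        self.dl = dl
        self.iter = iter(dl)
        self.sp_rank = sp_rank
        self.sp_world_size = sp_world_size
        self.fake_all_gather = fake_all_gather  # stand-in for the collective
        self.micro_batches = []

    def __len__(self):
        # One local batch in, sp_world_size micro-batches out.
        return len(self.dl) * self.sp_world_size

    def __iter__(self):
        return self

    def __next__(self):
        if not self.micro_batches:
            local = next(self.iter)  # StopIteration propagates when exhausted
            for batch in self.fake_all_gather(local):
                chunk = len(batch) // self.sp_world_size
                start = chunk * self.sp_rank
                self.micro_batches.append(batch[start:start + chunk])
        return self.micro_batches.pop(0)


def fake_all_gather(local_batch):
    # Pretend rank 1 loaded a different batch (tokens offset by 100).
    return [local_batch, [t + 100 for t in local_batch]]


dl = [list(range(0, 8)), list(range(8, 16))]
adapter = ToyMicroBatchQueue(dl, sp_rank=0, sp_world_size=2,
                             fake_all_gather=fake_all_gather)
expected_len = len(adapter)   # 2 batches * 2 ranks
micro_batches = list(adapter)
```

Rank 0 here receives the first half of its own batch and then the first half of the batch that rank 1 loaded, which is why the number of iterations is multiplied by sp_world_size while each iteration carries only 1/sp_world_size of the tokens.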

Code Reference

Repository: https://github.com/deepspeedai/DeepSpeed
File: deepspeed/runtime/sequence_parallel/ulysses_sp.py
Lines: L487-628

Signature

class UlyssesSPDataLoaderAdapter:
    def __init__(
        self,
        dl: DataLoader,
        sp_rank: int,
        sp_group: ProcessGroup,
        sp_world_size: int,
        device: torch.device,
    )

Import

from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPDataLoaderAdapter

I/O Contract

Inputs

Parameter	Type	Required	Description
dl	DataLoader	Yes	Base data loader yielding full-length sequence batches with `input_ids`, `position_ids`, and `labels` keys
sp_rank	int	Yes	This rank's position within the SP group
sp_group	ProcessGroup	Yes	The sequence-parallel process group
sp_world_size	int	Yes	Number of ranks in the SP group
device	torch.device	Yes	Target CUDA device for communication

Outputs

Output	Type	Description
Iterator	dict	Yields sequence-sharded batch dictionaries; each GPU gets `S/sp_size` tokens. Labels are shifted and stored as `shift_labels`. The original `labels` key is removed.

Usage Example

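import torch

from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPDataLoaderAdapter
import deepspeed.runtime.sequence_parallel.parallel_state_sp as mpu

# Assumes `base_dataloader` (a DataLoader) and `engine` (the DeepSpeed engine)
# already exist and that the SP process groups have been initialized.
sp_group = mpu.get_sequence_parallel_group()
sp_rank = mpu.get_sequence_parallel_rank()
sp_world_size = mpu.get_sequence_parallel_world_size()

# The signature expects a torch.device; torch.cuda.current_device() returns an int index.
device = torch.device("cuda", torch.cuda.current_device())

train_loader = UlyssesSPDataLoaderAdapter(
    dl=base_dataloader,
    sp_rank=sp_rank,
    sp_group=sp_group,
    sp_world_size=sp_world_size,
    device=device,
)

for batch in train_loader:
    # Each micro-batch holds S/sp_size tokens per sample.
    # Keys: input_ids, position_ids, shift_labels, ...
    # Note: 'labels' has been removed and replaced with 'shift_labels'.
    # Sharded tensors arrive on CPU, so move them to the GPU here
    # (this comprehension drops any non-tensor entries).
    batch = {k: v.to(device) for k, v in batch.items() if torch.is_tensor(v)}
    outputs = engine(**batch, use_cache=False)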

Related Pages

Principle:Deepspeedai_DeepSpeed_SP_Data_Preparation

Knowledge Sources

Last updated: 2026-02-09 00:00 GMT
