Implementation:Deepspeedai DeepSpeed UlyssesSPDataLoaderAdapter Init

From Leeroopedia
Revision as of 14:47, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Deepspeedai_DeepSpeed_UlyssesSPDataLoaderAdapter_Init.md)


Overview

A concrete tool, provided by the DeepSpeed library, for adapting data loaders to sequence-parallel training.

Description

UlyssesSPDataLoaderAdapter wraps a standard PyTorch DataLoader and redistributes sequences across sequence-parallel (SP) ranks. When __next__ is called and no micro-batches are buffered, it:

  1. Calls next() on the wrapped data loader to get a batch from the current rank
  2. Gathers all batches across the SP group via all_gather
  3. For each gathered batch, shifts labels for causal language modeling on the full sequence (appends one -100 padding position on the right and drops the first position)
  4. Shards every tensor on the sequence dimension (dim=1) into sp_world_size chunks
  5. Stores the local rank's shard for each batch as a micro-batch
  6. Returns micro-batches one at a time on subsequent __next__ calls
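
The steps above can be sketched without a process group. What follows is a minimal single-process simulation, not DeepSpeed's implementation: plain Python lists stand in for tensors, the all_gather is replaced by a list that already holds every rank's batch, and the helper names (`shift_labels`, `shard`, `refill_for_rank`) are hypothetical. It assumes each sequence length is divisible by `sp_world_size`, and it applies the label shift to the full sequence before slicing so that the last token of one shard is still paired with the first token of the next.

```python
IGNORE_INDEX = -100  # label value ignored by the loss

def shift_labels(labels):
    """Pad one IGNORE_INDEX on the right, then drop the first position."""
    return [row[1:] + [IGNORE_INDEX] for row in labels]

def shard(rows, sp_rank, sp_world_size):
    """Keep this rank's contiguous slice along the sequence dimension (dim=1)."""
    chunk_len = len(rows[0]) // sp_world_size
    start = chunk_len * sp_rank
    return [row[start:start + chunk_len] for row in rows]

def refill_for_rank(gathered_batches, sp_rank, sp_world_size):
    """Turn the batches gathered from all SP ranks into this rank's micro-batches."""
    micro_batches = []
    for batch in gathered_batches:
        seqlen = len(batch["input_ids"][0])
        if seqlen % sp_world_size != 0:
            raise ValueError(f"seqlen {seqlen} is not divisible by {sp_world_size}")
        out = {}
        for key, value in batch.items():
            if key == "labels":
                # Shift on the full sequence first, then slice.
                out["shift_labels"] = shard(shift_labels(value), sp_rank, sp_world_size)
            elif isinstance(value, list):
                out[key] = shard(value, sp_rank, sp_world_size)
            else:
                out[key] = value  # non-tensor entries are copied unchanged
        micro_batches.append(out)
    return micro_batches

# Two SP ranks, each contributing one batch with a single 4-token sequence.
gathered = [
    {"input_ids": [[10, 11, 12, 13]], "position_ids": [[0, 1, 2, 3]],
     "labels": [[10, 11, 12, 13]], "source": "rank0"},
    {"input_ids": [[20, 21, 22, 23]], "position_ids": [[0, 1, 2, 3]],
     "labels": [[20, 21, 22, 23]], "source": "rank1"},
]
rank0 = refill_for_rank(gathered, sp_rank=0, sp_world_size=2)
rank1 = refill_for_rank(gathered, sp_rank=1, sp_world_size=2)
```

With two ranks, rank 0's first micro-batch holds tokens `[10, 11]` with targets `[11, 12]`. The target `12` is a token that lives in rank 1's slice of `input_ids`, which is why this sketch shifts before it shards.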

The adapter expects batch dictionaries containing at minimum input_ids, position_ids, and labels keys. Tensors must have shape [batch_size, seqlen, ...] with sharding on the seqlen (dim=1) dimension. Non-tensor entries are copied unchanged.
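
A pre-flight check on one batch can catch contract violations before training starts. The helper below is a hypothetical sketch and not part of DeepSpeed: it inspects only the `.shape` attribute, so it accepts `torch.Tensor` values or any stand-in object, and it assumes that equal-sized shards require the sequence length to be divisible by `sp_world_size`.

```python
from types import SimpleNamespace

REQUIRED_KEYS = ("input_ids", "position_ids", "labels")

def check_sp_batch(batch, sp_world_size):
    """Return a list of problems found in `batch`; an empty list means it looks usable."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in batch]
    for key, value in batch.items():
        shape = getattr(value, "shape", None)
        if shape is None:
            continue  # non-tensor entries are passed through unchanged
        if len(shape) < 2:
            problems.append(f"{key}: expected [batch_size, seqlen, ...], got {tuple(shape)}")
        elif shape[1] % sp_world_size != 0:
            # Assumption: equal-sized shards require seqlen to be divisible.
            problems.append(f"{key}: seqlen {shape[1]} not divisible by {sp_world_size}")
    return problems

# Stand-ins for tensors: only the .shape attribute is inspected.
good = {k: SimpleNamespace(shape=(2, 8)) for k in REQUIRED_KEYS}
bad = {"input_ids": SimpleNamespace(shape=(2, 7)), "meta": "doc-17"}

good_report = check_sp_batch(good, sp_world_size=4)
bad_report = check_sp_batch(bad, sp_world_size=4)
```

Returning a list of messages instead of raising on the first problem makes it easier to report everything wrong with a collate function in one pass.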

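After sharding, tensors are moved to CPU to minimize GPU memory pressure (critical at very long sequence lengths), so the training loop must move each micro-batch back to the GPU before the forward pass. The __len__ method returns len(dl) * sp_world_size, since each batch drawn from the wrapped loader is expanded into sp_world_size micro-batches.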
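
The length arithmetic follows from the buffering behaviour, which can be illustrated with a small stand-in class. `MicroBatchQueue` is a hypothetical name and performs no communication or sharding: a placeholder tuple stands in for each micro-batch that the gather step would produce.

```python
class MicroBatchQueue:
    """Hypothetical stand-in for the adapter's buffering logic (no communication)."""

    def __init__(self, dl, sp_world_size):
        self.dl = dl
        self.sp_world_size = sp_world_size
        self.iter = iter(dl)
        self.micro_batches = []

    def __len__(self):
        # One batch from the wrapped loader expands into sp_world_size micro-batches.
        return len(self.dl) * self.sp_world_size

    def __iter__(self):
        return self

    def __next__(self):
        if not self.micro_batches:
            batch = next(self.iter)  # raises StopIteration when dl is exhausted
            # Placeholder for gather + shard: one micro-batch per contributing rank.
            # In real use, each entry would come from a different rank's batch.
            self.micro_batches = [(batch, src) for src in range(self.sp_world_size)]
        return self.micro_batches.pop(0)

queue = MicroBatchQueue(dl=["b0", "b1", "b2"], sp_world_size=4)
yielded = list(queue)
```

Three batches with four SP ranks yield twelve micro-batches, matching `len(queue)`. Any scheduler or progress bar that is sized from the wrapped loader's length would need to account for this factor.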

Code Reference

Signature

class UlyssesSPDataLoaderAdapter:
    def __init__(
        self,
        dl: DataLoader,
        sp_rank: int,
        sp_group: ProcessGroup,
        sp_world_size: int,
        device: torch.device,
    )

Import

from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPDataLoaderAdapter

I/O Contract

Inputs

Parameter | Type | Required | Description
dl | DataLoader | Yes | Base data loader yielding full-length sequence batches with input_ids, position_ids, and labels keys
sp_rank | int | Yes | This rank's position within the SP group
sp_group | ProcessGroup | Yes | The sequence-parallel process group
sp_world_size | int | Yes | Number of ranks in the SP group
device | torch.device | Yes | Target CUDA device for communication

Outputs

Output | Type | Description
Iterator | dict | Yields sequence-sharded batch dictionaries; each rank receives S / sp_world_size tokens per sample, where S is the full sequence length. Labels are shifted and stored as shift_labels. The original labels key is removed.

Usage Example

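import torch

from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPDataLoaderAdapter
import deepspeed.runtime.sequence_parallel.parallel_state_sp as mpu

# Assumes base_dataloader (a PyTorch DataLoader) and engine (the DeepSpeed engine)
# already exist, and that the SP process groups have been initialized.
sp_group = mpu.get_sequence_parallel_group()
sp_rank = mpu.get_sequence_parallel_rank()
sp_world_size = mpu.get_sequence_parallel_world_size()

# torch.cuda.current_device() returns an int index; wrap it to match the torch.device signature.
device = torch.device("cuda", torch.cuda.current_device())

train_loader = UlyssesSPDataLoaderAdapter(
    dl=base_dataloader,
    sp_rank=sp_rank,
    sp_group=sp_group,
    sp_world_size=sp_world_size,
    device=device,
)

for batch in train_loader:
    # Each sample in batch holds S / sp_world_size tokens.
    # batch keys: input_ids, position_ids, shift_labels, ...
    # Note: 'labels' has been removed and replaced with 'shift_labels'.
    # Shards arrive on CPU; move tensor entries to the GPU (non-tensor entries are dropped here).
    batch = {k: v.to(device) for k, v in batch.items() if torch.is_tensor(v)}
    outputs = engine(**batch, use_cache=False)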

Last updated: 2026-02-09 00:00 GMT
