Workflow:Deepspeedai DeepSpeed Sequence Parallel Long Context Training

From Leeroopedia


Knowledge Sources
Domains: Distributed_Training, Sequence_Parallelism, Long_Context
Last Updated: 2026-02-09 00:00 GMT

Overview

End-to-end process for training transformer models on extremely long sequences (100K+ tokens) using DeepSpeed Ulysses sequence parallelism, which partitions the sequence dimension across GPUs.

Description

This workflow covers training large language models on very long sequences using DeepSpeed's Ulysses sequence parallelism implementation. Unlike data parallelism (which splits batches across GPUs) or tensor parallelism (which splits the weight matrices inside each layer), sequence parallelism partitions the sequence dimension across GPUs. Each GPU holds a different chunk of the sequence, and all-to-all communication at attention layer boundaries temporarily regroups the data so that each GPU computes attention over the full sequence context for a subset of attention heads. Distributing the quadratic attention memory cost and the per-token activations across multiple GPUs in this way enables training on sequences of 1 million+ tokens. Ulysses can be combined with ZeRO-3 for additional memory savings and supports CPU offloading for resource-constrained environments.

Usage

Execute this workflow when you need to train models on extremely long input sequences that exceed single-GPU memory capacity due to the quadratic memory cost of attention. This is essential for applications requiring long-context understanding: processing entire documents, books, long-form scientific papers, genomic sequences, or multi-turn conversation histories. Use this when the sequence length is the primary memory bottleneck rather than model size.

Execution Steps

Step 1: Mesh Device Configuration

Configure the parallel mesh defining the data-parallel and sequence-parallel dimensions. The mesh determines how GPUs are organized: some GPUs share the same data samples but different sequence chunks (sequence parallel group), while others process different data samples (data parallel group). The product of sequence_parallel_size and data_parallel_size must equal the total number of GPUs.

Key considerations:

  • sequence_parallel_size determines how many GPUs share each sequence
  • data_parallel_size = total_gpus / sequence_parallel_size
  • Higher SP size enables longer sequences but increases communication
  • Mesh can be configured via config JSON or mesh_param argument
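The relationship between the two mesh dimensions can be sketched in plain Python. This is only an illustration: `build_mesh` is a hypothetical helper, not a DeepSpeed function, and it assumes ranks are laid out row-major as (data_parallel, sequence_parallel), so that consecutive ranks fall in the same sequence-parallel group.

```python
def build_mesh(total_gpus: int, sequence_parallel_size: int):
    """Return (data_parallel_size, sp_groups, dp_groups) for a 2-D mesh.

    Assumption (not taken from DeepSpeed): ranks are numbered row-major over
    (data_parallel, sequence_parallel).
    """
    if total_gpus % sequence_parallel_size != 0:
        raise ValueError("total_gpus must be divisible by sequence_parallel_size")
    data_parallel_size = total_gpus // sequence_parallel_size

    # Each row is one sequence-parallel group: same samples, different chunks.
    sp_groups = [
        [dp * sequence_parallel_size + sp for sp in range(sequence_parallel_size)]
        for dp in range(data_parallel_size)
    ]
    # Each column is one data-parallel group: same chunk index, different samples.
    dp_groups = [
        [dp * sequence_parallel_size + sp for dp in range(data_parallel_size)]
        for sp in range(sequence_parallel_size)
    ]
    return data_parallel_size, sp_groups, dp_groups


dp_size, sp_groups, dp_groups = build_mesh(total_gpus=8, sequence_parallel_size=4)
print(dp_size, sp_groups, dp_groups)
```

With 8 GPUs and a sequence-parallel size of 4, this gives two sequence-parallel groups of four ranks and four data-parallel groups of two ranks.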

Step 2: Model Adaptation for Sequence Parallelism

Adapt the model's attention mechanism to work with sequence parallelism. For HuggingFace models, use the UlyssesSPAttentionHF wrapper, which replaces standard attention with sequence-parallel-aware attention built on all-to-all communication. The wrapper handles the all-to-all exchange that follows the QKV projections (sequence chunks in, head subsets out) and the reverse exchange that returns the attention output to sequence chunks.

Key considerations:

  • Replace standard self-attention with UlyssesSPAttentionHF for HuggingFace models
  • The wrapper handles all-to-all communication for QKV redistribution automatically
  • Each GPU computes attention for the full sequence but only on its assigned heads
  • Compatible with flash attention and other optimized attention implementations
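The layout change performed by the two all-to-all exchanges can be simulated on one process with nested lists. This is a toy model of the data movement only, not the UlyssesSPAttentionHF implementation; the function names are hypothetical, and the sketch assumes the number of heads is divisible by the sequence-parallel size.

```python
def seq_to_head_all_to_all(shards):
    """shards[r][s][h]: rank r holds local token s with every head h.

    Returns out[r][s_global][h_local]: rank r holds every token, but only
    for its own slice of heads.
    """
    sp_size = len(shards)
    num_heads = len(shards[0][0])
    if num_heads % sp_size != 0:
        raise ValueError("num_heads must be divisible by the SP size in this sketch")
    heads_per_rank = num_heads // sp_size
    out = []
    for dst in range(sp_size):
        head_lo = dst * heads_per_rank
        full_seq = []
        for src in range(sp_size):  # concatenating in rank order keeps token order
            for token in shards[src]:
                full_seq.append(token[head_lo:head_lo + heads_per_rank])
        out.append(full_seq)
    return out


def head_to_seq_all_to_all(shards):
    """Inverse exchange: back to local tokens with all heads."""
    sp_size = len(shards)
    local_len = len(shards[0]) // sp_size
    out = []
    for dst in range(sp_size):
        tokens = []
        for s in range(dst * local_len, (dst + 1) * local_len):
            merged = []
            for src in range(sp_size):  # heads are reassembled in rank order
                merged.extend(shards[src][s])
            tokens.append(merged)
        out.append(tokens)
    return out


# 2 ranks, 4 tokens in total (2 per rank), 4 heads; values are (token, head) labels.
sp_size, num_heads = 2, 4
local = [
    [[(r * 2 + s, h) for h in range(num_heads)] for s in range(2)]
    for r in range(sp_size)
]
by_head = seq_to_head_all_to_all(local)   # full sequence, 2 heads per rank
restored = head_to_seq_all_to_all(by_head)  # original layout again
```

Between the two exchanges, each simulated rank sees all four tokens for two of the four heads, which is the layout an unmodified attention kernel can consume.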

Step 3: DeepSpeed Initialization with Sequence Parallelism

Initialize DeepSpeed with the mesh configuration for sequence parallelism. Pass the mesh_param or include sequence_parallel_size and data_parallel_size in the configuration. DeepSpeed creates the appropriate communication groups and initializes the mesh device for coordinated sequence-parallel operations.

Key considerations:

  • mesh_param is a tuple: (data_parallel_size, sequence_parallel_size)
  • Alternatively, set data_parallel_size and sequence_parallel_size in the config JSON
  • Can be combined with ZeRO Stage 3 for additional memory optimization
  • CPU offloading (Ulysses-Offload) further extends memory capacity
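A configuration along these lines might look like the sketch below. The `sequence_parallel_size`, `data_parallel_size`, and `mesh_param` names come from the text above; the remaining keys (optimizer, bf16, ZeRO stage) are ordinary DeepSpeed config entries chosen here as placeholder values. `make_ds_config` and `initialize_engine` are hypothetical helpers, and `initialize_engine` is defined but not called, since it needs a distributed launch and has not been checked against a specific DeepSpeed release.

```python
def make_ds_config(total_gpus, sequence_parallel_size, micro_batch_size=1):
    """Build a config dict carrying the mesh sizes (values are placeholders)."""
    if total_gpus % sequence_parallel_size != 0:
        raise ValueError("total_gpus must be divisible by sequence_parallel_size")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "bf16": {"enabled": True},
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
        "zero_optimization": {"stage": 3},
        "sequence_parallel_size": sequence_parallel_size,
        "data_parallel_size": total_gpus // sequence_parallel_size,
    }


def initialize_engine(model, ds_config, mesh_param=None):
    """Sketch only: requires DeepSpeed and a distributed launcher.

    mesh_param, if given, is (data_parallel_size, sequence_parallel_size);
    otherwise the sizes in ds_config are relied on.
    """
    import deepspeed  # imported lazily so the config helper works without it

    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
        mesh_param=mesh_param,
    )
    return engine, optimizer


ds_config = make_ds_config(total_gpus=8, sequence_parallel_size=4)
mesh_param = (ds_config["data_parallel_size"], ds_config["sequence_parallel_size"])
```

Keeping the size check in one helper means the config JSON and the `mesh_param` tuple cannot drift apart.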

Step 4: Long Sequence Data Preparation

Prepare training data with appropriately long sequences. Every GPU in a sequence-parallel group is given the same full sequence, which is then split along the sequence dimension according to the mesh configuration so that each GPU keeps only its own chunk. The data loader must handle sequences of the target length and any padding or truncation needed to make the length divide evenly across the group.

Key considerations:

  • Input sequences are split across the sequence parallel dimension automatically
  • Each GPU processes sequence_length / SP_size tokens
  • Proper positional encoding handling is critical for split sequences
  • Tiled computation from Arctic Long Sequence Training (ALST) can be used for sequences exceeding even SP capacity
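The splitting and the positional bookkeeping can be illustrated with a small helper. This is a sketch of the idea rather than DeepSpeed's data-loading code: `shard_sequence` and `pad_id` are hypothetical, and right-padding to a multiple of the SP size is an assumption made for the example.

```python
def shard_sequence(input_ids, sp_rank, sp_size, pad_id=0):
    """Return (local_ids, local_position_ids) for one sequence-parallel rank."""
    input_ids = list(input_ids)
    # Right-pad so the length divides evenly across the SP group (assumption).
    remainder = len(input_ids) % sp_size
    if remainder:
        input_ids += [pad_id] * (sp_size - remainder)

    chunk = len(input_ids) // sp_size
    start = sp_rank * chunk
    local_ids = input_ids[start:start + chunk]
    # Position ids stay global, so each chunk knows where it sits in the sequence.
    local_position_ids = list(range(start, start + chunk))
    return local_ids, local_position_ids


tokens = list(range(100, 110))  # a 10-token toy sequence
shards = [shard_sequence(tokens, rank, sp_size=4) for rank in range(4)]
for rank, (ids, pos) in enumerate(shards):
    print(rank, ids, pos)
```

Here each of the four ranks holds 3 tokens after padding to length 12, and the last rank's position ids start at 9 rather than 0, which is the point of the positional-encoding bullet above.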

Step 5: Training with Sequence Parallelism

Execute the training loop where each forward pass involves all-to-all communication at attention layer boundaries. Before attention, the data is redistributed so each GPU holds the full sequence for a subset of heads and computes full attention on those heads; afterwards, a second all-to-all returns the output to per-GPU sequence chunks. The backward pass follows the same communication pattern in reverse.

Key considerations:

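  • All-to-all communication happens at two points per attention layer: before attention (for Q, K, and V) and after attention (for the output)
  • Total communication volume scales linearly with sequence length, and the per-GPU share shrinks as SP size grows (cited as an advantage over ring attention)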
  • Gradient synchronization happens across both SP and DP groups
  • Monitor communication overhead vs computation to optimize SP size
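A back-of-the-envelope model can help when weighing SP size against communication. The sketch below follows the estimate usually quoted for Ulysses: an aggregate of 3·N·h elements for Q, K, and V plus N·h for the attention output, divided by the SP size to get the per-GPU share. It assumes standard multi-head attention (K and V the same size as Q), and it ignores latency, topology, and overlap with compute. The function name and the example sizes are hypothetical.

```python
def all_to_all_elements_per_gpu(seq_len, hidden_size, sp_size):
    """Rough per-GPU element count moved per attention layer (forward pass).

    Assumption: aggregate message is 3*N*h (Q, K, V) + N*h (attention output),
    shared evenly across sp_size GPUs.
    """
    aggregate = 4 * seq_len * hidden_size
    return aggregate / sp_size


base = all_to_all_elements_per_gpu(seq_len=131_072, hidden_size=4096, sp_size=8)
scaled = all_to_all_elements_per_gpu(seq_len=262_144, hidden_size=4096, sp_size=16)
longer_only = all_to_all_elements_per_gpu(seq_len=262_144, hidden_size=4096, sp_size=8)

print(base, scaled, longer_only)
```

Under this model, doubling the sequence length and the SP size together leaves the per-GPU volume unchanged, while doubling the sequence length alone doubles it. Measured step times should take precedence over this estimate.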

Step 6: Long Context Evaluation and Deployment

Evaluate the trained model on long-context benchmarks and prepare for deployment. For inference, the model can be used with or without sequence parallelism depending on the deployment hardware and target sequence lengths.

Key considerations:

  • Validate long-context capability with appropriate benchmarks
  • For deployment, consider whether inference also needs sequence parallelism
  • Checkpoints are saved in standard format compatible with non-SP loading (when ZeRO-3 is enabled, the sharded ZeRO checkpoint may still need to be consolidated first)
  • Per-sequence-position metrics help verify attention patterns across the full context
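One simple per-position metric is the mean token loss within fixed-size position buckets. The sketch below assumes the per-token losses for a sequence have already been collected into a single list in token order (with sequence parallelism, that would mean concatenating each rank's chunk in rank order). `mean_loss_by_position_bucket` is a hypothetical helper and the loss values are made up for illustration.

```python
def mean_loss_by_position_bucket(token_losses, bucket_size):
    """Return [(bucket_start_position, mean_loss), ...] over the sequence."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    buckets = []
    for start in range(0, len(token_losses), bucket_size):
        window = token_losses[start:start + bucket_size]
        buckets.append((start, sum(window) / len(window)))
    return buckets


# Made-up per-token losses for a 7-token sequence.
losses = [2.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.5]
summary = mean_loss_by_position_bucket(losses, bucket_size=2)
print(summary)
```

A loss that stops improving, or rises, in the later buckets is one sign that the model is not using the full context.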
