Principle:Deepspeedai DeepSpeed Sequence Parallel Attention
Overview
Distributing attention computation across GPUs by partitioning the sequence dimension, using all-to-all communication to redistribute between sequence-split and head-split representations.
Detailed Description
Ulysses sequence parallelism splits the input sequence across SP GPUs. Before attention, each GPU holds a chunk of the full sequence for all attention heads. An all-to-all communication redistributes this to: each GPU holds the full sequence for a subset of attention heads. After computing attention on the full sequence (per-head), another all-to-all redistributes back to sequence-partitioned form. This enables computing exact (non-approximate) attention on sequences much longer than a single GPU can handle.
The key insight is that attention is computed independently per head. By redistributing from sequence-partitioned to head-partitioned form, each GPU computes standard attention on the full sequence length but only for its assigned subset of heads. This avoids the approximation errors introduced by methods that partition attention across sequence chunks (such as ring attention), while keeping memory usage proportional to S/P per GPU.
The implementation handles modern attention variants including:
- Multi-Head Attention (MHA): Standard case where Q, K, V all have the same number of heads.
- Grouped-Query Attention (GQA): K and V have fewer heads than Q. When
sp_size > kv_head_count, KV heads are replicated to ensure each GPU receives at least one KV head. - Multi-Query Attention (MQA): A special case of GQA with a single KV head.
Theoretical Basis
For sequence length S across P GPUs:
- Each GPU initially holds
S/Ptokens with all H heads: shape[S/P, B, H, D] - All-to-all transforms to: each GPU holds all S tokens but only
H/Pheads: shape[S, B, H/P, D] - Standard attention is computed (exact, no approximation) per head on the full sequence
- All-to-all transforms back to sequence-partitioned form: shape
[S/P, B, H, D]
Communication cost: O(S * H * D / P) per all-to-all, which scales linearly with sequence length.
Mathematical Formulation
Attention is computed per-head on the full sequence:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
This is computed on the full sequence for each head assigned to the GPU. Total memory per GPU for activations: O(S * H * D / P).
KV Head Replication for GQA/MQA
When sp_size > kv_head_count, a replication factor r = sp_size / kv_head_count is computed. KV heads are replicated r times via repeat_interleave before the all-to-all, ensuring each GPU receives at least one KV head after redistribution.
Reference
- DeepSpeed-Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models (https://arxiv.org/abs/2309.14509)
- Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences (https://arxiv.org/abs/2506.13996)
Related Pages
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/
- https://arxiv.org/abs/2309.14509
- https://arxiv.org/abs/2506.13996
Last updated: 2026-02-09 00:00 GMT