Principle:Deepspeedai DeepSpeed Sequence Parallel Attention

Overview

Distributing attention computation across GPUs by partitioning the sequence dimension, using all-to-all communication to redistribute between sequence-split and head-split representations.

Detailed Description

Ulysses sequence parallelism splits the input sequence across SP GPUs. Before attention, each GPU holds a chunk of the full sequence for all attention heads. An all-to-all communication redistributes this to: each GPU holds the full sequence for a subset of attention heads. After computing attention on the full sequence (per-head), another all-to-all redistributes back to sequence-partitioned form. This enables computing exact (non-approximate) attention on sequences much longer than a single GPU can handle.

The key insight is that attention is computed independently per head. By redistributing from sequence-partitioned to head-partitioned form, each GPU computes standard attention on the full sequence length but only for its assigned subset of heads. This avoids the approximation errors introduced by methods that partition attention across sequence chunks (such as ring attention), while keeping memory usage proportional to S/P per GPU.

The implementation handles modern attention variants including:

Multi-Head Attention (MHA): Standard case where Q, K, V all have the same number of heads.
Grouped-Query Attention (GQA): K and V have fewer heads than Q. When sp_size > kv_head_count, KV heads are replicated to ensure each GPU receives at least one KV head.
Multi-Query Attention (MQA): A special case of GQA with a single KV head.

Theoretical Basis

For sequence length S across P GPUs:

Each GPU initially holds S/P tokens with all H heads: shape [S/P, B, H, D]
All-to-all transforms to: each GPU holds all S tokens but only H/P heads: shape [S, B, H/P, D]
Standard attention is computed (exact, no approximation) per head on the full sequence
All-to-all transforms back to sequence-partitioned form: shape [S/P, B, H, D]

Communication cost: O(S * H * D / P) per all-to-all, which scales linearly with sequence length.

Mathematical Formulation

Attention is computed per-head on the full sequence:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V

This is computed on the full sequence for each head assigned to the GPU. Total memory per GPU for activations: O(S * H * D / P).

KV Head Replication for GQA/MQA

When sp_size > kv_head_count, a replication factor r = sp_size / kv_head_count is computed. KV heads are replicated r times via repeat_interleave before the all-to-all, ensuring each GPU receives at least one KV head after redistribution.

Reference

DeepSpeed-Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models (https://arxiv.org/abs/2309.14509)
Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences (https://arxiv.org/abs/2506.13996)

Related Pages

Implementation:Deepspeedai_DeepSpeed_UlyssesSPAttentionHF_Init

Knowledge Sources

Last updated: 2026-02-09 00:00 GMT

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment