Heuristic:Deepspeedai DeepSpeed Sequence Parallel PyTorch Version
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Sequence_Parallelism |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
DeepSpeed Ulysses Sequence Parallelism should be run with PyTorch >= 2.3. On older versions, the backward pass may hit rank indexing errors when `sp_size < world_size`; DeepSpeed logs a warning in this case rather than refusing to run.
Description
DeepSpeed's Ulysses Sequence Parallelism (SP) splits each sequence across GPUs and uses all-to-all communication to redistribute the data for attention computation. With PyTorch versions older than 2.3, a rank indexing bug can occur during the backward pass when the sequence parallel size is smaller than the total world size (i.e., when data parallelism and sequence parallelism are both active). It manifests as incorrect gradient routing between ranks and is caused by how PyTorch < 2.3 handles process group indexing in autograd.
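To make the all-to-all step concrete, here is a minimal pure-Python simulation of the data movement, not DeepSpeed's implementation. It assumes each rank starts with a contiguous slice of the sequence for all attention heads and ends with the full sequence for a slice of the heads. The function name `ulysses_all_to_all` and the nested-list layout are hypothetical and chosen only for illustration.

```python
def ulysses_all_to_all(shards):
    """Simulate a sequence-to-head all-to-all on plain Python lists.

    shards[r][t][h] is the value rank r holds for its local token t and head h.
    The result out[r][t][h] is indexed by global token t and rank-local head h.
    """
    sp_size = len(shards)
    num_heads = len(shards[0][0])
    if num_heads % sp_size != 0:
        raise ValueError("num_heads must be divisible by sp_size")
    heads_per_rank = num_heads // sp_size

    out = []
    for dst in range(sp_size):
        head_lo = dst * heads_per_rank
        head_hi = head_lo + heads_per_rank
        full_sequence = []
        # Visit source ranks in order so the global token order is preserved.
        for src in range(sp_size):
            for token in shards[src]:
                full_sequence.append(token[head_lo:head_hi])
        out.append(full_sequence)
    return out


# 2 ranks, 4 tokens in total (2 per rank), 2 heads; entries read "t<token>h<head>".
shards = [
    [["t0h0", "t0h1"], ["t1h0", "t1h1"]],  # rank 0 holds tokens 0-1, all heads
    [["t2h0", "t2h1"], ["t3h0", "t3h1"]],  # rank 1 holds tokens 2-3, all heads
]
result = ulysses_all_to_all(shards)
```

After the exchange, rank 0 holds head 0 for all four tokens and rank 1 holds head 1 for all four tokens. In the backward pass the same exchange runs in the opposite direction, which is where correct rank addressing within the SP group matters.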
Usage
Use this heuristic when configuring Ulysses Sequence Parallelism for long-context training. If you must use PyTorch < 2.3, apply the weighted all-reduce workaround shown in the DeepSpeed regression test `tests/unit/sequence_parallelism/test_ulysses.py`. Otherwise, upgrade to PyTorch >= 2.3 to avoid the issue entirely.
The Insight (Rule of Thumb)
- Action: Use PyTorch >= 2.3 for Ulysses Sequence Parallelism. If on older PyTorch, apply the weighted all-reduce workaround.
- Value: Minimum PyTorch version 2.3 for correct SP backward pass.
- Trade-off: Upgrading PyTorch may introduce other compatibility issues; weigh against the workaround complexity.
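The rule above can be turned into a small pre-flight check. The sketch below is an illustration rather than DeepSpeed code: `needs_sp_workaround` and `parse_major_minor` are hypothetical helpers, and the version string is passed in as an argument (in practice it would typically come from `torch.__version__`). Note that it applies the narrower `sp_size < world_size` condition from this heuristic, whereas the DeepSpeed engine warns whenever sequence parallelism is enabled on PyTorch < 2.3.

```python
import re

# Minimum (major, minor) PyTorch version taken from the heuristic above.
MIN_TORCH_FOR_SP = (2, 3)


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '2.2.1' or '2.3.0+cu121'."""
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_sp_workaround(torch_version, sp_size, world_size):
    """Return True when the risky combination described above applies."""
    # Sequence parallelism disabled, or the SP group spans every rank.
    if sp_size <= 1 or sp_size >= world_size:
        return False
    return parse_major_minor(torch_version) < MIN_TORCH_FOR_SP


if needs_sp_workaround("2.2.1", sp_size=2, world_size=8):
    print("Upgrade to PyTorch >= 2.3 or apply the weighted all-reduce workaround.")
```

Comparing integer tuples rather than raw strings keeps versions such as `2.10.0` ordered correctly relative to `2.3`.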
Reasoning
The Ulysses SP implementation uses `torch.distributed` all-to-all operations within sequence parallel groups. In the backward pass, autograd needs to correctly route gradients back through these operations. PyTorch < 2.3 had a bug in how it handled rank indexing within sub-groups, causing gradients to be sent to wrong ranks when `sp_size < world_size`. PyTorch 2.3 fixed the underlying process group indexing in autograd, resolving this issue.
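The distinction between a rank's global index and its index inside a sub-group can be seen with a few lines of plain Python. This sketch assumes sequence-parallel groups are formed from consecutive global ranks, which is an assumption made for illustration; the actual layout depends on how the process groups are created. The helper names are hypothetical, and the code only shows why the two indices diverge; it does not reproduce the PyTorch bug.

```python
def sequence_parallel_groups(world_size, sp_size):
    """Partition global ranks into consecutive groups of size sp_size (assumed layout)."""
    if world_size % sp_size != 0:
        raise ValueError("world_size must be divisible by sp_size")
    return [list(range(start, start + sp_size))
            for start in range(0, world_size, sp_size)]


def group_local_rank(global_rank, sp_size):
    """Index of a global rank inside its own group under the assumed layout."""
    return global_rank % sp_size


# sp_size < world_size: two SP groups, so data parallelism is also active.
groups = sequence_parallel_groups(world_size=4, sp_size=2)

# Global ranks whose index inside their group differs from their global index.
mismatched = [rank for rank in range(4) if group_local_rank(rank, 2) != rank]

# sp_size == world_size: one group, so every local index equals the global rank.
single_group_mismatched = [rank for rank in range(4)
                           if group_local_rank(rank, 4) != rank]
```

With `world_size=4` and `sp_size=2`, global ranks 2 and 3 sit at positions 0 and 1 of the second group, so any code path that confuses the two index spaces addresses the wrong peer. When `sp_size == world_size` the two index spaces coincide, which is consistent with the issue appearing only when `sp_size < world_size`.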
Code Evidence
PyTorch version check from `deepspeed/runtime/engine.py:1468-1476`:
    if self.sequence_parallel_size > 1:
        # Inserted Warning for PyTorch < 2.3
        if not required_torch_version(min_version=2.3):
            logger.warning(
                "DeepSpeed Sequence Parallelism (Ulysses) with PyTorch < 2.3 may encounter "
                "rank indexing errors during backward pass when sp_size < world_size. "
                "Please use the weighted all-reduce workaround shown in the regression test "
                "(https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/"
                "sequence_parallelism/test_ulysses.py) "
                "or upgrade to PyTorch 2.3+.")