
Heuristic:Deepspeedai DeepSpeed Sequence Parallel PyTorch Version

From Leeroopedia
Revision as of 10:55, 16 February 2026 by Admin (talk | contribs) (Auto-imported from heuristics/Deepspeedai_DeepSpeed_Sequence_Parallel_PyTorch_Version.md)



Knowledge Sources
Domains Distributed_Training, Sequence_Parallelism
Last Updated 2026-02-09 00:00 GMT

Overview

DeepSpeed Ulysses Sequence Parallelism requires PyTorch >= 2.3 to avoid rank indexing errors during the backward pass when `sp_size < world_size`.

Description

DeepSpeed's Ulysses Sequence Parallelism (SP) splits sequences across GPUs using all-to-all communication for attention computation. When used with PyTorch versions older than 2.3, a rank indexing bug can occur during the backward pass, specifically when the sequence parallel size is smaller than the total world size (i.e., when both data parallelism and sequence parallelism are active). This manifests as incorrect gradient routing between ranks. The issue is caused by how PyTorch < 2.3 handles process group indexing in autograd.
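To make the `sp_size < world_size` case concrete, the following minimal sketch shows how a rank's global index and its index inside a sequence parallel group diverge once there is more than one group. It assumes contiguous group assignment (ranks 0..sp_size-1 form the first group, and so on); the helper names `sp_groups` and `local_rank_in_group` are hypothetical and are not part of DeepSpeed or PyTorch.

```python
from __future__ import annotations


def sp_groups(world_size: int, sp_size: int) -> list[list[int]]:
    """Partition global ranks into sequence-parallel groups.

    Assumption: groups are contiguous blocks of global ranks.
    """
    if world_size % sp_size != 0:
        raise ValueError("world_size must be divisible by sp_size")
    return [list(range(start, start + sp_size))
            for start in range(0, world_size, sp_size)]


def local_rank_in_group(global_rank: int, sp_size: int) -> int:
    """Index of a global rank inside its own SP group (contiguous layout assumed)."""
    return global_rank % sp_size


# 8 GPUs with sp_size=4: two SP groups, i.e. data parallelism of degree 2.
print(sp_groups(world_size=8, sp_size=4))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
# Global rank 6 sits at position 2 within the second group.
print(local_rank_in_group(6, sp_size=4))    # 2
```

When `sp_size == world_size` there is a single group and the two indices coincide for every rank, which is consistent with the problem only being reported for `sp_size < world_size`.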

Usage

Use this heuristic when configuring Ulysses Sequence Parallelism for long-context training. If you must use PyTorch < 2.3, apply the weighted all-reduce workaround shown in the DeepSpeed regression test `tests/unit/sequence_parallelism/test_ulysses.py`. Otherwise, upgrade to PyTorch >= 2.3 to avoid the issue entirely.

The Insight (Rule of Thumb)

  • Action: Use PyTorch >= 2.3 for Ulysses Sequence Parallelism. If on older PyTorch, apply the weighted all-reduce workaround.
  • Value: Minimum PyTorch version 2.3 for correct SP backward pass.
  • Trade-off: Upgrading PyTorch may introduce other compatibility issues; weigh against the workaround complexity.
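The rule above can be expressed as a small pre-flight check in a training script. This is a sketch under stated assumptions, not DeepSpeed's own logic: `ulysses_needs_workaround` is a hypothetical helper, and it takes the version as a string (for example `torch.__version__`) so that it can be exercised without a GPU environment.

```python
import re


def ulysses_needs_workaround(torch_version: str, sp_size: int, world_size: int) -> bool:
    """Return True when the configuration falls in the range this heuristic warns about.

    Hypothetical helper: flags PyTorch < 2.3 combined with 1 < sp_size < world_size.
    """
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {torch_version!r}")
    # Compare (major, minor) as integers so that "2.10" sorts after "2.3".
    major_minor = (int(match.group(1)), int(match.group(2)))
    uses_sp_subgroups = 1 < sp_size < world_size
    return uses_sp_subgroups and major_minor < (2, 3)


print(ulysses_needs_workaround("2.2.1+cu118", sp_size=2, world_size=8))  # True
print(ulysses_needs_workaround("2.3.0", sp_size=2, world_size=8))        # False
```

In a real script the first argument would be `torch.__version__`; a `True` result means either upgrading PyTorch or applying the weighted all-reduce workaround before training.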

Reasoning

The Ulysses SP implementation uses `torch.distributed` all-to-all operations within sequence parallel groups. In the backward pass, autograd needs to correctly route gradients back through these operations. PyTorch < 2.3 had a bug in how it handled rank indexing within sub-groups, causing gradients to be sent to wrong ranks when `sp_size < world_size`. PyTorch 2.3 fixed the underlying process group indexing in autograd, resolving this issue.
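The role of rank indices in the backward pass can be illustrated with a single-process simulation of all-to-all. The sketch below is not DeepSpeed code and does not call `torch.distributed`; `all_to_all` here is a hypothetical pure-Python stand-in that models the exchange as a transpose over group-local rank indices. It shows that applying the exchange a second time within the same group returns every chunk to its sender, which is the property gradient routing relies on.

```python
def all_to_all(buffers):
    """Simulate an all-to-all inside one group.

    buffers[src][dst] is the chunk that group-local rank `src` sends to `dst`.
    The result out[dst][src] is the chunk that `dst` received from `src`.
    """
    n = len(buffers)
    return [[buffers[src][dst] for src in range(n)] for dst in range(n)]


# Three ranks in one SP group; each chunk is labelled with its route.
sent = [[f"r{src}->r{dst}" for dst in range(3)] for src in range(3)]
received = all_to_all(sent)
print(received[1])  # ['r0->r1', 'r1->r1', 'r2->r1']

# A second all-to-all over the same indices undoes the first one.
assert all_to_all(received) == sent
```

If the return exchange were indexed differently from the forward one (for example by global rank instead of group-local rank), chunks would no longer land on the rank that originally produced them, which matches the incorrect gradient routing described above.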

Code Evidence

PyTorch version check from `deepspeed/runtime/engine.py:1468-1476`:

if self.sequence_parallel_size > 1:
    # Inserted Warning for PyTorch < 2.3
    if not required_torch_version(min_version=2.3):
        logger.warning(
            "DeepSpeed Sequence Parallelism (Ulysses) with PyTorch < 2.3 may encounter "
            "rank indexing errors during backward pass when sp_size < world_size. "
            "Please use the weighted all-reduce workaround shown in the regression test "
            "(https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/"
            "sequence_parallelism/test_ulysses.py) "
            "or upgrade to PyTorch 2.3+.")
