Implementation:Deepspeedai DeepSpeed UlyssesSPAttentionHF Init

Overview

A concrete tool, provided by the DeepSpeed library, for wrapping HuggingFace attention layers with Ulysses sequence-parallel all-to-all communication.

Description

UlyssesSPAttentionHF wraps HuggingFace transformer attention layers to add all-to-all communication for sequence parallelism. It can be applied manually, by wrapping individual attention functions, or automatically via register_with_transformers(), which patches all attention layers in a HuggingFace model.

The class supports GQA/MQA (grouped-query and multi-query attention) through separate Q and KV head counts, and it also supports variable-length sequences. When register_with_transformers() is called, it:

1. Initializes sequence-parallel process groups via parallel_state_sp.initialize_sequence_parallel()
2. Reads the model configuration (head counts, head size, number of layers) from the HuggingFace config
3. Creates a UlyssesSPAttentionHF instance wrapping the specified core attention implementation
4. Overrides all entries in HuggingFace's ALL_ATTENTION_FUNCTIONS registry with the Ulysses wrapper, so that every attention layer in the model uses SP communication
5. Returns an mpu object providing get_sequence_parallel_world_size(), get_sequence_parallel_group(), and related methods

This approach (overriding all attention function entries rather than registering a new one) ensures that HuggingFace's internal branching logic for specific attention implementations (e.g., special handling for flash_attention_2) is preserved.
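
The override pattern described above can be sketched with a plain dictionary standing in for the registry. This is an illustration only, not DeepSpeed's code: `registry`, `make_sp_wrapper`, `override_all`, and the toy attention functions are hypothetical names, and the real wrapper performs all-to-all communication where this sketch merely tags the result.

```python
from typing import Callable, Dict

# Hypothetical stand-in for HuggingFace's ALL_ATTENTION_FUNCTIONS registry.
registry: Dict[str, Callable[[str], str]] = {
    "flash_attention_2": lambda x: f"fa2({x})",
    "sdpa": lambda x: f"sdpa({x})",
}


def make_sp_wrapper(core_attn: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a core attention function (assumed interface: str -> str)."""

    def wrapped(x: str) -> str:
        # A real wrapper would redistribute q/k/v across ranks here,
        # call the core attention, then redistribute the output back.
        return f"sp[{core_attn(x)}]"

    return wrapped


def override_all(reg: Dict[str, Callable[[str], str]], core_attn_implementation: str) -> None:
    """Point every existing key at one wrapper around the chosen core function."""
    # Look up the core function before any entry is overwritten.
    wrapper = make_sp_wrapper(reg[core_attn_implementation])
    for key in list(reg):
        reg[key] = wrapper


override_all(registry, "sdpa")
result = registry["flash_attention_2"]("q")
```

Because the keys are kept and only the values change, code that branches on the key name (for example on "flash_attention_2") still takes the same path, while every lookup reaches the same wrapper.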

Code Reference

Repository: https://github.com/deepspeedai/DeepSpeed
File: deepspeed/runtime/sequence_parallel/ulysses_sp.py
Lines: L49-484 (class), L90-160 (__init__), L355-484 (register_with_transformers)

Direct Init Signature

UlyssesSPAttentionHF(
    attn,                          # core attention function
    batch_size: int,               # micro batch size
    attn_head_count: int,          # total number of Q attention heads
    attn_head_size: int,           # size of each attention head
    kv_head_count: int,            # total number of KV heads
    num_hidden_layers: int,        # total number of transformer layers
    process_group: ProcessGroup,   # Ulysses SP process group
    seq_length_is_variable: bool = False,
    local_seq_length: int = None,
    global_seq_length: int = None,
    disable_in_eval: bool = False,
)
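
To make the role of the head-count and sequence-length parameters more concrete, the following sketch computes the per-rank tensor shapes before and after a Ulysses-style all-to-all, which trades a sequence shard with all heads for the full sequence with a shard of the heads. The function name `ulysses_shapes`, the `(seq, batch, heads, head_size)` layout, and the requirement that both head counts divide evenly by the SP world size are assumptions made for illustration; the library may lay out tensors differently and may handle fewer KV heads than ranks in its own way.

```python
from typing import Dict, Tuple

Shape = Tuple[int, int, int, int]  # assumed layout: (seq, batch, heads, head_size)


def ulysses_shapes(
    local_seq_length: int,
    batch_size: int,
    attn_head_count: int,
    kv_head_count: int,
    attn_head_size: int,
    sp_world_size: int,
) -> Dict[str, Shape]:
    """Return per-rank Q and KV shapes before and after the all-to-all."""
    if attn_head_count % sp_world_size or kv_head_count % sp_world_size:
        # Simplifying assumption of this sketch, not a documented library rule.
        raise ValueError("head counts must be divisible by the SP world size")

    global_seq_length = local_seq_length * sp_world_size
    return {
        # Before: each rank holds a slice of the sequence and all heads.
        "q_before": (local_seq_length, batch_size, attn_head_count, attn_head_size),
        "kv_before": (local_seq_length, batch_size, kv_head_count, attn_head_size),
        # After: each rank holds the whole sequence and a slice of the heads.
        "q_after": (global_seq_length, batch_size, attn_head_count // sp_world_size, attn_head_size),
        "kv_after": (global_seq_length, batch_size, kv_head_count // sp_world_size, attn_head_size),
    }


shapes = ulysses_shapes(
    local_seq_length=1024,
    batch_size=1,
    attn_head_count=32,
    kv_head_count=8,
    attn_head_size=128,
    sp_world_size=4,
)
```

With these example values, each rank starts with 1024 tokens and 32 Q heads and ends with 4096 tokens and 8 Q heads (2 KV heads), so the element count per rank is unchanged.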

Auto-Registration Signature

UlyssesSPAttentionHF.register_with_transformers(
    model_name_or_path,            # model object, HF hub name, or local path
    core_attn_implementation,      # 'flash_attention_2' | 'flash_attention_3' | 'sdpa'
    sequence_parallel_size: int,
    micro_batch_size: int,
    seq_length: int = None,
    seq_length_is_variable: bool = True,
    disable_in_eval: bool = False,
) -> mpu

Import

from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPAttentionHF

I/O Contract

Inputs (register_with_transformers)

Parameter	Type	Required	Description
model_name_or_path	str or model object	Yes	HuggingFace model name, local path, or pre-loaded model object
core_attn_implementation	str	Yes	One of `'flash_attention_2'`, `'flash_attention_3'`, or `'sdpa'`
sequence_parallel_size	int	Yes	Number of GPUs in each SP group
micro_batch_size	int	Yes	Micro batch size per GPU
seq_length	int	No	Fixed sequence length (required if `seq_length_is_variable=False`)
seq_length_is_variable	bool	No	Whether sequence length varies between batches (default: `True`)
disable_in_eval	bool	No	Skip SP operations during eval mode (default: `False`)

Outputs

Output	Type	Description
mpu	object	Model parallel unit providing `get_sequence_parallel_world_size()`, `get_sequence_parallel_group()`, `get_sequence_parallel_rank()`, and related methods
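
As a rough illustration of how the rank and world-size getters might be consumed, the sketch below slices one sequence into the contiguous shard owned by a given SP rank. `FakeMPU` and `local_shard` are hypothetical names, and the contiguous, evenly divisible split is an assumption of this example rather than a statement about how DeepSpeed shards or pads inputs.

```python
from typing import List


class FakeMPU:
    """Hypothetical stand-in exposing two of the getters listed above."""

    def __init__(self, rank: int, world_size: int) -> None:
        self._rank = rank
        self._world_size = world_size

    def get_sequence_parallel_rank(self) -> int:
        return self._rank

    def get_sequence_parallel_world_size(self) -> int:
        return self._world_size


def local_shard(tokens: List[int], mpu: FakeMPU) -> List[int]:
    """Return the contiguous slice of `tokens` owned by this SP rank."""
    world_size = mpu.get_sequence_parallel_world_size()
    rank = mpu.get_sequence_parallel_rank()
    if len(tokens) % world_size != 0:
        # Padding strategy is out of scope for this sketch.
        raise ValueError("sequence length must be divisible by the SP world size")
    chunk = len(tokens) // world_size
    return tokens[rank * chunk:(rank + 1) * chunk]


# Eight tokens split across four SP ranks, two tokens per rank.
shards = [local_shard(list(range(8)), FakeMPU(rank, 4)) for rank in range(4)]
```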

Usage Example

import deepspeed
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPAttentionHF

# Auto-register with all attention layers in a HuggingFace model
mpu = UlyssesSPAttentionHF.register_with_transformers(
    model_name_or_path="meta-llama/Llama-2-7b-hf",
    core_attn_implementation="flash_attention_2",
    sequence_parallel_size=4,
    micro_batch_size=1,
    seq_length_is_variable=True,
    disable_in_eval=True,
)

# The mpu object is then passed to deepspeed.initialize().
# `model` and `ds_config` are assumed to be created elsewhere.
engine, _, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config,
    mesh_param=(2, 4),
    mpu=mpu,
)
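
The example above pairs sequence_parallel_size=4 with mesh_param=(2, 4), which reads as 8 GPUs arranged as 2 data-parallel replicas of 4 SP ranks each. The sketch below shows one way such groups could be enumerated. The helper name `sequence_parallel_groups` and the contiguous rank assignment are assumptions for illustration; the actual grouping is decided by DeepSpeed's process-group initialization.

```python
from typing import List


def sequence_parallel_groups(world_size: int, sequence_parallel_size: int) -> List[List[int]]:
    """Partition global ranks into contiguous SP groups (assumed layout)."""
    if world_size % sequence_parallel_size != 0:
        raise ValueError("world size must be divisible by sequence_parallel_size")
    return [
        list(range(start, start + sequence_parallel_size))
        for start in range(0, world_size, sequence_parallel_size)
    ]


groups = sequence_parallel_groups(world_size=8, sequence_parallel_size=4)
data_parallel_size = len(groups)  # number of SP groups, i.e. model replicas
```

Under this assumed layout, ranks 0 to 3 cooperate on one sequence and ranks 4 to 7 on another, so the two groups behave like two data-parallel replicas.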

Related Pages

Principle:Deepspeedai_DeepSpeed_Sequence_Parallel_Attention

Knowledge Sources

Last updated: 2026-02-09 00:00 GMT
