Implementation:Deepspeedai DeepSpeed UlyssesSPAttentionHF Init

From Leeroopedia
Revision as of 14:47, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Deepspeedai_DeepSpeed_UlyssesSPAttentionHF_Init.md)

Overview

A concrete tool, provided by the DeepSpeed library, that wraps HuggingFace attention layers with Ulysses sequence-parallel all-to-all communication.

Description

UlyssesSPAttentionHF wraps HuggingFace transformer attention layers to add all-to-all communication for sequence parallelism. It can be applied manually, by wrapping individual attention functions, or automatically via register_with_transformers(), which patches all attention layers in a HuggingFace model.
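To make the all-to-all step concrete, the following is a minimal sketch of the per-rank tensor layout before and after the exchange. It is shape arithmetic only, not DeepSpeed code: the function name ulysses_shapes is hypothetical, the divisibility checks are assumptions, and the rule that each rank keeps at least one KV head when there are fewer KV heads than ranks is an assumption about how GQA/MQA could be handled.

```python
def ulysses_shapes(global_seq_len, q_heads, kv_heads, head_size, sp_size):
    """Return per-rank (seq, heads, head_size) shapes before and after the all-to-all.

    Hypothetical helper for illustration; not part of DeepSpeed.
    """
    if global_seq_len % sp_size != 0:
        raise ValueError("global_seq_len must be divisible by sp_size")
    if q_heads % sp_size != 0:
        raise ValueError("q_heads must be divisible by sp_size")

    local_seq = global_seq_len // sp_size

    # Before the all-to-all: each rank holds a slice of the sequence
    # but every attention head.
    before = {
        "q": (local_seq, q_heads, head_size),
        "kv": (local_seq, kv_heads, head_size),
    }

    # After the all-to-all: each rank holds the full sequence
    # but only a slice of the heads.
    # Assumption: with fewer KV heads than ranks, each rank keeps one KV head.
    local_kv_heads = max(kv_heads // sp_size, 1)
    after = {
        "q": (global_seq_len, q_heads // sp_size, head_size),
        "kv": (global_seq_len, local_kv_heads, head_size),
    }
    return before, after


before, after = ulysses_shapes(
    global_seq_len=8192, q_heads=32, kv_heads=8, head_size=128, sp_size=4
)
print(before["q"], "->", after["q"])
print(before["kv"], "->", after["kv"])
```

The core attention function then runs on the full sequence for its subset of heads, and a second all-to-all restores the sequence-sharded layout for the rest of the layer.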

The class supports GQA/MQA (grouped-query and multi-query attention) through separate query and key/value head counts (attn_head_count and kv_head_count), and it supports variable-length sequences. When register_with_transformers() is called, it:

  1. Initializes sequence-parallel process groups via parallel_state_sp.initialize_sequence_parallel()
  2. Reads model configuration (head counts, head size, number of layers) from the HuggingFace config
  3. Creates a UlyssesSPAttentionHF instance wrapping the specified core attention implementation
  4. Overrides all entries in HuggingFace's ALL_ATTENTION_FUNCTIONS registry with the Ulysses wrapper, ensuring all attention layers in the model use SP communication
  5. Returns an mpu object providing get_sequence_parallel_world_size(), get_sequence_parallel_group(), and related methods

This approach (overriding all attention function entries rather than registering a new one) ensures that HuggingFace's internal branching logic for specific attention implementations (e.g., special handling for flash_attention_2) is preserved.
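The override strategy described above can be sketched with a plain dictionary. This is an illustration only: the registry dict is a hypothetical stand-in for HuggingFace's ALL_ATTENTION_FUNCTIONS, the attention functions are dummies that return labels, and make_sp_wrapper and override_all are made-up names, not DeepSpeed functions.

```python
from typing import Callable, Dict

# Hypothetical stand-in for HuggingFace's ALL_ATTENTION_FUNCTIONS registry.
registry: Dict[str, Callable] = {
    "sdpa": lambda q, k, v: "sdpa-output",
    "flash_attention_2": lambda q, k, v: "fa2-output",
}


def make_sp_wrapper(core_attn: Callable) -> Callable:
    """Wrap a core attention function with placeholder SP communication."""

    def wrapped(q, k, v):
        # A real wrapper would run an all-to-all on q, k, v here ...
        out = core_attn(q, k, v)
        # ... and a second all-to-all on the output here.
        return ("sp", out)

    return wrapped


def override_all(reg: Dict[str, Callable], core_name: str) -> Callable:
    """Point every registry key at one wrapper around the chosen core function."""
    # Look up the core function before any entry is overwritten.
    wrapper = make_sp_wrapper(reg[core_name])
    for key in list(reg):
        reg[key] = wrapper
    return wrapper


wrapper = override_all(registry, "sdpa")
print(registry["flash_attention_2"](None, None, None))
```

Because the keys are left untouched, code that branches on the configured implementation name still sees the name it expects, while every lookup resolves to the same wrapper.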

Code Reference

Direct Init Signature

UlyssesSPAttentionHF(
    attn,                          # core attention function
    batch_size: int,               # micro batch size
    attn_head_count: int,          # total number of Q attention heads
    attn_head_size: int,           # size of each attention head
    kv_head_count: int,            # total number of KV heads
    num_hidden_layers: int,        # total number of transformer layers
    process_group: ProcessGroup,   # Ulysses SP process group
    seq_length_is_variable: bool = False,
    local_seq_length: int = None,
    global_seq_length: int = None,
    disable_in_eval: bool = False,
)

Auto-Registration Signature

UlyssesSPAttentionHF.register_with_transformers(
    model_name_or_path,            # model object, HF hub name, or local path
    core_attn_implementation,      # 'flash_attention_2' | 'flash_attention_3' | 'sdpa'
    sequence_parallel_size: int,
    micro_batch_size: int,
    seq_length: int = None,
    seq_length_is_variable: bool = True,
    disable_in_eval: bool = False,
) -> mpu

Import

from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPAttentionHF

I/O Contract

Inputs (register_with_transformers)

Parameter | Type | Required | Description
model_name_or_path | str or model object | Yes | HuggingFace model name, local path, or pre-loaded model object
core_attn_implementation | str | Yes | One of 'flash_attention_2', 'flash_attention_3', or 'sdpa'
sequence_parallel_size | int | Yes | Number of GPUs in each SP group
micro_batch_size | int | Yes | Micro batch size per GPU
seq_length | int | No | Fixed sequence length (required if seq_length_is_variable=False)
seq_length_is_variable | bool | No | Whether sequence length varies between batches (default: True)
disable_in_eval | bool | No | Skip SP operations during eval mode (default: False)
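The input contract in the table can be checked before calling into DeepSpeed. The sketch below is a hypothetical pre-flight helper, not a DeepSpeed function: check_register_args and SUPPORTED_CORE_ATTN are made-up names, and the positivity checks on the two size arguments are assumptions beyond what the table states.

```python
from typing import Optional

# Values listed for core_attn_implementation in the table above.
SUPPORTED_CORE_ATTN = ("flash_attention_2", "flash_attention_3", "sdpa")


def check_register_args(
    core_attn_implementation: str,
    sequence_parallel_size: int,
    micro_batch_size: int,
    seq_length: Optional[int] = None,
    seq_length_is_variable: bool = True,
) -> None:
    """Raise ValueError if the arguments violate the documented contract."""
    if core_attn_implementation not in SUPPORTED_CORE_ATTN:
        raise ValueError(
            f"core_attn_implementation must be one of {SUPPORTED_CORE_ATTN}"
        )
    # Assumption: both sizes must be positive integers.
    if sequence_parallel_size < 1 or micro_batch_size < 1:
        raise ValueError("sizes must be positive")
    # Documented rule: a fixed-length setup needs an explicit seq_length.
    if not seq_length_is_variable and seq_length is None:
        raise ValueError(
            "seq_length is required when seq_length_is_variable=False"
        )


# Variable-length sequences: seq_length may be omitted.
check_register_args("sdpa", sequence_parallel_size=4, micro_batch_size=1)
```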

Outputs

Output | Type | Description
mpu | object | Model parallel unit providing get_sequence_parallel_world_size(), get_sequence_parallel_group(), get_sequence_parallel_rank(), and related methods

Usage Example

import deepspeed
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPAttentionHF

# Auto-register with all attention layers in a HuggingFace model
mpu = UlyssesSPAttentionHF.register_with_transformers(
    model_name_or_path="meta-llama/Llama-2-7b-hf",
    core_attn_implementation="flash_attention_2",
    sequence_parallel_size=4,
    micro_batch_size=1,
    seq_length_is_variable=True,
    disable_in_eval=True,
)

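# The mpu object is then passed to deepspeed.initialize().
# `model` (the loaded HuggingFace model) and `ds_config` (the DeepSpeed
# config) are assumed to be defined earlier in the script.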
engine, _, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config,
    mesh_param=(2, 4),
    mpu=mpu,
)
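As a rough guide to the sizes in the example, the sketch below shows how 8 ranks could be split into SP groups of 4. It assumes that mesh_param=(2, 4) reads as (data-parallel size, sequence-parallel size) and that SP groups are made of contiguous ranks. Both are assumptions for illustration, and sp_groups is a hypothetical helper, not a DeepSpeed function.

```python
def sp_groups(world_size: int, sp_size: int) -> list:
    """Split ranks 0..world_size-1 into contiguous groups of sp_size ranks.

    Assumption: contiguous grouping; the real layout may differ.
    """
    if world_size % sp_size != 0:
        raise ValueError("world_size must be divisible by sp_size")
    return [
        list(range(start, start + sp_size))
        for start in range(0, world_size, sp_size)
    ]


dp_size, sp_size = 2, 4  # assumed reading of mesh_param=(2, 4)
groups = sp_groups(world_size=dp_size * sp_size, sp_size=sp_size)
print(groups)
```

Under these assumptions, each group of 4 ranks shares one sequence, and the 2 groups process different samples in a data-parallel fashion.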

Related Pages

Knowledge Sources

Last updated: 2026-02-09 00:00 GMT
