Implementation:Deepspeedai DeepSpeed UlyssesSPAttentionHF Init
Overview
A concrete tool, provided by the DeepSpeed library, for wrapping HuggingFace attention layers with Ulysses sequence-parallel all-to-all communication.
Description
UlyssesSPAttentionHF wraps HuggingFace transformer attention layers to add all-to-all communication for sequence parallelism. It can be applied manually (wrapping individual attention functions) or automatically via register_with_transformers() which patches all attention layers in a HuggingFace model.
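To make the all-to-all step concrete, the following is a minimal pure-Python simulation of the data movement only. It is not DeepSpeed code: the function names are hypothetical, the batch and head-size dimensions are omitted, and each tensor element is replaced by a (token, head) label so the layout can be inspected. It assumes the sequence length and head count both divide evenly by the SP size.

```python
def shard_sequence(tokens, heads, sp):
    """Each rank starts with a contiguous sequence slice and ALL heads."""
    local = tokens // sp
    return [
        [[(t, h) for h in range(heads)] for t in range(r * local, (r + 1) * local)]
        for r in range(sp)
    ]


def all_to_all_seq_to_head(per_rank, sp):
    """Simulated all-to-all: rank j receives head chunk j from every rank."""
    heads = len(per_rank[0][0])
    heads_per_rank = heads // sp
    out = []
    for dst in range(sp):
        rows = []
        for src in range(sp):  # visiting sources in rank order restores token order
            for token_row in per_rank[src]:
                rows.append(token_row[dst * heads_per_rank:(dst + 1) * heads_per_rank])
        out.append(rows)
    return out


def all_to_all_head_to_seq(per_rank, sp):
    """Inverse all-to-all: rank j gets back its sequence slice with all heads."""
    tokens = len(per_rank[0])
    local = tokens // sp
    out = []
    for dst in range(sp):
        rows = []
        for t in range(dst * local, (dst + 1) * local):
            row = []
            for src in range(sp):  # gather this token's head chunks from every rank
                row.extend(per_rank[src][t])
            rows.append(row)
        out.append(rows)
    return out


before = shard_sequence(tokens=8, heads=4, sp=2)  # per rank: 4 tokens x 4 heads
during = all_to_all_seq_to_head(before, sp=2)     # per rank: 8 tokens x 2 heads
after = all_to_all_head_to_seq(during, sp=2)      # per rank: 4 tokens x 4 heads again
```

Between the two exchanges every rank holds the full sequence for a subset of heads, which is what lets an unmodified core attention function run locally.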
The class supports GQA/MQA (grouped-query and multi-query attention) by taking separate query and KV head counts (attn_head_count and kv_head_count), and it supports variable-length sequences. When register_with_transformers() is called, it:
- Initializes sequence-parallel process groups via parallel_state_sp.initialize_sequence_parallel()
- Reads model configuration (head counts, head size, number of layers) from the HuggingFace config
- Creates a UlyssesSPAttentionHF instance wrapping the specified core attention implementation
- Overrides all entries in HuggingFace's ALL_ATTENTION_FUNCTIONS registry with the Ulysses wrapper, ensuring all attention layers in the model use SP communication
- Returns an mpu object providing get_sequence_parallel_world_size(), get_sequence_parallel_group(), and related methods
This approach (overriding all attention function entries rather than registering a new one) ensures that HuggingFace's internal branching logic for specific attention implementations (e.g., special handling for flash_attention_2) is preserved.
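The override strategy can be illustrated with a toy registry. This is a sketch under stated assumptions, not the DeepSpeed implementation: the dictionary below is a hypothetical stand-in for ALL_ATTENTION_FUNCTIONS, and make_sp_wrapper and override_all are invented names. The point it shows is that every key survives, so code that branches on the configured implementation name still finds its entry, while every call is routed through the same wrapper.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for HuggingFace's ALL_ATTENTION_FUNCTIONS registry.
registry: Dict[str, Callable] = {
    "sdpa": lambda q, k, v: ("sdpa", q, k, v),
    "flash_attention_2": lambda q, k, v: ("flash_attention_2", q, k, v),
}

calls: List[str] = []  # records wrapper activity so the effect is observable


def make_sp_wrapper(core_attn: Callable) -> Callable:
    """Wrap a core attention function; a real wrapper would communicate here."""
    def wrapped(q, k, v):
        calls.append("pre")   # placeholder for the all-to-all before attention
        out = core_attn(q, k, v)
        calls.append("post")  # placeholder for the all-to-all after attention
        return out
    return wrapped


def override_all(reg: Dict[str, Callable], core_name: str) -> Callable:
    """Point every existing key at one wrapper around the chosen core function."""
    wrapper = make_sp_wrapper(reg[core_name])  # capture the core before overwriting
    for key in list(reg):
        reg[key] = wrapper
    return wrapper


override_all(registry, "sdpa")
result = registry["flash_attention_2"]("q", "k", "v")  # key still exists, wrapper runs
```

Registering the wrapper under a new key instead would require the model to be configured with that new name, which is the situation the override approach avoids.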
Code Reference
- Repository: https://github.com/deepspeedai/DeepSpeed
- File: deepspeed/runtime/sequence_parallel/ulysses_sp.py
- Lines: L49-484 (class), L90-160 (__init__), L355-484 (register_with_transformers)
Direct Init Signature
UlyssesSPAttentionHF(
attn, # core attention function
batch_size: int, # micro batch size
attn_head_count: int, # total number of Q attention heads
attn_head_size: int, # size of each attention head
kv_head_count: int, # total number of KV heads
num_hidden_layers: int, # total number of transformer layers
process_group: ProcessGroup, # Ulysses SP process group
seq_length_is_variable: bool = False,
local_seq_length: int = None,
global_seq_length: int = None,
disable_in_eval: bool = False,
)
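The separate attn_head_count and kv_head_count arguments matter because the two kinds of heads may be split differently across the SP group. The helper below is a hypothetical sketch, not taken from ulysses_sp.py. It assumes query heads must divide evenly across ranks and that KV heads are replicated when there are fewer KV heads than ranks; the actual rules in DeepSpeed may differ.

```python
def local_head_counts(attn_head_count: int, kv_head_count: int, sp_world_size: int):
    """Return (local_q_heads, local_kv_heads, kv_replication_factor) per rank.

    Assumed rules (illustrative only): Q heads split evenly; KV heads split
    evenly when there are enough of them, otherwise each rank keeps one KV
    head and each KV head is shared by several ranks.
    """
    if attn_head_count % sp_world_size != 0:
        raise ValueError("attn_head_count must be divisible by the SP world size")
    local_q = attn_head_count // sp_world_size

    if kv_head_count >= sp_world_size:
        if kv_head_count % sp_world_size != 0:
            raise ValueError("kv_head_count must be divisible by the SP world size")
        return local_q, kv_head_count // sp_world_size, 1

    if sp_world_size % kv_head_count != 0:
        raise ValueError("SP world size must be a multiple of kv_head_count")
    return local_q, 1, sp_world_size // kv_head_count


gqa = local_head_counts(attn_head_count=32, kv_head_count=8, sp_world_size=4)
gqa_few_kv = local_head_counts(attn_head_count=32, kv_head_count=4, sp_world_size=8)
mqa = local_head_counts(attn_head_count=32, kv_head_count=1, sp_world_size=4)
```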
Auto-Registration Signature
UlyssesSPAttentionHF.register_with_transformers(
model_name_or_path, # model object, HF hub name, or local path
core_attn_implementation, # 'flash_attention_2' | 'flash_attention_3' | 'sdpa'
sequence_parallel_size: int,
micro_batch_size: int,
seq_length: int = None,
seq_length_is_variable: bool = True,
disable_in_eval: bool = False,
) -> mpu
Import
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPAttentionHF
I/O Contract
Inputs (register_with_transformers)
| Parameter | Type | Required | Description |
|---|---|---|---|
| model_name_or_path | str or model object | Yes | HuggingFace model name, local path, or pre-loaded model object |
| core_attn_implementation | str | Yes | One of 'flash_attention_2', 'flash_attention_3', or 'sdpa' |
| sequence_parallel_size | int | Yes | Number of GPUs in each SP group |
| micro_batch_size | int | Yes | Micro batch size per GPU |
| seq_length | int | No | Fixed sequence length (required if seq_length_is_variable=False) |
| seq_length_is_variable | bool | No | Whether sequence length varies between batches (default: True) |
| disable_in_eval | bool | No | Skip SP operations during eval mode (default: False) |
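The constraints in the table can be checked before any distributed setup is attempted. The function below is a hypothetical pre-flight check written for this page, not part of DeepSpeed. It encodes only what the table states: the three accepted values of core_attn_implementation, and the rule that seq_length must be given when seq_length_is_variable=False. The positivity checks are an added assumption.

```python
from typing import Optional

ALLOWED_CORE_ATTN = ("flash_attention_2", "flash_attention_3", "sdpa")


def check_register_args(
    core_attn_implementation: str,
    sequence_parallel_size: int,
    micro_batch_size: int,
    seq_length: Optional[int] = None,
    seq_length_is_variable: bool = True,
) -> None:
    """Raise ValueError if the arguments contradict the documented contract."""
    if core_attn_implementation not in ALLOWED_CORE_ATTN:
        raise ValueError(f"core_attn_implementation must be one of {ALLOWED_CORE_ATTN}")
    # Assumption: both sizes are expected to be positive integers.
    if sequence_parallel_size < 1 or micro_batch_size < 1:
        raise ValueError("sequence_parallel_size and micro_batch_size must be >= 1")
    if not seq_length_is_variable and seq_length is None:
        raise ValueError("seq_length is required when seq_length_is_variable=False")


# Valid: variable-length sequences need no fixed seq_length.
check_register_args("sdpa", sequence_parallel_size=4, micro_batch_size=1)
```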
Outputs
| Output | Type | Description |
|---|---|---|
| mpu | object | Model parallel unit providing get_sequence_parallel_world_size(), get_sequence_parallel_group(), get_sequence_parallel_rank(), and related methods |
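As an illustration of what the mpu accessors report, the toy class below mimics three of the method names from the table without any distributed backend. It is a sketch under a loud assumption: SP groups are taken to be blocks of consecutive global ranks, which may not match how DeepSpeed lays out its groups. The real get_sequence_parallel_group() returns a process group object, whereas this stand-in returns a plain list of ranks.

```python
from typing import List


def sp_groups(world_size: int, sp_size: int) -> List[List[int]]:
    """Partition global ranks into SP groups of consecutive ranks (assumed layout)."""
    if world_size % sp_size != 0:
        raise ValueError("world_size must be divisible by sp_size")
    return [list(range(start, start + sp_size)) for start in range(0, world_size, sp_size)]


class ToyMpu:
    """Hypothetical stand-in exposing mpu-style accessors for one global rank."""

    def __init__(self, rank: int, world_size: int, sp_size: int):
        self._rank = rank
        self._group = next(g for g in sp_groups(world_size, sp_size) if rank in g)

    def get_sequence_parallel_world_size(self) -> int:
        return len(self._group)

    def get_sequence_parallel_group(self) -> List[int]:
        # A list of global ranks here; a real mpu would return a process group.
        return self._group

    def get_sequence_parallel_rank(self) -> int:
        return self._group.index(self._rank)


# 8 GPUs with sequence_parallel_size=4 gives two SP groups under this layout.
toy = ToyMpu(rank=5, world_size=8, sp_size=4)
```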
Usage Example
import deepspeed
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPAttentionHF
# Auto-register with all attention layers in a HuggingFace model
mpu = UlyssesSPAttentionHF.register_with_transformers(
model_name_or_path="meta-llama/Llama-2-7b-hf",
core_attn_implementation="flash_attention_2",
sequence_parallel_size=4,
micro_batch_size=1,
seq_length_is_variable=True,
disable_in_eval=True,
)
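# The mpu object is then passed to deepspeed.initialize().
# `model` and `ds_config` are not defined in this example; they are assumed
# to have been created beforehand.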
engine, _, _, _ = deepspeed.initialize(
model=model,
config=ds_config,
mesh_param=(2, 4),
mpu=mpu,
)
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/
- https://arxiv.org/abs/2309.14509
Last updated: 2026-02-09 00:00 GMT