Implementation:Hiyouga LLaMA Factory LongLoRA
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Attention Mechanisms |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Implementation of shifted sparse attention (S2-Attn) from the LongLoRA paper for efficient long-context training in LLaMA-Factory.
Description
The longlora module implements the shifted sparse attention (S2-Attn) mechanism proposed in the LongLoRA paper, also referred to as shift short attention. It monkey-patches the forward methods of three LLaMA attention classes: vanilla attention, Flash Attention 2, and SDPA (Scaled Dot-Product Attention). Each patched forward splits the attention heads into two halves, cyclically shifts the tokens seen by one half, computes attention only within local groups whose size is seq_len * group_size_ratio, and then shifts the outputs back. Per-head attention cost drops from O(n^2) to O(n * group_size); with group_size_ratio = 0.25 that is roughly a quarter of the full score computation, which makes training on longer contexts cheaper. The module requires transformers version >=4.45.0 and <4.48.0.
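The grouping and shifting described above can be illustrated without tensors. The following is a minimal sketch, not the code in longlora.py: `s2_attn_groups` and `score_counts` are hypothetical helpers, and the shift of half a group is an assumption taken from the LongLoRA paper's description.

```python
def s2_attn_groups(seq_len, group_size_ratio=0.25, shifted=False):
    """Return the token positions that attend to each other, one list per group.

    Hypothetical helper for illustration only; it mirrors the idea of S2-Attn,
    not the actual tensor code in longlora.py.
    """
    group_size = int(seq_len * group_size_ratio)
    if group_size < 1 or seq_len % group_size != 0:
        raise ValueError("seq_len must be a positive multiple of the group size")

    positions = list(range(seq_len))
    if shifted:
        # Assumption (following the paper): shift cyclically by half a group.
        offset = group_size // 2
        positions = positions[offset:] + positions[:offset]

    # Cut the (possibly shifted) sequence into consecutive groups.
    return [positions[i:i + group_size] for i in range(0, seq_len, group_size)]


def score_counts(seq_len, group_size_ratio=0.25):
    """Per-head number of query-key scores: (full attention, grouped attention)."""
    group_size = int(seq_len * group_size_ratio)
    return seq_len * seq_len, seq_len * group_size


if __name__ == "__main__":
    print(s2_attn_groups(16))                # groups seen by the unshifted heads
    print(s2_attn_groups(16, shifted=True))  # groups seen by the shifted heads
    print(score_counts(8192))                # full vs. grouped score counts
```

In the shifted variant each group straddles two neighbouring unshifted groups, which is how information flows across group boundaries; the last shifted group wraps around to the start of the sequence.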
Usage
S2-Attn is activated by setting shift_attn=True in ModelArguments. The configure_longlora function is called during model setup and applies the attention patches only when the model is being trained (is_trainable=True) and only for supported model types (those listed in SUPPORTED_CLASS_FOR_S2ATTN).
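The three conditions above (flag set, training mode, supported model type) amount to a simple guard. A minimal sketch of that decision, assuming hypothetical names throughout: `DemoArgs` stands in for ModelArguments, and the contents of `SUPPORTED_MODEL_TYPES` are a placeholder rather than the real SUPPORTED_CLASS_FOR_S2ATTN.

```python
from dataclasses import dataclass

# Assumption: placeholder contents; the real set is defined in LLaMA-Factory.
SUPPORTED_MODEL_TYPES = {"llama"}


@dataclass
class DemoArgs:
    """Hypothetical stand-in for ModelArguments, keeping only the relevant flag."""
    shift_attn: bool = False


def should_patch(model_type, args, is_trainable):
    """Return True only when S2-Attn would be applied under the rules above."""
    # Skip when not training or when the user did not request shifted attention.
    if not is_trainable or not args.shift_attn:
        return False
    # Skip model types that have no patched attention implementation.
    return model_type in SUPPORTED_MODEL_TYPES


if __name__ == "__main__":
    print(should_patch("llama", DemoArgs(shift_attn=True), is_trainable=True))
    print(should_patch("llama", DemoArgs(shift_attn=True), is_trainable=False))
```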
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/model/model_utils/longlora.py
- Lines: 1-370
Signature
def llama_attention_forward(
    self: "LlamaAttention",
    hidden_states: "torch.Tensor",
    attention_mask: Optional["torch.Tensor"] = None,
    position_ids: Optional["torch.LongTensor"] = None,
    past_key_value: Optional["Cache"] = None,
    output_attentions: bool = False,
    cache_position: Optional["torch.LongTensor"] = None,
    position_embeddings: Optional[tuple["torch.Tensor", "torch.Tensor"]] = None,
    **kwargs,
) -> tuple["torch.Tensor", Optional["torch.Tensor"], Optional[tuple["torch.Tensor"]]]: ...

def llama_flash_attention_2_forward(
    self: "LlamaFlashAttention2",
    hidden_states: "torch.Tensor",
    ...
) -> tuple["torch.Tensor", Optional["torch.Tensor"], Optional[tuple["torch.Tensor"]]]: ...

def llama_sdpa_attention_forward(
    self: "LlamaSdpaAttention",
    hidden_states: "torch.Tensor",
    ...
) -> tuple["torch.Tensor", Optional["torch.Tensor"], Optional[tuple["torch.Tensor"]]]: ...

def configure_longlora(
    config: "PretrainedConfig",
    model_args: "ModelArguments",
    is_trainable: bool,
) -> None:
    """Configure LongLoRA shifted attention if enabled."""
Import
from llamafactory.model.model_utils.longlora import configure_longlora
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | PretrainedConfig | Yes | Model configuration object; group_size_ratio is set on it |
| model_args | ModelArguments | Yes | Must have shift_attn=True to activate |
| is_trainable | bool | Yes | S2-Attn is applied only when this is True (training) |
| hidden_states | torch.Tensor | Yes (forward) | Input tensor of shape (batch, seq_len, hidden_size) |
| attention_mask | torch.Tensor | No (forward) | Causal attention mask |
Outputs
| Name | Type | Description |
|---|---|---|
| configure_longlora | None | Side effect: patches the forward methods of LlamaAttention, LlamaFlashAttention2, and LlamaSdpaAttention, and sets group_size_ratio on config |
| attention forward | tuple | (attn_output, attn_weights, past_key_value) with shifted sparse attention applied |
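To relate the hidden_states shape in the inputs table to the grouped computation, the shape bookkeeping can be sketched as follows. This is an illustration under assumptions: `grouped_shapes` is a hypothetical helper, and the (batch, heads, seq, head_dim) layout is assumed for readability rather than taken from the source file.

```python
def grouped_shapes(batch, seq_len, num_heads, head_dim, group_size_ratio=0.25):
    """Return (shape before grouping, shape after grouping) for one projection.

    Hypothetical helper; the axis order is an assumption made for illustration.
    """
    group_size = int(seq_len * group_size_ratio)
    if group_size < 1 or seq_len % group_size != 0:
        raise ValueError("seq_len must be a positive multiple of the group size")

    num_groups = seq_len // group_size
    before = (batch, num_heads, seq_len, head_dim)
    # Groups are folded into the batch axis, so each group is attended
    # independently and the sequence axis shrinks to one group.
    after = (batch * num_groups, num_heads, group_size, head_dim)
    return before, after


if __name__ == "__main__":
    print(grouped_shapes(batch=2, seq_len=8192, num_heads=32, head_dim=128))
```

The total number of elements is unchanged; only the sequence axis is split, which is why the sequence length needs to be a multiple of the group size in this sketch.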
Usage Examples
from transformers import AutoConfig
from llamafactory.hparams import ModelArguments
from llamafactory.model.model_utils.longlora import configure_longlora

config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
model_args = ModelArguments(
    model_name_or_path="meta-llama/Llama-2-7b-hf",
    shift_attn=True,
)

# Apply LongLoRA patches (only during training)
configure_longlora(config, model_args, is_trainable=True)

# config.group_size_ratio is now set to 0.25
# LlamaAttention.forward methods are monkey-patched with shifted attention
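
Because the module requires transformers >=4.45.0 and <4.48.0, it can help to check the installed version before running the example above. A minimal sketch, assuming a hypothetical helper `transformers_version_supported` that only compares the numeric major.minor.patch prefix and ignores pre-release suffixes:

```python
import re


def transformers_version_supported(version):
    """Check a version string against the documented range >=4.45.0,<4.48.0.

    Hypothetical helper: only the leading numeric components are compared,
    so suffixes such as ".dev0" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    parsed = tuple(int(part or 0) for part in match.groups())
    return (4, 45, 0) <= parsed < (4, 48, 0)


if __name__ == "__main__":
    for candidate in ("4.44.2", "4.45.0", "4.47.1", "4.48.0"):
        print(candidate, transformers_version_supported(candidate))
```

In practice the argument would be `transformers.__version__`; for full PEP 440 handling, a dedicated version-parsing library is a better fit than this regular expression.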