Implementation:Hiyouga LLaMA Factory LongLoRA

From Leeroopedia


Knowledge Sources
Domains: Machine Learning, Attention Mechanisms
Last Updated: 2026-02-06 19:00 GMT

Overview

Implementation of shifted sparse attention (S2-Attn) from the LongLoRA paper for efficient long-context training in LLaMA-Factory.

Description

The longlora module implements the shifted sparse attention (S2-Attn) mechanism proposed in the LongLoRA paper. It monkey-patches the forward methods of three LLaMA attention classes: vanilla attention (LlamaAttention), Flash Attention 2 (LlamaFlashAttention2), and SDPA (scaled dot-product attention, LlamaSdpaAttention).

Each patched forward splits the sequence into local windows (groups) whose size is set by group_size_ratio, splits the attention heads into two halves, cyclically shifts the tokens seen by one half of the heads by half a group along the sequence dimension, computes attention within each window, and then shifts the result back. The shift lets information flow between neighbouring windows.

For a sequence of length n, this reduces the attention cost from O(n^2) to O(n * window_size), which makes training with much longer contexts practical. The module requires transformers version >=4.45.0 and <4.48.0.
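The grouping and shifting described above can be sketched in plain Python. This is a minimal illustration rather than the module's code: the helper name s2_attn_groups is hypothetical, it tracks token positions instead of tensors, and it assumes the group size is int(seq_len * group_size_ratio) and divides the sequence length evenly.

```python
def s2_attn_groups(seq_len: int, group_size_ratio: float, shifted: bool) -> list[list[int]]:
    """Return, for one head, the token positions that attend to each other.

    Hypothetical helper for illustration only; it is not part of LLaMA-Factory.
    """
    group_size = int(seq_len * group_size_ratio)
    if group_size == 0 or seq_len % group_size != 0:
        raise ValueError("seq_len must be a positive multiple of the group size")

    positions = list(range(seq_len))
    if shifted:
        # Cyclic shift by half a group, as applied to one half of the heads.
        half = group_size // 2
        positions = positions[half:] + positions[:half]

    # Attention is then computed independently inside each consecutive chunk.
    return [positions[i:i + group_size] for i in range(0, seq_len, group_size)]


# Unshifted heads see aligned windows; shifted heads see windows that
# straddle the boundaries of the unshifted ones (the last one wraps around).
print(s2_attn_groups(8, 0.5, shifted=False))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(s2_attn_groups(8, 0.5, shifted=True))   # [[2, 3, 4, 5], [6, 7, 0, 1]]
```

Because the shifted windows overlap two unshifted windows each, combining both halves of the heads gives every token a path to tokens in the neighbouring window.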

Usage

S2-Attn is activated by setting shift_attn=True in ModelArguments. The configure_longlora function is called during model setup and applies the attention patches only when the model is being trained and only for supported model types (those listed in SUPPORTED_CLASS_FOR_S2ATTN).
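The three conditions above (the flag is set, the model is being trained, and the model type is supported) can be expressed as a small predicate. The sketch below is an assumption-laden stand-in, not the library's implementation: should_apply_s2attn and SUPPORTED_MODEL_TYPES are hypothetical names, the set contents are assumed, and SimpleNamespace objects stand in for the real config and ModelArguments.

```python
from types import SimpleNamespace

# Assumption: a placeholder for SUPPORTED_CLASS_FOR_S2ATTN; the real contents
# live in LLaMA-Factory and may differ.
SUPPORTED_MODEL_TYPES = {"llama"}


def should_apply_s2attn(config, model_args, is_trainable: bool) -> bool:
    """Hypothetical predicate mirroring the gating described in the text."""
    if not is_trainable or not model_args.shift_attn:
        return False  # patches are skipped at inference time or when disabled
    return getattr(config, "model_type", None) in SUPPORTED_MODEL_TYPES


llama_cfg = SimpleNamespace(model_type="llama")
other_cfg = SimpleNamespace(model_type="some-other-arch")
args_on = SimpleNamespace(shift_attn=True)

print(should_apply_s2attn(llama_cfg, args_on, is_trainable=True))   # True
print(should_apply_s2attn(llama_cfg, args_on, is_trainable=False))  # False
print(should_apply_s2attn(other_cfg, args_on, is_trainable=True))   # False
```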

Code Reference

Source Location

Signature

def llama_attention_forward(
    self: "LlamaAttention",
    hidden_states: "torch.Tensor",
    attention_mask: Optional["torch.Tensor"] = None,
    position_ids: Optional["torch.LongTensor"] = None,
    past_key_value: Optional["Cache"] = None,
    output_attentions: bool = False,
    cache_position: Optional["torch.LongTensor"] = None,
    position_embeddings: Optional[tuple["torch.Tensor", "torch.Tensor"]] = None,
    **kwargs,
) -> tuple["torch.Tensor", Optional["torch.Tensor"], Optional[tuple["torch.Tensor"]]]: ...

def llama_flash_attention_2_forward(
    self: "LlamaFlashAttention2",
    hidden_states: "torch.Tensor",
    ...
) -> tuple["torch.Tensor", Optional["torch.Tensor"], Optional[tuple["torch.Tensor"]]]: ...

def llama_sdpa_attention_forward(
    self: "LlamaSdpaAttention",
    hidden_states: "torch.Tensor",
    ...
) -> tuple["torch.Tensor", Optional["torch.Tensor"], Optional[tuple["torch.Tensor"]]]: ...

def configure_longlora(
    config: "PretrainedConfig",
    model_args: "ModelArguments",
    is_trainable: bool,
) -> None:
    """Configure LongLoRA shifted attention if enabled."""

Import

from llamafactory.model.model_utils.longlora import configure_longlora
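Since the Description states that the module requires transformers >=4.45.0 and <4.48.0, it may be useful to check the installed version before relying on the patches. The helper below is a hypothetical, standard-library-only sketch; it compares only the numeric major.minor.patch components and ignores pre-release suffixes, so it is looser than a full PEP 440 comparison.

```python
import re


def transformers_version_supported(version: str) -> bool:
    """Hypothetical check that `version` lies in the range [4.45.0, 4.48.0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # A missing patch component is treated as 0; suffixes like ".dev0" are ignored.
    parsed = tuple(int(part or 0) for part in match.groups())
    return (4, 45, 0) <= parsed < (4, 48, 0)


print(transformers_version_supported("4.46.1"))  # True
print(transformers_version_supported("4.48.0"))  # False
```

In practice the argument would be transformers.__version__ from the installed package.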

I/O Contract

Inputs

Name | Type | Required | Description
config | PretrainedConfig | Yes | Model configuration object; group_size_ratio is set on it
model_args | ModelArguments | Yes | Must have shift_attn=True to activate
is_trainable | bool | Yes | S2-Attn only applied during training (True)
hidden_states | torch.Tensor | Yes (forward) | Input tensor of shape (batch, seq_len, hidden_size)
attention_mask | torch.Tensor | No (forward) | Causal attention mask
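To make the effect of group_size_ratio on these inputs concrete, the sketch below computes the shape a per-head query/key/value tensor would take once the sequence is cut into groups, and compares the number of attention scores per head with and without grouping. Both function names are hypothetical, and the (batch * n_groups, group_size, num_heads, head_dim) layout is an assumption made for illustration.

```python
def s2_attn_grouped_shape(
    bsz: int, q_len: int, num_heads: int, head_dim: int, group_size_ratio: float
) -> tuple[int, int, int, int]:
    """Hypothetical helper: shape after folding groups into the batch dimension."""
    group_size = int(q_len * group_size_ratio)
    if group_size == 0 or q_len % group_size != 0:
        raise ValueError("q_len must be a positive multiple of the group size")
    n_groups = q_len // group_size
    return (bsz * n_groups, group_size, num_heads, head_dim)


def attention_score_counts(q_len: int, group_size_ratio: float) -> tuple[int, int]:
    """Return (full, grouped) counts of query-key scores per head."""
    group_size = int(q_len * group_size_ratio)
    n_groups = q_len // group_size
    # Full attention scores every pair; grouped attention only pairs within a group.
    return q_len * q_len, n_groups * group_size * group_size


print(s2_attn_grouped_shape(2, 8192, 32, 128, 0.25))  # (8, 2048, 32, 128)
print(attention_score_counts(8192, 0.25))             # (67108864, 16777216)
```

With a ratio of 0.25 the grouped score count is a quarter of the full count, which matches the O(n * window_size) cost given in the Description.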

Outputs

Name | Type | Description
configure_longlora | None | Side effect: patches LlamaAttention forward methods and sets group_size_ratio on config
attention forward | tuple | (attn_output, attn_weights, past_key_value) with shifted sparse attention applied

Usage Examples

from transformers import AutoConfig
from llamafactory.hparams import ModelArguments
from llamafactory.model.model_utils.longlora import configure_longlora

config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
model_args = ModelArguments(
    model_name_or_path="meta-llama/Llama-2-7b-hf",
    shift_attn=True,
)

# Apply LongLoRA patches (only during training)
configure_longlora(config, model_args, is_trainable=True)
# config.group_size_ratio is now set to 0.25
# The forward methods of LlamaAttention, LlamaFlashAttention2 and
# LlamaSdpaAttention are now monkey-patched with shifted attention
