Implementation:Hiyouga LLaMA Factory Sequence Packing

From Leeroopedia
Revision as of 15:07, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Hiyouga_LLaMA_Factory_Sequence_Packing.md)


Knowledge Sources
Domains Training Efficiency, Attention Mechanisms
Last Updated 2026-02-06 19:00 GMT

Overview

Enables sequence packing with block-diagonal attention masking, so that multiple short sequences can share a single row of a batch without tokens attending across sequence boundaries.
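To make "block-diagonal" concrete, the following is a minimal sketch in plain Python that materializes the attention pattern implied by one packed row. The helper name `block_diagonal_mask` and the `causal` flag are illustrative assumptions, not part of LLaMA Factory; FlashAttention's variable-length API does not build such a matrix explicitly, so this only visualizes which token pairs are allowed.

```python
def block_diagonal_mask(packed_row, causal=True):
    """Return a 2-D list of booleans; entry [i][j] is True when token i may attend to token j.

    packed_row uses incrementing integer labels per packed sequence and 0 for padding.
    (Hypothetical helper for illustration only.)
    """
    n = len(packed_row)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Tokens may only see tokens carrying the same non-zero label.
            same_sequence = packed_row[i] != 0 and packed_row[i] == packed_row[j]
            if same_sequence and (j <= i or not causal):
                mask[i][j] = True
    return mask


row = [1, 1, 2, 2, 2, 0]  # two sequences (lengths 2 and 3) and one padding token
for line in block_diagonal_mask(row):
    print("".join("x" if allowed else "." for allowed in line))
```

Each packed sequence forms its own (here lower-triangular) block on the diagonal, and the padding position attends to nothing.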

Description

This module implements sequence packing support for FlashAttention's variable-length API. get_seqlens_in_batch extracts individual sequence lengths from packed attention masks where different sequences are labeled with incrementing integers (e.g., [1, 1, 2, 2, 2] represents two sequences of lengths 2 and 3). get_unpad_data computes the unpadded token indices, cumulative sequence lengths, and maximum sequence length needed by FlashAttention's varlen functions. configure_packing monkey-patches HuggingFace Transformers' internal _get_unpad_data function with the custom implementation, enabling block diagonal attention that prevents tokens from different packed sequences from attending to each other. This approach is based on the functionary library's packing implementation.
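The quantities described above can be sketched on plain Python lists. This is an illustration of the logic only, under the assumption that labels within a row start at 1 and that 0 marks padding; the function names below are hypothetical, and the actual module operates on torch tensors.

```python
from itertools import accumulate


def seqlens_from_packed_mask(mask_rows):
    """Count how many tokens carry each label, row by row (hypothetical helper)."""
    seqlens = []
    for row in mask_rows:
        for label in range(1, max(row) + 1):
            count = row.count(label)
            if count > 0:  # skip labels that do not occur in this row
                seqlens.append(count)
    return seqlens


def unpad_data_from_packed_mask(mask_rows):
    """Return (indices, cu_seqlens, max_seqlen) for a packed mask (hypothetical helper)."""
    seqlens = seqlens_from_packed_mask(mask_rows)
    flat = [label for row in mask_rows for label in row]
    # Positions of real (non-padding) tokens in the flattened batch.
    indices = [pos for pos, label in enumerate(flat) if label != 0]
    # Cumulative lengths with a leading 0, marking where each sequence starts and ends.
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)


mask = [[1, 1, 2, 2, 2]]  # the example from the description: lengths 2 and 3
print(seqlens_from_packed_mask(mask))
print(unpad_data_from_packed_mask(mask))
```

For the single-row mask from the description, the two sequences have lengths 2 and 3, so the cumulative boundaries are 0, 2, and 5.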

Usage

Enable this module by setting block_diag_attn: true in the model arguments for a training run. FlashAttention-2 must be installed and active. The data collator must produce attention masks that label each packed sequence with an incrementing integer (1, 2, 3, and so on, with 0 for padding) rather than plain 0/1 masks. configure_packing is called automatically during model configuration and only takes effect when packing is enabled.
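As an illustration of the mask format the collator is expected to produce, here is a minimal sketch that builds one packed row from a list of sequence lengths. The helper `packed_attention_mask` and its parameters are hypothetical and are not LLaMA Factory's collator; it only shows the labeling convention.

```python
def packed_attention_mask(seq_lengths, row_length):
    """Label each packed sequence with 1, 2, 3, ... and pad the rest of the row with 0.

    Hypothetical helper for illustration; not part of LLaMA Factory.
    """
    total = sum(seq_lengths)
    if total > row_length:
        raise ValueError("packed sequences do not fit in the row")
    mask = []
    for label, length in enumerate(seq_lengths, start=1):
        mask.extend([label] * length)
    mask.extend([0] * (row_length - total))  # padding positions
    return mask


print(packed_attention_mask([2, 3], row_length=6))     # [1, 1, 2, 2, 2, 0]
print(packed_attention_mask([1, 2, 3], row_length=6))  # [1, 2, 2, 3, 3, 3]
```

A plain 0/1 mask would collapse both rows to runs of ones, losing the boundaries between packed sequences.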

Code Reference

Source Location

Signature

def get_seqlens_in_batch(
    attention_mask: "torch.Tensor",
) -> "torch.Tensor":
    ...

def get_unpad_data(
    attention_mask: "torch.Tensor",
) -> tuple["torch.Tensor", "torch.Tensor", int]:
    ...

def configure_packing(
    model_args: "ModelArguments",
    is_trainable: bool,
) -> None:
    ...

Import

from llamafactory.model.model_utils.packing import get_seqlens_in_batch, get_unpad_data, configure_packing

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| attention_mask | torch.Tensor | Yes (for get_seqlens_in_batch, get_unpad_data) | Packed attention mask with shape [batch_size, seq_len] where values are incrementing integers per sequence (0 for padding) |
| model_args | ModelArguments | Yes (for configure_packing) | Model arguments; must have block_diag_attn set to True for packing to be enabled |
| is_trainable | bool | Yes (for configure_packing) | Whether the model is in training mode; packing is only configured during training |
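The way the two configure_packing inputs gate the monkey-patch can be sketched with stand-in objects. Everything below is an assumption for illustration: `FakeModelArguments`, `fake_library`, and `configure_packing_sketch` are hypothetical names, and the stand-in namespace merely plays the role of the Transformers module whose `_get_unpad_data` is replaced.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class FakeModelArguments:
    """Stand-in for ModelArguments; only the one field used here (hypothetical)."""
    block_diag_attn: bool = False


# Stand-in for the library module that owns the function being patched.
fake_library = SimpleNamespace(_get_unpad_data=lambda attention_mask: "library default")


def packing_aware_unpad_data(attention_mask):
    """Placeholder for the packing-aware replacement."""
    return "packing-aware replacement"


def configure_packing_sketch(model_args, is_trainable, target=fake_library):
    # Do nothing unless we are training and block-diagonal attention is requested.
    if not is_trainable or not model_args.block_diag_attn:
        return
    # Monkey-patch: rebind the attribute so later lookups find the replacement.
    target._get_unpad_data = packing_aware_unpad_data


configure_packing_sketch(FakeModelArguments(block_diag_attn=True), is_trainable=False)
print(fake_library._get_unpad_data(None))  # unchanged: not in training mode
configure_packing_sketch(FakeModelArguments(block_diag_attn=True), is_trainable=True)
print(fake_library._get_unpad_data(None))  # now the replacement
```

Because the patch rebinds a module attribute, it affects every caller that looks the function up through that module afterwards, which is why it is guarded by both conditions.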

Outputs

| Name | Type | Description |
|---|---|---|
| seqlens | torch.Tensor | 1-D tensor of individual sequence lengths extracted from the packed attention mask |
| indices | torch.Tensor | Indices of non-masked tokens from the flattened batch, used for unpadding |
| cu_seqlens | torch.Tensor | Cumulative sequence lengths (int32), starting from 0, for FlashAttention varlen API |
| max_seqlen_in_batch | int | The length of the longest individual sequence in the batch |
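To show how these outputs relate, the sketch below uses cumulative sequence lengths to cut an unpadded token stream back into its individual sequences. It runs on plain Python lists with a hypothetical helper, `split_by_cu_seqlens`; it illustrates the meaning of cu_seqlens and is not how FlashAttention consumes it internally.

```python
def split_by_cu_seqlens(flat_tokens, cu_seqlens):
    """Slice an unpadded token stream at consecutive cumulative boundaries (hypothetical helper)."""
    return [
        flat_tokens[start:end]
        for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])
    ]


# Eleven real tokens from five packed sequences, with padding already removed.
tokens = ["a0", "a1", "b0", "b1", "b2", "c0", "d0", "d1", "e0", "e1", "e2"]
cu_seqlens = [0, 2, 5, 6, 8, 11]

segments = split_by_cu_seqlens(tokens, cu_seqlens)
print(segments)
print(max(len(segment) for segment in segments))  # longest individual sequence
```

Consecutive differences of cu_seqlens recover the individual lengths, and their maximum corresponds to max_seqlen_in_batch.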

Usage Examples

import torch

from llamafactory.model.model_utils.packing import (
    configure_packing,
    get_seqlens_in_batch,
    get_unpad_data,
)

# Packed attention mask for a batch of 2 rows. Within each row, every packed
# sequence is labeled with an incrementing integer; 0 marks padding.
attention_mask = torch.tensor([
    [1, 1, 2, 2, 2, 0],  # two sequences (lengths 2 and 3), then 1 padding token
    [1, 2, 2, 3, 3, 3],  # three sequences (lengths 1, 2, and 3)
])

# Individual sequence lengths, row by row
seqlens = get_seqlens_in_batch(attention_mask)
# Result: tensor([2, 3, 1, 2, 3])

# Unpadding data for FlashAttention's varlen functions
indices, cu_seqlens, max_seqlen = get_unpad_data(attention_mask)
# indices: tensor([0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11])  (flattened positions; 5 is the padding token)
# cu_seqlens: tensor([0, 2, 5, 6, 8, 11], dtype=torch.int32)
# max_seqlen: 3

# Enable packing (called automatically during model setup).
# model_args is a ModelArguments instance created elsewhere with block_diag_attn=True.
configure_packing(model_args, is_trainable=True)
