Implementation:Hiyouga LLaMA Factory Sequence Packing

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Training Efficiency, Attention Mechanisms
Last Updated	2026-02-06 19:00 GMT

Overview

Enables sequence packing with block diagonal attention masking, so that multiple short sequences can share a single row of a training batch without tokens from one sequence attending to tokens from another.
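
As an illustration of what "block diagonal" means here, the sketch below builds a dense causal mask for one packed row from its integer labels. This is only a visual aid under stated assumptions: the helper name block_diag_causal_mask is hypothetical, and the module described on this page does not materialize such a mask; it relies on FlashAttention's variable-length API instead.

```python
import torch


def block_diag_causal_mask(labels: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: True where query position i may attend to key position j."""
    same_seq = labels.unsqueeze(1) == labels.unsqueeze(0)  # both tokens carry the same label
    not_pad = (labels != 0).unsqueeze(1) & (labels != 0).unsqueeze(0)  # neither token is padding
    causal = torch.tril(torch.ones(labels.size(0), labels.size(0), dtype=torch.bool))
    return same_seq & not_pad & causal


# One packed row: sequence 1 (2 tokens), sequence 2 (3 tokens), 1 padding token.
labels = torch.tensor([1, 1, 2, 2, 2, 0])
mask = block_diag_causal_mask(labels)
print(mask.int())
```

The printed matrix has two lower-triangular blocks on the diagonal (2x2 and 3x3) and zeros everywhere else, including the padding row and column.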

Description

This module implements sequence packing support for FlashAttention's variable-length API. It provides three functions:

get_seqlens_in_batch extracts individual sequence lengths from packed attention masks in which each sequence is labeled with an incrementing integer (e.g., [1, 1, 2, 2, 2] represents two sequences of lengths 2 and 3).

get_unpad_data computes the unpadded token indices, the cumulative sequence lengths, and the maximum sequence length needed by FlashAttention's varlen functions.

configure_packing monkey-patches Hugging Face Transformers' internal _get_unpad_data function with the custom implementation, enabling block diagonal attention that prevents tokens from different packed sequences from attending to each other.

This approach is based on the functionary library's packing implementation.
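
The two tensor helpers described above can be approximated as follows. This is a minimal sketch written from the description on this page, not a copy of the module: the names seqlens_in_batch_sketch and unpad_data_sketch are hypothetical, and the sketch assumes the mask contains at least one non-padding token.

```python
import torch
import torch.nn.functional as F


def seqlens_in_batch_sketch(attention_mask: torch.Tensor) -> torch.Tensor:
    """Count the tokens carrying each label (1, 2, ...) in every row, dropping empty slots."""
    max_label = int(attention_mask.max().item())
    # counts[b, k] is the number of tokens in row b that carry label k + 1.
    counts = torch.stack(
        [(attention_mask == label).sum(dim=-1) for label in range(1, max_label + 1)],
        dim=-1,
    )
    counts = counts.flatten()  # row-major: all sequences of row 0, then row 1, ...
    return counts[counts > 0]


def unpad_data_sketch(attention_mask: torch.Tensor):
    """Return (indices, cu_seqlens, max_seqlen) in the shape a varlen attention call expects."""
    seqlens = seqlens_in_batch_sketch(attention_mask)
    # Positions of all non-padding tokens in the flattened batch.
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    # Prefix sums with a leading 0, stored as int32.
    cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
    return indices, cu_seqlens, int(seqlens.max().item())


packed_mask = torch.tensor([
    [1, 1, 1, 2, 0],
    [1, 2, 2, 2, 2],
])
sketch_seqlens = seqlens_in_batch_sketch(packed_mask)
sketch_indices, sketch_cu_seqlens, sketch_max_seqlen = unpad_data_sketch(packed_mask)
```

For this mask the sketch yields sequence lengths 3, 1, 1, 4, cumulative lengths 0, 3, 4, 5, 9, and a maximum length of 4.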

Usage

Use this module by setting block_diag_attn: true in the model arguments during training. It requires FlashAttention-2 to be installed and active. The data collator must produce attention masks with incrementing integer labels for packed sequences (rather than simple 0/1 masks). The patch is applied automatically during model configuration when packing is enabled.
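
To make the mask requirement concrete, the sketch below shows one way a collator could label a packed row. The helper label_packed_row and its cutoff_len parameter are hypothetical names used only for illustration; the actual collator in the repository may build the mask differently.

```python
def label_packed_row(lengths: list[int], cutoff_len: int) -> list[int]:
    """Hypothetical helper: label each packed sequence 1, 2, 3, ... and pad the row with 0."""
    total = sum(lengths)
    if total > cutoff_len:
        raise ValueError("packed sequences do not fit into the row")
    mask: list[int] = []
    for label, length in enumerate(lengths, start=1):
        mask.extend([label] * length)  # every token of a sequence carries that sequence's label
    mask.extend([0] * (cutoff_len - total))  # 0 marks padding
    return mask


row = label_packed_row([2, 3], cutoff_len=6)
print(row)  # [1, 1, 2, 2, 2, 0]
```

A plain 0/1 mask for the same row would be [1, 1, 1, 1, 1, 0], which no longer records where one sequence ends and the next begins.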

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/model/model_utils/packing.py
Lines: 1-117

Signature

def get_seqlens_in_batch(
    attention_mask: "torch.Tensor",
) -> "torch.Tensor":
    ...

def get_unpad_data(
    attention_mask: "torch.Tensor",
) -> tuple["torch.Tensor", "torch.Tensor", int]:
    ...

def configure_packing(
    model_args: "ModelArguments",
    is_trainable: bool,
) -> None:
    ...

Import

from llamafactory.model.model_utils.packing import get_seqlens_in_batch, get_unpad_data, configure_packing

I/O Contract

Inputs

Name	Type	Required	Description
attention_mask	torch.Tensor	Yes (for get_seqlens_in_batch, get_unpad_data)	Packed attention mask with shape [batch_size, seq_len] where values are incrementing integers per sequence (0 for padding)
model_args	ModelArguments	Yes (for configure_packing)	Model arguments; must have block_diag_attn set to True for packing to be enabled
is_trainable	bool	Yes (for configure_packing)	Whether the model is in training mode; packing is only configured during training
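
The gating behaviour of configure_packing described in the table can be sketched with stand-in objects. Everything here is an assumption for illustration: fake_flash_utils stands in for the Transformers module that owns _get_unpad_data, the two placeholder functions stand in for the real implementations, and configure_packing_sketch is not the module's code.

```python
from types import SimpleNamespace


def default_get_unpad_data(attention_mask):
    """Placeholder for the library's original helper."""
    return "default"


def packing_get_unpad_data(attention_mask):
    """Placeholder for the packing-aware replacement."""
    return "packing"


# Stand-in for the library module whose private helper gets replaced.
fake_flash_utils = SimpleNamespace(_get_unpad_data=default_get_unpad_data)


def configure_packing_sketch(model_args, is_trainable: bool, target) -> None:
    # Do nothing unless the model is being trained and block diagonal attention is requested.
    if not is_trainable or not model_args.block_diag_attn:
        return
    target._get_unpad_data = packing_get_unpad_data  # the monkey-patch itself


args = SimpleNamespace(block_diag_attn=True)

configure_packing_sketch(args, is_trainable=False, target=fake_flash_utils)
unchanged = fake_flash_utils._get_unpad_data is default_get_unpad_data

configure_packing_sketch(args, is_trainable=True, target=fake_flash_utils)
patched = fake_flash_utils._get_unpad_data is packing_get_unpad_data
```

Because the replacement happens on a module attribute, it affects every later lookup of that attribute in the process, which is why the patch is limited to the training case.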

Outputs

Name	Type	Description
seqlens	torch.Tensor	1-D tensor of individual sequence lengths extracted from the packed attention mask
indices	torch.Tensor	Indices of non-masked tokens from the flattened batch, used for unpadding
cu_seqlens	torch.Tensor	Cumulative sequence lengths (int32), starting from 0, for FlashAttention varlen API
max_seqlen_in_batch	int	The length of the longest individual sequence in the batch
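
The relationship between indices and cu_seqlens may be easier to see on concrete data. The sketch below hard-codes the values listed in the usage example on this page and uses made-up token IDs; it shows how consecutive pairs in cu_seqlens delimit each original sequence inside the unpadded token stream.

```python
import torch

# Made-up token IDs for a batch whose packed mask is
# [[1, 1, 2, 2, 2, 0], [1, 2, 2, 3, 3, 3]].
input_ids = torch.tensor([
    [11, 12, 21, 22, 23, 0],
    [31, 41, 42, 51, 52, 53],
])
indices = torch.tensor([0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11])
cu_seqlens = torch.tensor([0, 2, 5, 6, 8, 11], dtype=torch.int32)

# Drop the padding by gathering from the flattened batch.
unpadded = input_ids.flatten()[indices]

# Each (start, end) pair of neighbouring cu_seqlens entries is one sequence.
bounds = cu_seqlens.tolist()
segments = [unpadded[start:end].tolist() for start, end in zip(bounds[:-1], bounds[1:])]
print(segments)
```

The five resulting segments have lengths 2, 3, 1, 2, 3, and the longest of them matches max_seqlen_in_batch.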

Usage Examples

from llamafactory.model.model_utils.packing import get_seqlens_in_batch, get_unpad_data
import torch

# Example packed attention mask:
# Batch of 2, each with packed sequences labeled by incrementing integers
attention_mask = torch.tensor([
    [1, 1, 2, 2, 2, 0],  # Two sequences: length 2, length 3, padding 1
    [1, 2, 2, 3, 3, 3],  # Three sequences: length 1, length 2, length 3
])

# Get individual sequence lengths
seqlens = get_seqlens_in_batch(attention_mask)
# Result: tensor([2, 3, 1, 2, 3])

# Get unpadding data for FlashAttention
indices, cu_seqlens, max_seqlen = get_unpad_data(attention_mask)
# indices: tensor([0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11])
# cu_seqlens: tensor([0, 2, 5, 6, 8, 11], dtype=torch.int32)
# max_seqlen: 3

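# Enable packing (called automatically during model setup).
# model_args is assumed to be a ModelArguments instance created elsewhere,
# with block_diag_attn set to True; it is not defined in this snippet.
from llamafactory.model.model_utils.packing import configure_packing
configure_packing(model_args, is_trainable=True)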

Related Pages

Hiyouga_LLaMA_Factory_Attention_Config - Attention configuration; packing requires FlashAttention-2 for varlen support
Hiyouga_LLaMA_Factory_Model_Loader - Model loader that invokes configure_packing during model setup
Hiyouga_LLaMA_Factory_Gradient_Checkpointing - Complementary memory optimization used alongside sequence packing
