Implementation:Hiyouga LLaMA Factory Sequence Packing
| Knowledge Sources | |
|---|---|
| Domains | Training Efficiency, Attention Mechanisms |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Enables sequence packing with block-diagonal attention masking, so that multiple short sequences can be packed into a single training example without tokens attending across sequence boundaries.
Description
This module implements sequence packing support for FlashAttention's variable-length (varlen) API. It provides three functions:
- get_seqlens_in_batch extracts the individual sequence lengths from a packed attention mask in which each sequence is labeled with an incrementing integer (e.g., [1, 1, 2, 2, 2] represents two sequences of lengths 2 and 3).
- get_unpad_data computes the unpadded token indices, the cumulative sequence lengths, and the maximum sequence length needed by FlashAttention's varlen functions.
- configure_packing monkey-patches HuggingFace Transformers' internal _get_unpad_data function with the custom implementation. This yields block-diagonal attention, which prevents tokens from different packed sequences from attending to each other.
This approach is based on the functionary library's packing implementation.
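The length extraction and unpadding described above can be illustrated without PyTorch. The following is a minimal pure-Python sketch that works on nested lists. The names packed_seqlens and unpad_data_sketch are hypothetical and are not part of the module, which operates on tensors.

```python
from itertools import accumulate


def packed_seqlens(attention_mask):
    """Return per-sequence lengths from a packed mask given as nested lists.

    Each row labels its sequences 1, 2, 3, ... and uses 0 for padding.
    """
    seqlens = []
    for row in attention_mask:
        for label in range(1, max(row, default=0) + 1):
            count = row.count(label)
            if count > 0:  # skip labels that do not occur in this row
                seqlens.append(count)
    return seqlens


def unpad_data_sketch(attention_mask):
    """Return (indices, cu_seqlens, max_seqlen) for a packed mask."""
    seqlens = packed_seqlens(attention_mask)
    flat = [label for row in attention_mask for label in row]
    # Positions of non-padding tokens in the flattened mask.
    indices = [i for i, label in enumerate(flat) if label != 0]
    # Running total of sequence lengths, with a leading 0.
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens, default=0)


mask = [
    [1, 1, 2, 2, 2, 0],
    [1, 2, 2, 3, 3, 3],
]
print(packed_seqlens(mask))     # [2, 3, 1, 2, 3]
print(unpad_data_sketch(mask))
```

Each consecutive pair in cu_seqlens delimits one packed sequence within the flattened, unpadded token stream.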
Usage
Enable this module by setting block_diag_attn: true in the model arguments for a training run. FlashAttention-2 must be installed and active. The data collator must produce attention masks that label each packed sequence with an incrementing integer, rather than plain 0/1 masks. When packing is enabled, the patch is applied automatically during model configuration.
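To make the mask requirement concrete, here is a minimal sketch of how a collator might label one packed row. The helper build_packed_mask is hypothetical and is shown only to illustrate the expected mask format; it is not LLaMA Factory's collator.

```python
def build_packed_mask(seq_lengths, max_len):
    """Label each packed sequence 1, 2, 3, ... and pad the remainder with 0."""
    total = sum(seq_lengths)
    if total > max_len:
        raise ValueError(f"packed length {total} exceeds max_len {max_len}")
    mask = []
    for label, length in enumerate(seq_lengths, start=1):
        mask.extend([label] * length)
    mask.extend([0] * (max_len - total))  # 0 marks padding
    return mask


print(build_packed_mask([2, 3], max_len=6))     # [1, 1, 2, 2, 2, 0]
print(build_packed_mask([1, 2, 3], max_len=6))  # [1, 2, 2, 3, 3, 3]
```

A plain 0/1 mask would label every token in the row with 1, so the whole row would be treated as one sequence and the sequence boundaries would be lost.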
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/model/model_utils/packing.py
- Lines: 1-117
Signature
def get_seqlens_in_batch(
    attention_mask: "torch.Tensor",
) -> "torch.Tensor":
    ...

def get_unpad_data(
    attention_mask: "torch.Tensor",
) -> tuple["torch.Tensor", "torch.Tensor", int]:
    ...

def configure_packing(
    model_args: "ModelArguments",
    is_trainable: bool,
) -> None:
    ...
Import
from llamafactory.model.model_utils.packing import get_seqlens_in_batch, get_unpad_data, configure_packing
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| attention_mask | torch.Tensor | Yes (for get_seqlens_in_batch, get_unpad_data) | Packed attention mask with shape [batch_size, seq_len]; within each row, packed sequences are labeled with incrementing integers starting at 1, and 0 marks padding |
| model_args | ModelArguments | Yes (for configure_packing) | Model arguments; must have block_diag_attn set to True for packing to be enabled |
| is_trainable | bool | Yes (for configure_packing) | Whether the model is in training mode; packing is only configured during training |
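The gating implied by model_args.block_diag_attn and is_trainable can be sketched as follows. This is an illustration of the monkey-patching pattern only: fake_flash_utils stands in for the patched library module, and packed_get_unpad_data and configure_packing_sketch are hypothetical names. Unlike the real configure_packing, which returns None, the sketch returns a bool so the effect is easy to observe.

```python
from types import SimpleNamespace

# Stand-in for the library module that owns the private helper (hypothetical).
fake_flash_utils = SimpleNamespace(_get_unpad_data=lambda mask: "default")


def packed_get_unpad_data(mask):
    """Placeholder for the packing-aware replacement."""
    return "packed"


def configure_packing_sketch(block_diag_attn, is_trainable, target=fake_flash_utils):
    """Swap in the packing-aware helper only when training with packing on."""
    if not is_trainable or not block_diag_attn:
        return False  # leave the default helper in place
    target._get_unpad_data = packed_get_unpad_data
    return True


print(configure_packing_sketch(block_diag_attn=True, is_trainable=False))  # False
print(fake_flash_utils._get_unpad_data(None))                              # default
print(configure_packing_sketch(block_diag_attn=True, is_trainable=True))   # True
print(fake_flash_utils._get_unpad_data(None))                              # packed
```

Because the patch replaces a module-level attribute, it affects every later caller in the process that looks the helper up through that module.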
Outputs
| Name | Type | Description |
|---|---|---|
| seqlens | torch.Tensor | Returned by get_seqlens_in_batch: 1-D tensor of individual sequence lengths extracted from the packed attention mask |
| indices | torch.Tensor | Returned by get_unpad_data: positions of the non-padding tokens in the flattened mask, used for unpadding |
| cu_seqlens | torch.Tensor | Returned by get_unpad_data: cumulative sequence lengths (int32), starting from 0, for the FlashAttention varlen API |
| max_seqlen_in_batch | int | Returned by get_unpad_data: the length of the longest individual sequence in the batch |
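To show what cu_seqlens encodes, the sketch below expands it into an explicit block-diagonal mask over the unpadded tokens. The function block_diagonal_mask is hypothetical and purely illustrative: FlashAttention's varlen functions take the sequence boundaries from cu_seqlens rather than a dense mask like this one. Causal masking within each block is omitted for brevity.

```python
def block_diagonal_mask(cu_seqlens):
    """Return a square boolean matrix over the unpadded tokens.

    Entry [q][k] is True when tokens q and k lie in the same packed sequence.
    """
    total = cu_seqlens[-1]
    segment = [0] * total
    # Each consecutive pair (start, end) in cu_seqlens delimits one sequence.
    for seg_id, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        for pos in range(start, end):
            segment[pos] = seg_id
    return [[segment[q] == segment[k] for k in range(total)] for q in range(total)]


# Two packed sequences of lengths 2 and 3.
for row in block_diagonal_mask([0, 2, 5]):
    print("".join("x" if allowed else "." for allowed in row))
# xx...
# xx...
# ..xxx
# ..xxx
# ..xxx
```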
Usage Examples
import torch

from llamafactory.model.model_utils.packing import (
    configure_packing,
    get_seqlens_in_batch,
    get_unpad_data,
)

# Example packed attention mask: a batch of 2 rows, each labeling its
# packed sequences with incrementing integers (0 marks padding).
attention_mask = torch.tensor([
    [1, 1, 2, 2, 2, 0],  # Two sequences: length 2, length 3, then 1 padding token
    [1, 2, 2, 3, 3, 3],  # Three sequences: length 1, length 2, length 3
])

# Get individual sequence lengths
seqlens = get_seqlens_in_batch(attention_mask)
# Result: tensor([2, 3, 1, 2, 3])

# Get unpadding data for FlashAttention
indices, cu_seqlens, max_seqlen = get_unpad_data(attention_mask)
# indices: tensor([0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11])
# cu_seqlens: tensor([0, 2, 5, 6, 8, 11], dtype=torch.int32)
# max_seqlen: 3

# Enable packing (called automatically during model setup).
# model_args is an existing ModelArguments instance with block_diag_attn set
# to True; it is not constructed in this snippet.
configure_packing(model_args, is_trainable=True)
Related Pages
- Hiyouga_LLaMA_Factory_Attention_Config - Attention configuration; packing requires FlashAttention-2 for varlen support
- Hiyouga_LLaMA_Factory_Model_Loader - Model loader that invokes configure_packing during model setup
- Hiyouga_LLaMA_Factory_Gradient_Checkpointing - Complementary memory optimization used alongside sequence packing