Principle:OpenGVLab InternVL Packed Sequence Training
| Knowledge Sources | |
|---|---|
| Domains | Training, Optimization, Distributed_Computing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A training efficiency technique that packs multiple variable-length samples into fixed-length sequences using a greedy bin-packing algorithm to maximize GPU utilization.
Description
In standard training, batches are formed by padding all samples to the maximum sequence length, wasting computation on padding tokens. Packed sequence training instead concatenates multiple shorter samples into a single fixed-length sequence (the token budget), separated by attention boundaries.
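The padding waste described above can be quantified with a toy comparison. This is an illustrative sketch, not InternVL code: `padded_utilization` models standard batching (every batch padded to its longest sample), while `packed_utilization` models greedy concatenation into a fixed token budget (assuming no single sample exceeds the budget).

```python
# Illustration (not InternVL code): token utilization of padded
# batching vs. greedy packing, for the same set of sample lengths.

def padded_utilization(lengths, batch_size):
    """Fraction of non-padding tokens when each batch is padded to its max length."""
    useful = total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        useful += sum(batch)                 # real tokens
        total += max(batch) * len(batch)     # real + padding tokens
    return useful / total

def packed_utilization(lengths, budget):
    """Fraction of useful tokens when greedily packing samples into fixed budgets.

    Assumes every sample is shorter than the budget (no splitting needed)."""
    packs, current = [], 0
    for n in lengths:
        if current + n > budget:
            packs.append(current)            # close the full pack
            current = 0
        current += n
    packs.append(current)                    # final, possibly partial pack
    return sum(packs) / (len(packs) * budget)

lengths = [120, 3800, 240, 95, 2048, 510, 64, 1900]
print(f"padded: {padded_utilization(lengths, 4):.2f}")   # ~0.38
print(f"packed: {packed_utilization(lengths, 4096):.2f}")  # ~0.71
```

With this skewed length distribution, packing roughly doubles the fraction of compute spent on real tokens, which is the motivation for the technique.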
InternVL implements a greedy bin-packing algorithm that:
- Maintains a buffer of candidate samples from the dataset
- For each packed sequence, greedily selects samples that fit within the remaining token budget
- Splits samples that exceed the budget across multiple packed sequences
- Uses Flash Attention varlen (variable-length) mode to prevent cross-sample attention
This is particularly important for multimodal training where samples vary greatly in length (a simple caption vs. a complex multi-turn reasoning chain).
Usage
Use packed training for large-scale pretraining (Stages 1, 1.5, 2) where maximizing training throughput is critical. It is enabled by setting use_packed_ds=True in DataTrainingArguments.
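A hypothetical launch sketch follows. Only `use_packed_ds` is named by the source as a `DataTrainingArguments` field; the script path, launcher, and any other flags shown here are assumptions, based on the common Hugging Face `HfArgumentParser` pattern of exposing dataclass fields as CLI flags.

```shell
# Hypothetical sketch: script name and surrounding flags are assumptions;
# use_packed_ds is the documented DataTrainingArguments switch.
torchrun --nproc_per_node 8 internvl/train/internvl_chat_finetune.py \
    --use_packed_ds True \
    ...
```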
Theoretical Basis
The greedy bin-packing algorithm:
# Pseudo-code: greedy bin-packing for training sequences
def pack_sequences(samples, max_tokens, max_images):
    current_pack = []
    current_tokens = 0
    current_images = 0
    for sample in samples:
        sample_tokens = len(sample.input_ids)
        sample_images = count_images(sample)
        if (current_tokens + sample_tokens <= max_tokens and
                current_images + sample_images <= max_images):
            # Sample fits within both budgets: add it to the open pack
            current_pack.append(sample)
            current_tokens += sample_tokens
            current_images += sample_images
        elif sample_tokens > max_tokens:
            # Oversized sample: fill the open pack with a prefix, emit it
            parts = split_sample(sample, max_tokens - current_tokens)
            current_pack.append(parts[0])
            yield finalize(current_pack)
            # Continue with remaining parts...
        else:
            # Sample does not fit: emit the open pack and start a new one
            yield finalize(current_pack)
            current_pack = [sample]
            current_tokens = sample_tokens
            current_images = sample_images
    if current_pack:
        # Emit the final, possibly partial pack
        yield finalize(current_pack)
Key constraint: Packed training requires Flash Attention varlen to compute attention only within each sample's token span, preventing information leakage between packed samples.
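The boundary mechanism can be sketched in plain Python. Flash Attention's varlen interface (e.g. `flash_attn_varlen_func`) takes cumulative sequence lengths (`cu_seqlens`) rather than a padded batch; `block_diagonal_mask` below is an illustrative dense equivalent of the constraint it enforces, not InternVL code.

```python
# Illustrative sketch (not InternVL code): how per-sample spans inside a
# packed sequence are encoded, and what attention constraint they imply.

def make_cu_seqlens(sample_lengths):
    """Cumulative offsets delimiting each sample inside a packed sequence.

    Flash Attention varlen APIs expect this [0, n1, n1+n2, ...] layout."""
    cu = [0]
    for n in sample_lengths:
        cu.append(cu[-1] + n)
    return cu

def block_diagonal_mask(sample_lengths):
    """Dense equivalent of the varlen constraint: token i may attend to
    token j only when both fall inside the same sample's span."""
    cu = make_cu_seqlens(sample_lengths)
    total = cu[-1]
    mask = [[False] * total for _ in range(total)]
    for start, end in zip(cu, cu[1:]):
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask

cu = make_cu_seqlens([3, 2, 4])  # -> [0, 3, 5, 9]
```

The varlen kernel never materializes this block-diagonal mask; it simply restricts each query's key range to its own span, which is what prevents information leakage between packed samples.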