Principle:OpenGVLab InternVL Packed Sequence Training
| Knowledge Sources | |
|---|---|
| Domains | Training, Optimization, Distributed_Computing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A training efficiency technique that packs multiple variable-length samples into fixed-length sequences using a greedy bin-packing algorithm to maximize GPU utilization.
Description
In standard training, batches are formed by padding all samples to the maximum sequence length, wasting computation on padding tokens. Packed sequence training instead concatenates multiple shorter samples into a single fixed-length sequence (the token budget), separated by attention boundaries.
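The padding waste described above can be quantified with a toy comparison. This is an illustrative sketch, not InternVL code: `padded_utilization` models standard batching (every batch padded to its longest sample), while `packed_utilization` models greedy concatenation into a fixed token budget (assuming no single sample exceeds the budget).

```python
# Illustration (not InternVL code): token utilization of padded
# batching vs. greedy packing, for the same set of sample lengths.

def padded_utilization(lengths, batch_size):
    """Fraction of non-padding tokens when each batch is padded to its max length."""
    useful = total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        useful += sum(batch)                 # real tokens
        total += max(batch) * len(batch)     # real + padding tokens
    return useful / total

def packed_utilization(lengths, budget):
    """Fraction of useful tokens when greedily packing samples into fixed budgets.

    Assumes every sample is shorter than the budget (no splitting needed)."""
    packs, current = [], 0
    for n in lengths:
        if current + n > budget:
            packs.append(current)            # close the full pack
            current = 0
        current += n
    packs.append(current)                    # final, possibly partial pack
    return sum(packs) / (len(packs) * budget)

lengths = [120, 3800, 240, 95, 2048, 510, 64, 1900]
print(f"padded: {padded_utilization(lengths, 4):.2f}")   # ~0.38
print(f"packed: {packed_utilization(lengths, 4096):.2f}")  # ~0.71
```

With this skewed length distribution, packing roughly doubles the fraction of compute spent on real tokens, which is the motivation for the technique.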
InternVL implements a greedy bin-packing algorithm that:
- Maintains a buffer of candidate samples from the dataset
- For each packed sequence, greedily selects samples that fit within the remaining token budget
- Splits samples that exceed the budget across multiple packed sequences
- Uses Flash Attention varlen (variable-length) mode to prevent cross-sample attention
This is particularly important for multimodal training where samples vary greatly in length (a simple caption vs. a complex multi-turn reasoning chain).
Usage
Use packed training for large-scale pretraining (Stages 1, 1.5, 2) where maximizing training throughput is critical. It is enabled by setting use_packed_ds=True in DataTrainingArguments.
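A hypothetical launch sketch follows. Only `use_packed_ds` is named by the source as a `DataTrainingArguments` field; the script path, launcher, and any other flags shown here are assumptions, based on the common Hugging Face `HfArgumentParser` pattern of exposing dataclass fields as CLI flags.

```shell
# Hypothetical sketch: script name and surrounding flags are assumptions;
# use_packed_ds is the documented DataTrainingArguments switch.
torchrun --nproc_per_node 8 internvl/train/internvl_chat_finetune.py \
    --use_packed_ds True \
    ...
```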
Theoretical Basis
The greedy bin-packing algorithm:
# Pseudo-code: greedy bin-packing for training sequences
def pack_sequences(samples, max_tokens, max_images):
    current_pack = []
    current_tokens = 0
    current_images = 0
    for sample in samples:
        sample_tokens = len(sample.input_ids)
        sample_images = count_images(sample)
        if (current_tokens + sample_tokens <= max_tokens and
                current_images + sample_images <= max_images):
            # Sample fits within both budgets: add it to the open pack
            current_pack.append(sample)
            current_tokens += sample_tokens
            current_images += sample_images
        elif sample_tokens > max_tokens:
            # Oversized sample: fill the open pack with a prefix, emit it
            parts = split_sample(sample, max_tokens - current_tokens)
            current_pack.append(parts[0])
            yield finalize(current_pack)
            # Continue with remaining parts...
        else:
            # Sample does not fit: emit the open pack and start a new one
            yield finalize(current_pack)
            current_pack = [sample]
            current_tokens = sample_tokens
            current_images = sample_images
    if current_pack:
        # Emit the final, possibly partial pack
        yield finalize(current_pack)
Key constraint: Packed training requires Flash Attention varlen to compute attention only within each sample's token span, preventing information leakage between packed samples.
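The boundary mechanism can be sketched in plain Python. Flash Attention's varlen interface (e.g. `flash_attn_varlen_func`) takes cumulative sequence lengths (`cu_seqlens`) rather than a padded batch; `block_diagonal_mask` below is an illustrative dense equivalent of the constraint it enforces, not InternVL code.

```python
# Illustrative sketch (not InternVL code): how per-sample spans inside a
# packed sequence are encoded, and what attention constraint they imply.

def make_cu_seqlens(sample_lengths):
    """Cumulative offsets delimiting each sample inside a packed sequence.

    Flash Attention varlen APIs expect this [0, n1, n1+n2, ...] layout."""
    cu = [0]
    for n in sample_lengths:
        cu.append(cu[-1] + n)
    return cu

def block_diagonal_mask(sample_lengths):
    """Dense equivalent of the varlen constraint: token i may attend to
    token j only when both fall inside the same sample's span."""
    cu = make_cu_seqlens(sample_lengths)
    total = cu[-1]
    mask = [[False] * total for _ in range(total)]
    for start, end in zip(cu, cu[1:]):
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask

cu = make_cu_seqlens([3, 2, 4])  # -> [0, 3, 5, 9]
```

The varlen kernel never materializes this block-diagonal mask; it simply restricts each query's key range to its own span, which is what prevents information leakage between packed samples.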