
Principle:OpenGVLab InternVL Packed Sequence Training

From Leeroopedia


Knowledge Sources
Domains Training, Optimization, Distributed_Computing
Last Updated 2026-02-07 00:00 GMT

Overview

A training efficiency technique that packs multiple variable-length samples into fixed-length sequences using a greedy bin-packing algorithm to maximize GPU utilization.

Description

In standard training, batches are formed by padding all samples to the maximum sequence length, wasting computation on padding tokens. Packed sequence training instead concatenates multiple shorter samples into a single fixed-length sequence (the token budget), separated by attention boundaries.
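A toy comparison makes the waste concrete. The sample lengths below are made up for illustration; the point is that padded batching spends compute proportional to the longest sample, while packing spends it proportional to the token budget actually filled.

```python
# Illustrative comparison of compute utilization: padded batching vs. packing.
lengths = [128, 512, 64, 896, 256, 320]  # made-up variable-length samples

# Padded batching: every sample is padded to the longest one in the batch.
padded_total = len(lengths) * max(lengths)
padded_utilization = sum(lengths) / padded_total

# Packed batching: samples are concatenated greedily into 1024-token sequences.
budget = 1024
packs, current = [], 0
for n in lengths:
    if current + n > budget:   # sample does not fit: close the pack
        packs.append(current)
        current = 0
    current += n
packs.append(current)
packed_utilization = sum(lengths) / (len(packs) * budget)

print(f"padded utilization: {padded_utilization:.0%}")
print(f"packed utilization: {packed_utilization:.0%}")
```

With these lengths, packing fits the six samples into three 1024-token sequences, so a substantially larger fraction of each forward pass is spent on real tokens rather than padding.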

InternVL implements a greedy bin-packing algorithm that:

  1. Maintains a buffer of candidate samples from the dataset
  2. For each packed sequence, greedily selects samples that fit within the remaining token budget
  3. Splits samples that exceed the budget across multiple packed sequences
  4. Uses Flash Attention varlen (variable-length) mode to prevent cross-sample attention

This is particularly important for multimodal training where samples vary greatly in length (a simple caption vs. a complex multi-turn reasoning chain).

Usage

Use packed training for large-scale pretraining (Stages 1, 1.5, 2) where maximizing training throughput is critical. It is enabled by setting use_packed_ds=True in DataTrainingArguments.
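As a sketch of how the flag plugs in: only `use_packed_ds` is taken from this page; the surrounding dataclass and the other field names are illustrative, not InternVL's actual argument list.

```python
from dataclasses import dataclass

@dataclass
class DataTrainingArguments:
    """Illustrative HF-style data arguments (only use_packed_ds is from the source)."""
    use_packed_ds: bool = False        # enable greedy bin-packing of samples
    max_packed_tokens: int = 4096      # hypothetical token budget per packed sequence

# Enable packed training for a large-scale pretraining run
args = DataTrainingArguments(use_packed_ds=True)
```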

Theoretical Basis

The greedy bin-packing algorithm:

# Pseudo-code: Greedy bin-packing for training sequences
def pack_sequences(buffer, max_tokens, max_images):
    current_pack = []
    current_tokens = 0
    current_images = 0

    for sample in buffer:
        sample_tokens = len(sample.input_ids)
        sample_images = count_images(sample)

        if current_tokens + sample_tokens <= max_tokens and \
           current_images + sample_images <= max_images:
            # Sample fits: add it to the current pack
            current_pack.append(sample)
            current_tokens += sample_tokens
            current_images += sample_images
        elif sample_tokens > max_tokens:
            # Split oversized sample: fill the current pack with the first part
            parts = split_sample(sample, max_tokens - current_tokens)
            current_pack.append(parts[0])
            yield finalize(current_pack)
            current_pack, current_tokens, current_images = [], 0, 0
            # Continue with remaining parts...
        else:
            # Sample does not fit: emit the current pack and start a new one
            yield finalize(current_pack)
            current_pack = [sample]
            current_tokens = sample_tokens
            current_images = sample_images

    # Flush the final, partially filled pack
    if current_pack:
        yield finalize(current_pack)

Key constraint: Packed training requires Flash Attention varlen to compute attention only within each sample's token span, preventing information leakage between packed samples.
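The per-sample spans are passed to the varlen kernel as cumulative offsets (`cu_seqlens` in the flash-attn package) rather than a padding mask. A minimal sketch, with illustrative lengths and the actual kernel call omitted, of the offsets and the block-diagonal attention pattern they imply:

```python
# Sketch: cumulative-sequence-length offsets for variable-length attention.
# Three samples of (illustrative) lengths 5, 3, and 7 packed into one sequence.
sample_lengths = [5, 3, 7]

cu_seqlens = [0]
for n in sample_lengths:
    cu_seqlens.append(cu_seqlens[-1] + n)
# Sample i owns the token span cu_seqlens[i] : cu_seqlens[i+1]

# Equivalent block-diagonal attention pattern the varlen kernel enforces
# implicitly: a query may only attend to keys within its own sample's span.
total = cu_seqlens[-1]
allowed = [[False] * total for _ in range(total)]
for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
    for q in range(start, end):
        for k in range(start, end):
            allowed[q][k] = True
```

Materializing the mask is shown only for clarity; in practice the varlen kernel never builds it, which is what makes packing cheap.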

Related Pages

Implemented By

Uses Heuristic
