
Implementation:OpenGVLab InternVL PackedDataset

From Leeroopedia


Knowledge Sources
Domains Training, Optimization
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool, provided by the InternVL training framework, for packing multiple training samples into fixed-length sequences via greedy bin-packing.

Description

The PackedDataset class implements an IterableDataset that wraps multiple LazySupervisedDataset instances and packs their samples into fixed-length token sequences. It uses a greedy algorithm with a configurable token budget and image count limit per packed sample.

Key methods:

  • find_buffer: Searches the buffer for samples that fit the remaining capacity
  • split_buffer: Splits oversized samples across pack boundaries
  • update_buffer_list: Refills the buffer from underlying datasets
  • __iter__: Main packing loop yielding packed sequences
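The core packing idea behind these methods can be sketched as a simple greedy loop. This is a minimal illustration of greedy bin-packing under a token budget, not the actual implementation (which additionally tracks image counts, refills a buffer, and can split oversized samples across pack boundaries):

```python
# Simplified sketch of greedy packing by token count only.
# Assumption: samples are consumed in order; the real PackedDataset
# also searches its buffer for the best-fitting candidate.
def greedy_pack(sample_lengths, max_tokens):
    """Group sample lengths into packs whose totals stay within max_tokens."""
    packs, current, used = [], [], 0
    for n in sample_lengths:
        if used + n > max_tokens and current:
            packs.append(current)  # close the full pack and start a new one
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        packs.append(current)
    return packs
```

Note that a single sample longer than the budget still forms its own pack here; the real class handles that case via split_buffer (and allow_overflow for slight excesses).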

Usage

Used in pretraining (all stages) and optionally in fine-tuning when use_packed_ds=True. Instantiated by build_datasets when packed mode is enabled.

Code Reference

Source Location

  • Repository: InternVL
  • File: internvl_chat/internvl/train/dataset_packed.py
  • Lines: L46-545

Signature

class PackedDataset(torch.utils.data.IterableDataset):
    def __init__(
        self,
        tokenizer,
        data_rank,
        data_world_size,
        datasets: List,
        dataset_weight: List[int] = None,
        num_images_expected: int = 6,
        max_packed_tokens: int = 32768,
        max_buffer_size: int = 100,
        log_freq: int = 1000000,
        strict_mode: bool = False,
        debug_mode: bool = False,
        replacement: bool = True,
        allow_overflow: bool = True,
        allow_empty_data: bool = False,
        allow_deduplicated_ds_name: bool = False,
    ):
        """
        Args:
            tokenizer: Tokenizer for padding operations
            data_rank: Distributed rank for data sharding
            data_world_size: World size for data distribution
            datasets: List of LazySupervisedDataset instances to pack from
            dataset_weight: Sampling weights for each dataset
            num_images_expected: Maximum images per packed sample (default 6)
            max_packed_tokens: Token budget per packed sample (default 32768)
            max_buffer_size: Buffer size for candidate samples (default 100)
            strict_mode: Raise errors on data issues (default False)
            replacement: Sample datasets with replacement (default True)
            allow_overflow: Allow slightly exceeding token budget (default True)
        """

Import

from internvl.train.dataset_packed import PackedDataset

I/O Contract

Inputs

Name Type Required Description
tokenizer PreTrainedTokenizer Yes Tokenizer for pad token ID
data_rank int Yes Distributed rank for data sharding
data_world_size int Yes Total number of data workers
datasets List[LazySupervisedDataset] Yes Underlying datasets to pack
max_packed_tokens int No Token budget per packed sample (default 32768)
num_images_expected int No Max images per packed sample (default 6)

Outputs

Name Type Description
__iter__ yields Dict[str, torch.Tensor] Packed samples with concatenated input_ids, labels, pixel_values, image_flags, and cu_seqlens for Flash Attention varlen
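The cu_seqlens field is the cumulative-length prefix that Flash Attention's varlen kernels use to recover per-sample boundaries inside a packed sequence, so attention never crosses from one original sample into the next. A minimal sketch of its construction (illustrative helper, not the library's own code; the real tensor is an int32 torch.Tensor):

```python
# cu_seqlens: cumulative prefix of per-sample token counts in a packed batch.
# For lengths [4, 7, 5], sample i occupies tokens [cu[i], cu[i+1]).
def make_cu_seqlens(seq_lens):
    cu = [0]
    for n in seq_lens:
        cu.append(cu[-1] + n)
    return cu
```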

Usage Examples

Create Packed Dataset for Pretraining

import torch.distributed

from internvl.train.dataset_packed import PackedDataset

# Assumes torch.distributed.init_process_group(...) has already been called
packed_ds = PackedDataset(
    tokenizer=tokenizer,
    data_rank=torch.distributed.get_rank(),
    data_world_size=torch.distributed.get_world_size(),
    datasets=[ds1, ds2, ds3],   # LazySupervisedDataset instances
    dataset_weight=[1, 2, 1],   # sample ds2 twice as often as ds1 or ds3
    num_images_expected=48,     # max images per packed sample
    max_packed_tokens=16384,    # token budget per packed sample
    max_buffer_size=100,
)
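The dataset_weight argument biases which underlying dataset supplies the next sample when drawing with replacement. A minimal illustration of weighted selection (pick_dataset is a hypothetical helper sketching the sampling scheme, not the class's internal code):

```python
import random

# Assumption: with replacement=True, the next sample's source dataset is
# drawn in proportion to dataset_weight.
def pick_dataset(weights, rng=None):
    """Return a source-dataset index chosen in proportion to its weight."""
    rng = rng or random.Random(0)  # seeded here only for reproducibility
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```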

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
