
Implementation:OpenGVLab InternVL PackedDataset

From Leeroopedia


Knowledge Sources
Domains Training, Optimization
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool, provided by the InternVL training framework, for packing multiple training samples into fixed-length sequences via greedy bin-packing.

Description

The PackedDataset class implements an IterableDataset that wraps multiple LazySupervisedDataset instances and packs their samples into fixed-length token sequences. It uses a greedy algorithm with a configurable token budget and image count limit per packed sample.

Key methods:

  • find_buffer: Searches the buffer for samples that fit the remaining capacity
  • split_buffer: Splits oversized samples across pack boundaries
  • update_buffer_list: Refills the buffer from underlying datasets
  • __iter__: Main packing loop yielding packed sequences
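The core packing idea behind these methods can be sketched as a simple greedy loop. This is a minimal illustration of greedy bin-packing under a token budget, not the actual implementation (which additionally tracks image counts, refills a buffer, and can split oversized samples across pack boundaries):

```python
# Simplified sketch of greedy packing by token count only.
# Assumption: samples are consumed in order; the real PackedDataset
# also searches its buffer for the best-fitting candidate.
def greedy_pack(sample_lengths, max_tokens):
    """Group sample lengths into packs whose totals stay within max_tokens."""
    packs, current, used = [], [], 0
    for n in sample_lengths:
        if used + n > max_tokens and current:
            packs.append(current)  # close the full pack and start a new one
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        packs.append(current)
    return packs
```

Note that a single sample longer than the budget still forms its own pack here; the real class handles that case via split_buffer (and allow_overflow for slight excesses).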

Usage

Used in pretraining (all stages) and optionally in fine-tuning when use_packed_ds=True. Instantiated by build_datasets when packed mode is enabled.

Code Reference

Source Location

  • Repository: InternVL
  • File: internvl_chat/internvl/train/dataset_packed.py
  • Lines: L46-545

Signature

class PackedDataset(torch.utils.data.IterableDataset):
    def __init__(
        self,
        tokenizer,
        data_rank,
        data_world_size,
        datasets: List,
        dataset_weight: List[int] = None,
        num_images_expected: int = 6,
        max_packed_tokens: int = 32768,
        max_buffer_size: int = 100,
        log_freq: int = 1000000,
        strict_mode: bool = False,
        debug_mode: bool = False,
        replacement: bool = True,
        allow_overflow: bool = True,
        allow_empty_data: bool = False,
        allow_deduplicated_ds_name: bool = False,
    ):
        """
        Args:
            tokenizer: Tokenizer for padding operations
            data_rank: Distributed rank for data sharding
            data_world_size: World size for data distribution
            datasets: List of LazySupervisedDataset instances to pack from
            dataset_weight: Sampling weights for each dataset
            num_images_expected: Maximum images per packed sample (default 6)
            max_packed_tokens: Token budget per packed sample (default 32768)
            max_buffer_size: Buffer size for candidate samples (default 100)
            strict_mode: Raise errors on data issues (default False)
            replacement: Sample datasets with replacement (default True)
            allow_overflow: Allow slightly exceeding token budget (default True)
        """

Import

from internvl.train.dataset_packed import PackedDataset

I/O Contract

Inputs

Name Type Required Description
tokenizer PreTrainedTokenizer Yes Tokenizer for pad token ID
data_rank int Yes Distributed rank for data sharding
data_world_size int Yes Total number of data workers
datasets List[LazySupervisedDataset] Yes Underlying datasets to pack
max_packed_tokens int No Token budget per packed sample (default 32768)
num_images_expected int No Max images per packed sample (default 6)

Outputs

Name Type Description
__iter__ yields Dict[str, torch.Tensor] Packed samples with concatenated input_ids, labels, pixel_values, image_flags, and cu_seqlens for Flash Attention varlen
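The cu_seqlens field is the cumulative-length prefix that Flash Attention's varlen kernels use to recover per-sample boundaries inside a packed sequence, so attention never crosses from one original sample into the next. A minimal sketch of its construction (illustrative helper, not the library's own code; the real tensor is an int32 torch.Tensor):

```python
# cu_seqlens: cumulative prefix of per-sample token counts in a packed batch.
# For lengths [4, 7, 5], sample i occupies tokens [cu[i], cu[i+1]).
def make_cu_seqlens(seq_lens):
    cu = [0]
    for n in seq_lens:
        cu.append(cu[-1] + n)
    return cu
```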

Usage Examples

Create Packed Dataset for Pretraining

import torch.distributed

from internvl.train.dataset_packed import PackedDataset

# Assumes torch.distributed.init_process_group(...) has already been called
packed_ds = PackedDataset(
    tokenizer=tokenizer,
    data_rank=torch.distributed.get_rank(),
    data_world_size=torch.distributed.get_world_size(),
    datasets=[ds1, ds2, ds3],   # LazySupervisedDataset instances
    dataset_weight=[1, 2, 1],   # sample ds2 twice as often as ds1 or ds3
    num_images_expected=48,     # max images per packed sample
    max_packed_tokens=16384,    # token budget per packed sample
    max_buffer_size=100,
)
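The dataset_weight argument biases which underlying dataset supplies the next sample when drawing with replacement. A minimal illustration of weighted selection (pick_dataset is a hypothetical helper sketching the sampling scheme, not the class's internal code):

```python
import random

# Assumption: with replacement=True, the next sample's source dataset is
# drawn in proportion to dataset_weight.
def pick_dataset(weights, rng=None):
    """Return a source-dataset index chosen in proportion to its weight."""
    rng = rng or random.Random(0)  # seeded here only for reproducibility
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```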

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
