# Implementation: OpenGVLab InternVL PackedDataset
| Knowledge Sources | |
|---|---|
| Domains | Training, Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
A concrete tool, provided by the InternVL training framework, for packing multiple training samples into fixed-length token sequences via greedy bin-packing.
## Description
The PackedDataset class implements an IterableDataset that wraps multiple LazySupervisedDataset instances and packs their samples into fixed-length token sequences. It uses a greedy algorithm with a configurable token budget and a per-sample image count limit.

Key methods:
- `find_buffer`: Searches the buffer for samples that fit the remaining capacity
- `split_buffer`: Splits oversized samples across pack boundaries
- `update_buffer_list`: Refills the buffer from the underlying datasets
- `__iter__`: Main packing loop that yields packed sequences
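The greedy packing idea can be illustrated with a standalone sketch. The function below is a simplified assumption for illustration only, not the actual InternVL implementation, which additionally handles buffering, distributed sharding, and sample splitting:

```python
# Minimal sketch of greedy packing under two constraints: a token budget
# and an image count limit per packed sample (simplified assumption).
def greedy_pack(samples, max_tokens, max_images):
    """Pack (num_tokens, num_images) samples into bins greedily.

    A sample joins the current bin if it fits both the token budget and
    the image limit; otherwise the bin is closed and a new one started.
    """
    bins, current, tok, img = [], [], 0, 0
    for n_tok, n_img in samples:
        if current and (tok + n_tok > max_tokens or img + n_img > max_images):
            bins.append(current)
            current, tok, img = [], 0, 0
        current.append((n_tok, n_img))
        tok += n_tok
        img += n_img
    if current:
        bins.append(current)
    return bins

packed = greedy_pack([(100, 1), (200, 2), (300, 4), (50, 0)],
                     max_tokens=512, max_images=6)
```

Here the third sample would overflow the 512-token budget of the first bin, so a second bin is started; the fourth sample then fits alongside it.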
## Usage
Used in pretraining (all stages) and optionally in fine-tuning when `use_packed_ds=True`. Instantiated by `build_datasets` when packed mode is enabled.
## Code Reference
### Source Location
- Repository: InternVL
- File: `internvl_chat/internvl/train/dataset_packed.py`
- Lines: L46-545
### Signature
```python
class PackedDataset(torch.utils.data.IterableDataset):
    def __init__(
        self,
        tokenizer,
        data_rank,
        data_world_size,
        datasets: List,
        dataset_weight: List[int] = None,
        num_images_expected: int = 6,
        max_packed_tokens: int = 32768,
        max_buffer_size: int = 100,
        log_freq: int = 1000000,
        strict_mode: bool = False,
        debug_mode: bool = False,
        replacement: bool = True,
        allow_overflow: bool = True,
        allow_empty_data: bool = False,
        allow_deduplicated_ds_name: bool = False,
    ):
        """
        Args:
            tokenizer: Tokenizer for padding operations
            data_rank: Distributed rank for data sharding
            data_world_size: World size for data distribution
            datasets: List of LazySupervisedDataset instances to pack from
            dataset_weight: Sampling weights for each dataset
            num_images_expected: Maximum images per packed sample (default 6)
            max_packed_tokens: Token budget per packed sample (default 32768)
            max_buffer_size: Buffer size for candidate samples (default 100)
            strict_mode: Raise errors on data issues (default False)
            replacement: Sample datasets with replacement (default True)
            allow_overflow: Allow slightly exceeding token budget (default True)
        """
```
### Import
```python
from internvl.train.dataset_packed import PackedDataset
```
## I/O Contract
### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer | PreTrainedTokenizer | Yes | Tokenizer for pad token ID |
| data_rank | int | Yes | Distributed rank for data sharding |
| data_world_size | int | Yes | Total number of data workers |
| datasets | List[LazySupervisedDataset] | Yes | Underlying datasets to pack |
| max_packed_tokens | int | No | Token budget per packed sample (default 32768) |
| num_images_expected | int | No | Max images per packed sample (default 6) |
### Outputs
| Name | Type | Description |
|---|---|---|
| __iter__ yields | Dict[str, torch.Tensor] | Packed samples with concatenated input_ids, labels, pixel_values, image_flags, and cu_seqlens for Flash Attention varlen |
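The cu_seqlens field marks the cumulative boundary of each sub-sequence inside a pack; Flash Attention's varlen kernels consume these offsets so attention never crosses sample borders. A minimal, framework-free sketch of how such boundaries are derived from per-sample token lengths (the helper name is hypothetical):

```python
# Hypothetical helper: build cumulative sequence boundaries from the
# token lengths of the samples packed into one sequence.
def make_cu_seqlens(lengths):
    cu = [0]
    for n in lengths:
        cu.append(cu[-1] + n)  # each entry is the start offset of the next sample
    return cu

cu = make_cu_seqlens([5, 3, 7])  # three samples of 5, 3, and 7 tokens
```

For a pack of samples with 5, 3, and 7 tokens this yields boundaries at 0, 5, 8, and 15.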
## Usage Examples
### Create a Packed Dataset for Pretraining
```python
import torch.distributed

from internvl.train.dataset_packed import PackedDataset

# tokenizer and ds1/ds2/ds3 (LazySupervisedDataset instances) are assumed
# to have been constructed beforehand.
packed_ds = PackedDataset(
    tokenizer=tokenizer,
    data_rank=torch.distributed.get_rank(),
    data_world_size=torch.distributed.get_world_size(),
    datasets=[ds1, ds2, ds3],
    dataset_weight=[1, 2, 1],
    num_images_expected=48,
    max_packed_tokens=16384,
    max_buffer_size=100,
)
```
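When a candidate sample exceeds the remaining capacity of the current pack, `split_buffer` cuts it at the pack boundary. A simplified standalone illustration of that idea (the function below is a sketch; the real method also splits labels, pixel values, and image flags consistently):

```python
# Simplified sketch of boundary splitting: the head fills the remaining
# capacity of the current pack, the tail is returned to the buffer.
def split_at_capacity(token_ids, remaining_capacity):
    return token_ids[:remaining_capacity], token_ids[remaining_capacity:]

head, tail = split_at_capacity(list(range(10)), 6)
```

The head of 6 tokens completes the current pack; the 4-token tail becomes a new buffer entry for a later pack.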
## Related Pages
- Implements Principle
- Requires Environment
- Uses Heuristic