
Implementation:OpenGVLab InternVL LazySupervisedDataset

From Leeroopedia


Knowledge Sources
Domains Vision_Language, Data_Engineering, Preprocessing
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete component of the InternVL training framework for lazily loading and preprocessing multimodal training data.

Description

The LazySupervisedDataset class is the core data loading component of InternVL's training pipeline. It wraps JSONL annotation files and loads image/video data on-demand during training. Each sample is processed through the conversation template system, dynamic resolution tiling, and tokenization to produce training-ready tensors.

Key capabilities:

  • Lazy loading of images from local disk or S3 (via TCSLoader)
  • Dynamic image tiling with configurable patch counts (1-12+ tiles)
  • Video frame sampling with configurable frame counts
  • Multi-turn conversation formatting using LLM-specific templates
  • Support for packed dataset mode (distributed data sharding)
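The dynamic tiling capability above can be illustrated with a small sketch. This is a simplified illustration of the idea, not the exact InternVL implementation: the loader picks a (cols, rows) tile grid whose aspect ratio is closest to the input image, with the total tile count bounded by the min/max patch settings. The function name and tie-breaking rule here are assumptions for illustration.

```python
def choose_tile_grid(width, height, min_patch=1, max_patch=12):
    """Sketch of dynamic-resolution tiling (hypothetical helper, not
    InternVL's actual function): pick the (cols, rows) grid whose
    aspect ratio is closest to the input image, keeping the total
    tile count between min_patch and max_patch."""
    aspect = width / height
    candidates = [
        (c, r)
        for c in range(1, max_patch + 1)
        for r in range(1, max_patch + 1)
        if min_patch <= c * r <= max_patch
    ]
    # Closest grid aspect ratio wins; ties broken toward more tiles
    # (a simplification of the real selection rule).
    return min(candidates,
               key=lambda cr: (abs(cr[0] / cr[1] - aspect), -(cr[0] * cr[1])))

print(choose_tile_grid(1600, 800))  # wide 2:1 image -> (4, 2)
```

With use_thumbnail=True, a downscaled copy of the full image is appended as one extra tile on top of the grid chosen here.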

Usage

Import this class when building a training dataset for InternVL fine-tuning or pretraining. It is instantiated once per dataset in the mixture, then combined via ConcatDataset or PackedDataset.
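The per-dataset instances are merged into one flat index space. A minimal pure-Python sketch of the ConcatDataset-style indexing involved (illustrative only; real training code uses torch.utils.data.ConcatDataset or InternVL's PackedDataset):

```python
import bisect

class TinyConcatDataset:
    """Minimal sketch of ConcatDataset-style indexing: samples from
    several datasets are exposed through one flat index space."""
    def __init__(self, datasets):
        self.datasets = datasets
        # Cumulative sizes, e.g. [3, 5] for datasets of length 3 and 2.
        self.cumulative = []
        total = 0
        for ds in datasets:
            total += len(ds)
            self.cumulative.append(total)

    def __len__(self):
        return self.cumulative[-1]

    def __getitem__(self, idx):
        # Find which underlying dataset the flat index falls into.
        ds_idx = bisect.bisect_right(self.cumulative, idx)
        prev = self.cumulative[ds_idx - 1] if ds_idx > 0 else 0
        return self.datasets[ds_idx][idx - prev]

mixture = TinyConcatDataset([["a0", "a1", "a2"], ["b0", "b1"]])
print(len(mixture))  # 5
print(mixture[3])    # b0
```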

Code Reference

Source Location

  • Repository: InternVL
  • File: internvl_chat/internvl/train/dataset.py
  • Lines: L269-743

Signature

class LazySupervisedDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        template_name,
        meta,
        tokenizer,
        tcs_loader,
        ds_name,
        num_image_token,
        image_size=448,
        is_train=True,
        pad2square=False,
        group_by_length=False,
        dynamic_image_size=False,
        use_thumbnail=False,
        min_dynamic_patch=1,
        max_dynamic_patch=12,
        min_num_frame=8,
        max_num_frame=32,
        sampling_method='rand',
        repeat_time=1,
        normalize_type='imagenet',
        use_packed_ds=False,
        data_rank=0,
        data_world_size=1,
        distributed_mode=False,
        force_shuffle=False,
        random_seed=0,
    ):
        """
        Args:
            template_name: Conversation template name (e.g. 'internvl2_5', 'internlm2-chat')
            meta: Dict with 'annotation' (JSONL path) and 'root' (image directory) keys
            tokenizer: HuggingFace tokenizer instance
            tcs_loader: TCSLoader for S3/local image loading
            ds_name: Dataset identifier string
            num_image_token: Number of visual tokens per image tile
            image_size: Pixel size for image processing (default 448)
            is_train: Training mode flag
            pad2square: Pad images to square before processing
            group_by_length: Enable length-based grouping for efficient batching
            dynamic_image_size: Enable dynamic resolution tiling
            use_thumbnail: Include global thumbnail tile
            min_dynamic_patch: Minimum number of tiles (default 1)
            max_dynamic_patch: Maximum number of tiles (default 12)
            min_num_frame: Minimum video frames to sample (default 8)
            max_num_frame: Maximum video frames to sample (default 32)
            sampling_method: Video frame sampling strategy ('rand')
            repeat_time: Dataset repetition factor
            normalize_type: Image normalization type ('imagenet', 'clip', 'siglip')
            use_packed_ds: Enable packed dataset sharding mode
            data_rank: Distributed rank for data sharding
            data_world_size: World size for data sharding
            distributed_mode: Enable distributed data loading
            force_shuffle: Force data shuffling
            random_seed: Random seed for reproducibility
        """

Import

from internvl.train.dataset import LazySupervisedDataset

I/O Contract

Inputs

Name Type Required Description
template_name str Yes Conversation template matching the LLM backend
meta dict Yes Dataset metadata with 'annotation' and 'root' keys
tokenizer PreTrainedTokenizer Yes HuggingFace tokenizer for text encoding
tcs_loader TCSLoader Yes Image/video file loader (S3 or local)
ds_name str Yes Dataset name identifier
num_image_token int Yes Visual tokens per image tile (derived from model config)
dynamic_image_size bool No Enable dynamic multi-tile image processing
max_dynamic_patch int No Maximum tiles per image (default 12)

Outputs

Name Type Description
__getitem__ returns Dict[str, torch.Tensor] Per-sample dict with keys: input_ids, labels, attention_mask, pixel_values, image_flags
input_ids torch.LongTensor Tokenized input sequence with image tokens
labels torch.LongTensor Training labels (-100 for masked positions)
pixel_values torch.FloatTensor Preprocessed image tiles [N_tiles, 3, H, W]
image_flags torch.LongTensor Flags indicating real (1) vs padding (0) image tiles
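The -100 convention in labels marks positions excluded from the loss. A sketch of how such labels are typically derived from input_ids (illustrative; the helper name and prompt-length bookkeeping are assumptions, not InternVL's exact code):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_labels(input_ids, prompt_len):
    """Sketch of label construction: copy the token ids, then mask the
    prompt portion with IGNORE_INDEX so only the assistant response
    contributes to the loss. Hypothetical helper for illustration."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

ids = [101, 7592, 2088, 999, 102]   # toy token ids
print(build_labels(ids, 3))         # [-100, -100, -100, 999, 102]
```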

Usage Examples

Basic Dataset Creation

from internvl.train.dataset import LazySupervisedDataset

# Define dataset metadata
meta = {
    'annotation': '/path/to/train.jsonl',
    'root': '/path/to/images/',
    'data_augment': False,
    'repeat_time': 1,
}

# Create dataset with dynamic resolution
dataset = LazySupervisedDataset(
    template_name='internvl2_5',
    meta=meta,
    tokenizer=tokenizer,
    tcs_loader=tcs_loader,
    ds_name='custom_dataset',
    num_image_token=256,
    image_size=448,
    dynamic_image_size=True,
    use_thumbnail=True,
    max_dynamic_patch=12,
)

# Access a single sample
sample = dataset[0]
print(sample['input_ids'].shape)      # [seq_len]
print(sample['pixel_values'].shape)   # [num_tiles, 3, 448, 448]

JSONL Annotation Format

# Each line in the JSONL annotation file:
{
    "image": "relative/path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe this image."},
        {"from": "gpt", "value": "The image shows..."}
    ]
}

# For video data:
{
    "video": "relative/path/to/video.mp4",
    "conversations": [
        {"from": "human", "value": "<video>\nWhat is happening?"},
        {"from": "gpt", "value": "In the video..."}
    ]
}

# For multi-image:
{
    "image": ["img1.jpg", "img2.jpg", "img3.jpg"],
    "conversations": [
        {"from": "human", "value": "<image>\n<image>\n<image>\nCompare these images."},
        {"from": "gpt", "value": "The images show..."}
    ]
}
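A short snippet for producing a file in this format with the standard json module, one object per line (the record contents here are placeholders; the output path is whatever meta['annotation'] should point to):

```python
import json

# Placeholder records matching the single-image format above.
records = [
    {
        "image": "relative/path/to/image.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe this image."},
            {"from": "gpt", "value": "The image shows..."},
        ],
    },
]

# One JSON object per line -- this is the JSONL file that
# meta['annotation'] points to.
with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Round-trip check: each line parses back to the original dict.
with open("train.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["image"])  # relative/path/to/image.jpg
```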

Related Pages

Implements Principle

Requires Environment
