
Implementation:OpenGVLab InternVL LazySupervisedDataset

From Leeroopedia


Knowledge Sources
Domains Vision_Language, Data_Engineering, Preprocessing
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete component of the InternVL training framework for lazily loading and preprocessing multimodal training data.

Description

The LazySupervisedDataset class is the core data loading component of InternVL's training pipeline. It wraps JSONL annotation files and loads image/video data on-demand during training. Each sample is processed through the conversation template system, dynamic resolution tiling, and tokenization to produce training-ready tensors.

Key capabilities:

  • Lazy loading of images from local disk or S3 (via TCSLoader)
  • Dynamic image tiling with configurable patch counts (1-12+ tiles)
  • Video frame sampling with configurable frame counts
  • Multi-turn conversation formatting using LLM-specific templates
  • Support for packed dataset mode (distributed data sharding)
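The dynamic tiling capability above can be illustrated with a small sketch. This is a simplified illustration of the idea, not the exact InternVL implementation: the loader picks a (cols, rows) tile grid whose aspect ratio is closest to the input image, with the total tile count bounded by the min/max patch settings. The function name and tie-breaking rule here are assumptions for illustration.

```python
def choose_tile_grid(width, height, min_patch=1, max_patch=12):
    """Sketch of dynamic-resolution tiling (hypothetical helper, not
    InternVL's actual function): pick the (cols, rows) grid whose
    aspect ratio is closest to the input image, keeping the total
    tile count between min_patch and max_patch."""
    aspect = width / height
    candidates = [
        (c, r)
        for c in range(1, max_patch + 1)
        for r in range(1, max_patch + 1)
        if min_patch <= c * r <= max_patch
    ]
    # Closest grid aspect ratio wins; ties broken toward more tiles
    # (a simplification of the real selection rule).
    return min(candidates,
               key=lambda cr: (abs(cr[0] / cr[1] - aspect), -(cr[0] * cr[1])))

print(choose_tile_grid(1600, 800))  # wide 2:1 image -> (4, 2)
```

With use_thumbnail=True, a downscaled copy of the full image is appended as one extra tile on top of the grid chosen here.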

Usage

Import this class when building a training dataset for InternVL fine-tuning or pretraining. It is instantiated once per dataset in the mixture, then combined via ConcatDataset or PackedDataset.
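The per-dataset instances are merged into one flat index space. A minimal pure-Python sketch of the ConcatDataset-style indexing involved (illustrative only; real training code uses torch.utils.data.ConcatDataset or InternVL's PackedDataset):

```python
import bisect

class TinyConcatDataset:
    """Minimal sketch of ConcatDataset-style indexing: samples from
    several datasets are exposed through one flat index space."""
    def __init__(self, datasets):
        self.datasets = datasets
        # Cumulative sizes, e.g. [3, 5] for datasets of length 3 and 2.
        self.cumulative = []
        total = 0
        for ds in datasets:
            total += len(ds)
            self.cumulative.append(total)

    def __len__(self):
        return self.cumulative[-1]

    def __getitem__(self, idx):
        # Find which underlying dataset the flat index falls into.
        ds_idx = bisect.bisect_right(self.cumulative, idx)
        prev = self.cumulative[ds_idx - 1] if ds_idx > 0 else 0
        return self.datasets[ds_idx][idx - prev]

mixture = TinyConcatDataset([["a0", "a1", "a2"], ["b0", "b1"]])
print(len(mixture))  # 5
print(mixture[3])    # b0
```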

Code Reference

Source Location

  • Repository: InternVL
  • File: internvl_chat/internvl/train/dataset.py
  • Lines: L269-743

Signature

class LazySupervisedDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        template_name,
        meta,
        tokenizer,
        tcs_loader,
        ds_name,
        num_image_token,
        image_size=448,
        is_train=True,
        pad2square=False,
        group_by_length=False,
        dynamic_image_size=False,
        use_thumbnail=False,
        min_dynamic_patch=1,
        max_dynamic_patch=12,
        min_num_frame=8,
        max_num_frame=32,
        sampling_method='rand',
        repeat_time=1,
        normalize_type='imagenet',
        use_packed_ds=False,
        data_rank=0,
        data_world_size=1,
        distributed_mode=False,
        force_shuffle=False,
        random_seed=0,
    ):
        """
        Args:
            template_name: Conversation template name (e.g. 'internvl2_5', 'internlm2-chat')
            meta: Dict with 'annotation' (JSONL path) and 'root' (image directory) keys
            tokenizer: HuggingFace tokenizer instance
            tcs_loader: TCSLoader for S3/local image loading
            ds_name: Dataset identifier string
            num_image_token: Number of visual tokens per image tile
            image_size: Pixel size for image processing (default 448)
            is_train: Training mode flag
            pad2square: Pad images to square before processing
            group_by_length: Enable length-based grouping for efficient batching
            dynamic_image_size: Enable dynamic resolution tiling
            use_thumbnail: Include global thumbnail tile
            min_dynamic_patch: Minimum number of tiles (default 1)
            max_dynamic_patch: Maximum number of tiles (default 12)
            min_num_frame: Minimum video frames to sample (default 8)
            max_num_frame: Maximum video frames to sample (default 32)
            sampling_method: Video frame sampling strategy ('rand')
            repeat_time: Dataset repetition factor
            normalize_type: Image normalization type ('imagenet', 'clip', 'siglip')
            use_packed_ds: Enable packed dataset sharding mode
            data_rank: Distributed rank for data sharding
            data_world_size: World size for data sharding
            distributed_mode: Enable distributed data loading
            force_shuffle: Force data shuffling
            random_seed: Random seed for reproducibility
        """

Import

from internvl.train.dataset import LazySupervisedDataset

I/O Contract

Inputs

Name Type Required Description
template_name str Yes Conversation template matching the LLM backend
meta dict Yes Dataset metadata with 'annotation' and 'root' keys
tokenizer PreTrainedTokenizer Yes HuggingFace tokenizer for text encoding
tcs_loader TCSLoader Yes Image/video file loader (S3 or local)
ds_name str Yes Dataset name identifier
num_image_token int Yes Visual tokens per image tile (derived from model config)
dynamic_image_size bool No Enable dynamic multi-tile image processing
max_dynamic_patch int No Maximum tiles per image (default 12)

Outputs

Name Type Description
__getitem__ returns Dict[str, torch.Tensor] Per-sample dict with keys: input_ids, labels, attention_mask, pixel_values, image_flags
input_ids torch.LongTensor Tokenized input sequence with image tokens
labels torch.LongTensor Training labels (-100 for masked positions)
pixel_values torch.FloatTensor Preprocessed image tiles [N_tiles, 3, H, W]
image_flags torch.LongTensor Flags indicating real (1) vs padding (0) image tiles
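The -100 convention in labels marks positions excluded from the loss. A sketch of how such labels are typically derived from input_ids (illustrative; the helper name and prompt-length bookkeeping are assumptions, not InternVL's exact code):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_labels(input_ids, prompt_len):
    """Sketch of label construction: copy the token ids, then mask the
    prompt portion with IGNORE_INDEX so only the assistant response
    contributes to the loss. Hypothetical helper for illustration."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

ids = [101, 7592, 2088, 999, 102]   # toy token ids
print(build_labels(ids, 3))         # [-100, -100, -100, 999, 102]
```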

Usage Examples

Basic Dataset Creation

from internvl.train.dataset import LazySupervisedDataset

# Define dataset metadata
meta = {
    'annotation': '/path/to/train.jsonl',
    'root': '/path/to/images/',
    'data_augment': False,
    'repeat_time': 1,
}

# Create dataset with dynamic resolution
dataset = LazySupervisedDataset(
    template_name='internvl2_5',
    meta=meta,
    tokenizer=tokenizer,
    tcs_loader=tcs_loader,
    ds_name='custom_dataset',
    num_image_token=256,
    image_size=448,
    dynamic_image_size=True,
    use_thumbnail=True,
    max_dynamic_patch=12,
)

# Access a single sample
sample = dataset[0]
print(sample['input_ids'].shape)      # [seq_len]
print(sample['pixel_values'].shape)   # [num_tiles, 3, 448, 448]

JSONL Annotation Format

# Each line in the JSONL annotation file:
{
    "image": "relative/path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe this image."},
        {"from": "gpt", "value": "The image shows..."}
    ]
}

# For video data:
{
    "video": "relative/path/to/video.mp4",
    "conversations": [
        {"from": "human", "value": "<video>\nWhat is happening?"},
        {"from": "gpt", "value": "In the video..."}
    ]
}

# For multi-image:
{
    "image": ["img1.jpg", "img2.jpg", "img3.jpg"],
    "conversations": [
        {"from": "human", "value": "<image>\n<image>\n<image>\nCompare these images."},
        {"from": "gpt", "value": "The images show..."}
    ]
}
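A short snippet for producing a file in this format with the standard json module, one object per line (the record contents here are placeholders; the output path is whatever meta['annotation'] should point to):

```python
import json

# Placeholder records matching the single-image format above.
records = [
    {
        "image": "relative/path/to/image.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe this image."},
            {"from": "gpt", "value": "The image shows..."},
        ],
    },
]

# One JSON object per line -- this is the JSONL file that
# meta['annotation'] points to.
with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Round-trip check: each line parses back to the original dict.
with open("train.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["image"])  # relative/path/to/image.jpg
```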

Related Pages

Implements Principle

Requires Environment
