Implementation: OpenGVLab InternVL LazySupervisedDataset
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language, Data_Engineering, Preprocessing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A dataset class in the InternVL training framework that lazily loads and preprocesses multimodal training data.
Description
The LazySupervisedDataset class is the core data loading component of InternVL's training pipeline. It wraps JSONL annotation files and loads image/video data on-demand during training. Each sample is processed through the conversation template system, dynamic resolution tiling, and tokenization to produce training-ready tensors.
Key capabilities:
- Lazy loading of images from local disk or S3 (via TCSLoader)
- Dynamic image tiling with configurable patch counts (1-12+ tiles)
- Video frame sampling with configurable frame counts
- Multi-turn conversation formatting using LLM-specific templates
- Support for packed dataset mode (distributed data sharding)
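The dynamic tiling step above can be illustrated with a simplified, self-contained sketch. The grid-selection rule (enumerate candidate `(cols, rows)` grids whose tile count lies in `[min_dynamic_patch, max_dynamic_patch]`, pick the one whose aspect ratio best matches the image, and break ties toward finer grids for large images) is modeled on InternVL's published preprocessing, but treat the helper below as an approximation rather than the exact library code:

```python
# Simplified sketch of InternVL-style dynamic-resolution grid selection.
# Choose a (cols, rows) tile grid whose total tile count lies in
# [min_num, max_num] and whose aspect ratio best matches the image;
# ties prefer the finer grid when the image area is large enough.

def find_closest_aspect_ratio(width, height, min_num=1, max_num=12, image_size=448):
    aspect_ratio = width / height
    # Enumerate candidate grids, ordered by tile count.
    target_ratios = sorted(
        {(i, j) for n in range(min_num, max_num + 1)
         for i in range(1, n + 1) for j in range(1, n + 1)
         if min_num <= i * j <= max_num},
        key=lambda r: r[0] * r[1],
    )
    best_diff, best = float('inf'), (1, 1)
    area = width * height
    for cols, rows in target_ratios:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            best = (cols, rows)  # prefer the finer grid for large images
    return best

# A 1600x800 image (2:1 aspect) maps to a 4x2 grid of 448px tiles.
print(find_closest_aspect_ratio(1600, 800))  # → (4, 2)
```

With `use_thumbnail=True`, one extra global tile is appended whenever more than one tile is produced, so the 4x2 case above would yield 9 tiles in total.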
Usage
Import this class when building a training dataset for InternVL fine-tuning or pretraining. It is instantiated once per dataset in the mixture, then combined via ConcatDataset or PackedDataset.
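The combination step can be pictured as an index mapping: a global sample index is translated into (dataset, local index) via cumulative sizes, which is how `torch.utils.data.ConcatDataset` works internally. The stdlib-only class below is purely illustrative, not InternVL code:

```python
# Stdlib sketch of how ConcatDataset maps a global index to a
# (dataset, local index) pair using cumulative sizes and bisection.
from bisect import bisect_right
from itertools import accumulate

class TinyConcat:
    def __init__(self, datasets):
        self.datasets = datasets
        self.cumsizes = list(accumulate(len(d) for d in datasets))

    def __len__(self):
        return self.cumsizes[-1]

    def __getitem__(self, idx):
        ds = bisect_right(self.cumsizes, idx)                 # which dataset
        local = idx - (self.cumsizes[ds - 1] if ds else 0)    # offset within it
        return self.datasets[ds][local]

mix = TinyConcat([['a0', 'a1'], ['b0', 'b1', 'b2']])
print(len(mix), mix[3])  # → 5 b1
```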
Code Reference
Source Location
- Repository: InternVL
- File: internvl_chat/internvl/train/dataset.py
- Lines: L269-743
Signature
```python
class LazySupervisedDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        template_name,
        meta,
        tokenizer,
        tcs_loader,
        ds_name,
        num_image_token,
        image_size=448,
        is_train=True,
        pad2square=False,
        group_by_length=False,
        dynamic_image_size=False,
        use_thumbnail=False,
        min_dynamic_patch=1,
        max_dynamic_patch=12,
        min_num_frame=8,
        max_num_frame=32,
        sampling_method='rand',
        repeat_time=1,
        normalize_type='imagenet',
        use_packed_ds=False,
        data_rank=0,
        data_world_size=1,
        distributed_mode=False,
        force_shuffle=False,
        random_seed=0,
    ):
        """
        Args:
            template_name: Conversation template name (e.g. 'internvl2_5', 'internlm2-chat')
            meta: Dict with 'annotation' (JSONL path) and 'root' (image directory) keys
            tokenizer: HuggingFace tokenizer instance
            tcs_loader: TCSLoader for S3/local image loading
            ds_name: Dataset identifier string
            num_image_token: Number of visual tokens per image tile
            image_size: Pixel size for image processing (default 448)
            is_train: Training mode flag
            pad2square: Pad images to square before processing
            group_by_length: Enable length-based grouping for efficient batching
            dynamic_image_size: Enable dynamic resolution tiling
            use_thumbnail: Include global thumbnail tile
            min_dynamic_patch: Minimum number of tiles (default 1)
            max_dynamic_patch: Maximum number of tiles (default 12)
            min_num_frame: Minimum video frames to sample (default 8)
            max_num_frame: Maximum video frames to sample (default 32)
            sampling_method: Video frame sampling strategy ('rand')
            repeat_time: Dataset repetition factor
            normalize_type: Image normalization type ('imagenet', 'clip', 'siglip')
            use_packed_ds: Enable packed dataset sharding mode
            data_rank: Distributed rank for data sharding
            data_world_size: World size for data sharding
            distributed_mode: Enable distributed data loading
            force_shuffle: Force data shuffling
            random_seed: Random seed for reproducibility
        """
```
Import
```python
from internvl.train.dataset import LazySupervisedDataset
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| template_name | str | Yes | Conversation template matching the LLM backend |
| meta | dict | Yes | Dataset metadata with 'annotation' and 'root' keys |
| tokenizer | PreTrainedTokenizer | Yes | HuggingFace tokenizer for text encoding |
| tcs_loader | TCSLoader | Yes | Image/video file loader (S3 or local) |
| ds_name | str | Yes | Dataset name identifier |
| num_image_token | int | Yes | Visual tokens per image tile (derived from model config) |
| dynamic_image_size | bool | No | Enable dynamic multi-tile image processing |
| max_dynamic_patch | int | No | Maximum tiles per image (default 12) |
Outputs
| Name | Type | Description |
|---|---|---|
| __getitem__ returns | Dict[str, torch.Tensor] | Per-sample dict with keys: input_ids, labels, attention_mask, pixel_values, image_flags |
| input_ids | torch.LongTensor | Tokenized input sequence with image tokens |
| labels | torch.LongTensor | Training labels (-100 for masked positions) |
| pixel_values | torch.FloatTensor | Preprocessed image tiles [N_tiles, 3, H, W] |
| image_flags | torch.LongTensor | Flags indicating real (1) vs padding (0) image tiles |
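The outputs are tied together by an invariant: the number of image-context placeholder tokens in `input_ids` must equal `num_image_token` times the number of tiles flagged as real in `image_flags`, since the model scatters one visual embedding into each placeholder position. The sanity check below uses plain lists and toy values (`IMG_CONTEXT_ID` and `num_image_token=4` are illustrative, not the real vocabulary id or the typical per-tile count of 256):

```python
# Hedged sanity check: image-context tokens in input_ids should equal
# num_image_token * (number of tiles flagged 1 in image_flags).
# IMG_CONTEXT_ID and the toy sequence below are illustrative only.
IMG_CONTEXT_ID = 92546      # hypothetical placeholder token id
num_image_token = 4         # toy per-tile token count (real default is 256)

image_flags = [1, 1, 0]     # two real tiles, one padding tile
input_ids = [1, 5, 7] + [IMG_CONTEXT_ID] * (num_image_token * 2) + [9, 2]

n_ctx = sum(t == IMG_CONTEXT_ID for t in input_ids)
assert n_ctx == num_image_token * sum(image_flags)
print('consistent:', n_ctx, 'context tokens for', sum(image_flags), 'real tiles')
```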
Usage Examples
Basic Dataset Creation
```python
from internvl.train.dataset import LazySupervisedDataset

# Define dataset metadata
meta = {
    'annotation': '/path/to/train.jsonl',
    'root': '/path/to/images/',
    'data_augment': False,
    'repeat_time': 1,
}

# Create dataset with dynamic resolution
dataset = LazySupervisedDataset(
    template_name='internvl2_5',
    meta=meta,
    tokenizer=tokenizer,
    tcs_loader=tcs_loader,
    ds_name='custom_dataset',
    num_image_token=256,
    image_size=448,
    dynamic_image_size=True,
    use_thumbnail=True,
    max_dynamic_patch=12,
)

# Access a single sample
sample = dataset[0]
print(sample['input_ids'].shape)    # [seq_len]
print(sample['pixel_values'].shape) # [num_tiles, 3, 448, 448]
```
JSONL Annotation Format
```
# Each line in the JSONL annotation file is one JSON object:
{
  "image": "relative/path/to/image.jpg",
  "conversations": [
    {"from": "human", "value": "<image>\nDescribe this image."},
    {"from": "gpt", "value": "The image shows..."}
  ]
}

# For video data:
{
  "video": "relative/path/to/video.mp4",
  "conversations": [
    {"from": "human", "value": "<video>\nWhat is happening?"},
    {"from": "gpt", "value": "In the video..."}
  ]
}

# For multi-image:
{
  "image": ["img1.jpg", "img2.jpg", "img3.jpg"],
  "conversations": [
    {"from": "human", "value": "<image>\n<image>\n<image>\nCompare these images."},
    {"from": "gpt", "value": "The images show..."}
  ]
}
```