Implementation:OpenGVLab InternVL Build Transform

Knowledge Sources	InternVL
Domains	Computer_Vision, Preprocessing
Last Updated	2026-02-07 00:00 GMT

Overview

build_transform is a concrete utility from the InternVL preprocessing pipeline that builds torchvision transform compositions for image inputs.

Description

The build_transform function constructs a torchvision.transforms.Compose pipeline that converts PIL images into normalized tensors. It supports three normalization schemes (ImageNet, CLIP, SigLIP), optional pad-to-square preprocessing, and separate train/eval augmentation paths.
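
To make the normalization schemes concrete, the sketch below shows how a normalize_type string might map to per-channel mean and standard deviation, and what the final Normalize step computes for a single pixel. The statistics are the commonly published ImageNet, CLIP, and SigLIP values, not values quoted from this page, and NORMALIZE_STATS, select_stats, and normalize_pixel are hypothetical names used only for illustration. It is plain Python, so it runs without torch.

```python
# Commonly published per-channel (R, G, B) statistics; assumed here, since
# this page does not list the exact values InternVL uses.
NORMALIZE_STATS = {
    "imagenet": ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    "clip": ((0.48145466, 0.4578275, 0.40821073),
             (0.26862954, 0.26130258, 0.27577711)),
    "siglip": ((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
}


def select_stats(normalize_type):
    """Return (mean, std) for a scheme name; hypothetical helper."""
    try:
        return NORMALIZE_STATS[normalize_type]
    except KeyError:
        raise ValueError(f"unknown normalize_type: {normalize_type!r}") from None


def normalize_pixel(rgb, mean, std):
    """Apply (value - mean) / std per channel to one pixel scaled to [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


mean, std = select_stats("siglip")
white = normalize_pixel((1.0, 1.0, 1.0), mean, std)
black = normalize_pixel((0.0, 0.0, 0.0), mean, std)
```

With the SigLIP statistics, pixel values in [0, 1] map to [-1, 1]. The ImageNet and CLIP statistics differ per channel, so the resulting range also differs per channel.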

Usage

Import this function when building image preprocessing pipelines for InternVL training or inference. It is called internally by LazySupervisedDataset and is also used in evaluation scripts.

Code Reference

Source Location

Repository: InternVL
File: internvl_chat/internvl/train/dataset.py
Lines: L276-310

Signature

def build_transform(is_train, input_size, pad2square=False, normalize_type='imagenet'):
    """
    Build a torchvision transform pipeline for image preprocessing.

    Args:
        is_train: bool - Whether to use training augmentations
        input_size: int - Target image size in pixels
        pad2square: bool - Pad images to square before resizing (default False)
        normalize_type: str - Normalization type: 'imagenet', 'clip', or 'siglip'

    Returns:
        torchvision.transforms.Compose - Complete transform pipeline
    """

Import

from internvl.train.dataset import build_transform

I/O Contract

Inputs

Name	Type	Required	Description
is_train	bool	Yes	Training mode enables random augmentations
input_size	int	Yes	Target pixel size (typically 448)
pad2square	bool	No	Pad to square before resize (default False)
normalize_type	str	No	Normalization statistics: 'imagenet', 'clip', 'siglip' (default 'imagenet')

Outputs

Name	Type	Description
transform	torchvision.transforms.Compose	Pipeline: [optional pad] -> Resize -> ToTensor -> Normalize
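
The optional pad step can be pictured as placing the image on a square canvas before the resize, so that resizing to input_size x input_size does not distort the aspect ratio. The sketch below computes only the geometry, assuming the image is centered on the canvas. The page does not say where the padding goes or what fill color is used, and square_canvas is a hypothetical helper name.

```python
def square_canvas(width, height):
    """Return (side, (left, top)) for centering a width x height image
    on a side x side canvas. Centering is an assumption of this sketch."""
    side = max(width, height)
    left = (side - width) // 2
    top = (side - height) // 2
    return side, (left, top)


# A 640x480 landscape image gets 80 px of padding above and below.
side, offset = square_canvas(640, 480)
```

An image that is already square gets a zero offset, so the pad step leaves it unchanged under this assumption.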

Usage Examples

Training Transform

from PIL import Image

from internvl.train.dataset import build_transform

# Training transform with ImageNet normalization
train_transform = build_transform(
    is_train=True,
    input_size=448,
    normalize_type='imagenet',
)

# Apply to a PIL image; convert to RGB so the output has 3 channels
img = Image.open('photo.jpg').convert('RGB')
tensor = train_transform(img)  # shape: [3, 448, 448]

Inference Transform

from internvl.train.dataset import build_transform

# Inference transform (deterministic, no augmentation)
eval_transform = build_transform(
    is_train=False,
    input_size=448,
    normalize_type='imagenet',
)

Related Pages

Implements Principle

Principle:OpenGVLab_InternVL_Image_Transform_Pipeline

Requires Environment

Environment:OpenGVLab_InternVL_PyTorch_CUDA
