Principle:OpenGVLab InternVL Multimodal Data Collation
| Knowledge Sources | |
|---|---|
| Domains | Training, Data_Engineering, Vision_Language |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A batching strategy for multimodal training that pads variable-length text sequences and concatenates variable-count image tiles into unified batch tensors.
Description
Multimodal data collation solves the challenge of batching samples with different numbers of image tiles and different text sequence lengths. In vision-language training, each sample may have a different number of image tiles (due to dynamic resolution) and different conversation lengths. The collator must:
- Pad text sequences to the maximum length in the batch (using pad_id=0 for input_ids, -100 for labels)
- Concatenate image tiles across all samples in the batch (since each sample may have 1-12+ tiles)
- Track image ownership via image_flags so the model knows which tiles belong to which sample
- Handle attention masks to prevent attending to padding tokens
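The text-padding step above can be sketched as follows (a minimal PyTorch sketch; `pad_1d` is a hypothetical helper, and -100 is used for labels because it matches `torch.nn.CrossEntropyLoss`'s default `ignore_index`, so loss is not computed on padding positions):

```python
import torch

def pad_1d(seq, max_len, pad_value):
    """Right-pad a 1-D tensor to max_len with pad_value."""
    return torch.cat([seq, seq.new_full((max_len - seq.numel(),), pad_value)])

# Two samples with different conversation lengths
ids_a = torch.tensor([101, 7, 8, 102])
ids_b = torch.tensor([101, 9, 102])
max_len = max(ids_a.numel(), ids_b.numel())

input_ids = torch.stack([pad_1d(s, max_len, 0) for s in (ids_a, ids_b)])
labels = torch.stack([pad_1d(s, max_len, -100) for s in (ids_a, ids_b)])
# Mask out pad positions (safe here because 0 is not a real token id in this toy batch)
attention_mask = (input_ids != 0).long()
```

Both tensors end up shaped `(batch, max_len)`, and the mask zeroes out exactly the padded tail of the shorter sample.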
Usage
Use this principle whenever batching multimodal training samples for InternVL. The standard collator is used for supervised fine-tuning; a separate DPO-specific collator handles preference optimization pairs.
Theoretical Basis
# Multimodal batch collation (runnable PyTorch sketch)
import torch

def pad(seqs, max_len, pad_value):
    # Right-pad each 1-D tensor to max_len, then stack into (batch, max_len)
    return torch.stack(
        [torch.cat([s, s.new_full((max_len - len(s),), pad_value)]) for s in seqs]
    )

def collate(samples):
    max_len = max(len(s['input_ids']) for s in samples)
    # Pad text fields to the longest sequence in the batch
    input_ids = pad([s['input_ids'] for s in samples], max_len, pad_value=0)
    labels = pad([s['labels'] for s in samples], max_len, pad_value=-100)
    attention_mask = pad([s['attention_mask'] for s in samples], max_len, pad_value=0)
    # Concatenate variable-count image tiles along the tile dimension
    pixel_values = torch.cat([s['pixel_values'] for s in samples], dim=0)
    image_flags = torch.cat([s['image_flags'] for s in samples], dim=0)
    return {'input_ids': input_ids, 'labels': labels, 'attention_mask': attention_mask,
            'pixel_values': pixel_values, 'image_flags': image_flags}
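The tile-concatenation side can be exercised in isolation with dummy tensors (a sketch; the 3×448×448 tile shape follows InternVL's default input resolution, an assumption here, and `image_flags` is taken to be a 0/1 marker per tile with 1 meaning a real tile):

```python
import torch

# Sample A contributes 2 tiles, sample B contributes 5 (dynamic resolution)
tiles_a = torch.randn(2, 3, 448, 448)
tiles_b = torch.randn(5, 3, 448, 448)
flags_a = torch.ones(2, dtype=torch.long)
flags_b = torch.ones(5, dtype=torch.long)

pixel_values = torch.cat([tiles_a, tiles_b], dim=0)  # (7, 3, 448, 448)
image_flags = torch.cat([flags_a, flags_b], dim=0)   # (7,)

# Per-sample tile counts let the model map concatenated tiles back to samples
tile_counts = [2, 5]
per_sample = torch.split(pixel_values, tile_counts, dim=0)
```

Because tiles are concatenated rather than padded, no pixel memory is wasted on samples with few tiles; ownership is recovered from the tile counts (or the flags) rather than from a fixed per-sample slot.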