Principle:OpenGVLab InternVL Multimodal Data Collation

From Leeroopedia


Knowledge Sources
Domains Training, Data_Engineering, Vision_Language
Last Updated 2026-02-07 00:00 GMT

Overview

A batching strategy for multimodal training that pads variable-length text sequences and concatenates variable-count image tiles into unified batch tensors.

Description

Multimodal data collation addresses the problem of batching samples whose shapes differ. In vision-language training, dynamic resolution gives each sample a different number of image tiles, and conversations vary in length, so samples cannot simply be stacked into a batch. The collator must:

  • Pad text sequences to the maximum length in the batch (using pad_id=0 for input_ids, -100 for labels)
  • Concatenate image tiles across all samples in the batch (since each sample may have 1-12+ tiles)
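  • Carry per-tile image_flags alongside the concatenated tiles, in the same order, so the model can tell which entries are real image inputs and which are placeholders (for example, for text-only samples)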
  • Build attention masks that are zero over padded positions, so the model does not attend to padding tokens
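
The text-side requirements above can be illustrated with a minimal sketch. The helper name pad_to_length and the toy token ids are hypothetical and chosen only for illustration; this is not claimed to be the InternVL implementation.

```python
import torch

def pad_to_length(seqs, max_len, pad_value):
    """Right-pad 1-D tensors to max_len and stack them into (batch, max_len).

    Hypothetical helper for illustration only.
    """
    out = torch.full((len(seqs), max_len), pad_value, dtype=seqs[0].dtype)
    for row, seq in enumerate(seqs):
        out[row, : seq.numel()] = seq
    return out

# Two toy samples with 3 and 5 tokens (token ids are arbitrary non-zero values).
ids = [torch.tensor([11, 12, 13]), torch.tensor([21, 22, 23, 24, 25])]
lbl = [torch.tensor([-100, 12, 13]), torch.tensor([-100, -100, 23, 24, 25])]
max_len = max(t.numel() for t in ids)

input_ids = pad_to_length(ids, max_len, pad_value=0)    # pad token id 0
labels = pad_to_length(lbl, max_len, pad_value=-100)    # -100 is ignored by the loss

# The mask is 1 over real tokens and 0 over padded positions.
lengths = torch.tensor([t.numel() for t in ids])
attention_mask = (torch.arange(max_len)[None, :] < lengths[:, None]).long()
```

Deriving the mask from sequence lengths, rather than from `input_ids != 0`, avoids masking a genuine token that happens to share the pad id; which approach a given codebase uses is a design choice.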

Usage

Use this principle whenever batching multimodal training samples for InternVL. The standard collator is used for supervised fine-tuning; a separate DPO-specific collator handles preference optimization pairs.
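
As a rough illustration of how such a collator might be wired into training, the sketch below passes a collate function to a PyTorch DataLoader through its collate_fn argument. The make_sample helper, the 8x8 tile size, and the in-memory list dataset are assumptions made to keep the example small; a real pipeline would use its own dataset class and tile resolution.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def make_sample(num_tokens, num_tiles):
    """Build a toy sample (hypothetical helper; tiny 8x8 tiles for brevity)."""
    return {
        "input_ids": torch.arange(1, num_tokens + 1),
        "labels": torch.arange(1, num_tokens + 1),
        "attention_mask": torch.ones(num_tokens, dtype=torch.long),
        "pixel_values": torch.zeros(num_tiles, 3, 8, 8),
        "image_flags": torch.ones(num_tiles, dtype=torch.long),
    }

def collate(samples):
    # Text fields are padded to the longest sequence in the batch.
    pad_values = {"input_ids": 0, "labels": -100, "attention_mask": 0}
    batch = {
        key: pad_sequence([s[key] for s in samples], batch_first=True, padding_value=value)
        for key, value in pad_values.items()
    }
    # Image fields are concatenated along the tile dimension, not stacked.
    for key in ("pixel_values", "image_flags"):
        batch[key] = torch.cat([s[key] for s in samples], dim=0)
    return batch

# A plain list works as a map-style dataset for demonstration purposes.
dataset = [make_sample(3, 2), make_sample(5, 4)]
loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=collate)
batch = next(iter(loader))
```

With these two samples, the text tensors have a leading batch dimension of 2, while pixel_values has a leading dimension of 6 (2 + 4 tiles), which is why the default stacking collation would not work here.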

Theoretical Basis

# Sketch: multimodal batch collation (assumes every field is a torch tensor)
import torch
from torch.nn.utils.rnn import pad_sequence

def collate(samples):
    # Pad text fields to the longest sequence in the batch
    input_ids = pad_sequence([s['input_ids'] for s in samples], batch_first=True, padding_value=0)
    labels = pad_sequence([s['labels'] for s in samples], batch_first=True, padding_value=-100)
    attention_mask = pad_sequence([s['attention_mask'] for s in samples], batch_first=True, padding_value=0)

    # Concatenate variable-count image tiles along the tile dimension
    pixel_values = torch.cat([s['pixel_values'] for s in samples], dim=0)
    image_flags = torch.cat([s['image_flags'] for s in samples], dim=0)

    return {
        'input_ids': input_ids,
        'labels': labels,
        'attention_mask': attention_mask,
        'pixel_values': pixel_values,
        'image_flags': image_flags,
    }
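
One way the concatenated image_flags might be consumed downstream is sketched below. It assumes a convention where a flag of 1 marks a real tile and 0 marks a placeholder tile (for instance, one added for a text-only sample); that convention and the stand-in "encoder" are assumptions for illustration, not a statement of how the model is implemented.

```python
import torch

# Sample A contributes 2 real tiles; sample B is text-only and contributes
# 1 placeholder tile with flag 0 (assumed convention).
pixel_values = torch.cat([torch.ones(2, 3, 8, 8), torch.zeros(1, 3, 8, 8)], dim=0)
image_flags = torch.cat(
    [torch.ones(2, dtype=torch.long), torch.zeros(1, dtype=torch.long)], dim=0
)

# Stand-in for a vision encoder: one feature vector per tile (hypothetical).
tile_features = pixel_values.flatten(start_dim=1).mean(dim=1, keepdim=True)

# Boolean indexing keeps only the features that belong to real tiles.
real_features = tile_features[image_flags == 1]
```

Because tiles and flags are concatenated in the same sample order, a single boolean mask over the flattened tile dimension is enough to drop placeholders without any per-sample bookkeeping.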

Related Pages

Implemented By
