Principle:OpenGVLab InternVL Multimodal Data Collation
| Knowledge Sources | |
|---|---|
| Domains | Training, Data_Engineering, Vision_Language |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A batching strategy for multimodal training that pads variable-length text sequences and concatenates variable-count image tiles into unified batch tensors.
Description
Multimodal data collation solves the challenge of batching samples with different numbers of image tiles and different text sequence lengths. In vision-language training, each sample may have a different number of image tiles (due to dynamic resolution) and different conversation lengths. The collator must:
- Pad text sequences to the maximum length in the batch (using pad_id=0 for input_ids, -100 for labels)
- Concatenate image tiles across all samples in the batch (since each sample may have 1-12+ tiles)
- Track image ownership via image_flags so the model knows which tiles belong to which sample
- Handle attention masks to prevent attending to padding tokens
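The text-padding step above can be sketched as follows (a minimal PyTorch sketch; `pad_1d` is a hypothetical helper, and -100 is used for labels because it matches `torch.nn.CrossEntropyLoss`'s default `ignore_index`, so loss is not computed on padding positions):

```python
import torch

def pad_1d(seq, max_len, pad_value):
    """Right-pad a 1-D tensor to max_len with pad_value."""
    return torch.cat([seq, seq.new_full((max_len - seq.numel(),), pad_value)])

# Two samples with different conversation lengths
ids_a = torch.tensor([101, 7, 8, 102])
ids_b = torch.tensor([101, 9, 102])
max_len = max(ids_a.numel(), ids_b.numel())

input_ids = torch.stack([pad_1d(s, max_len, 0) for s in (ids_a, ids_b)])
labels = torch.stack([pad_1d(s, max_len, -100) for s in (ids_a, ids_b)])
# Mask out pad positions (safe here because 0 is not a real token id in this toy batch)
attention_mask = (input_ids != 0).long()
```

Both tensors end up shaped `(batch, max_len)`, and the mask zeroes out exactly the padded tail of the shorter sample.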
Usage
Use this principle whenever batching multimodal training samples for InternVL. The standard collator is used for supervised fine-tuning; a separate DPO-specific collator handles preference optimization pairs.
Theoretical Basis
# Multimodal batch collation (runnable PyTorch sketch)
import torch

def pad(seqs, max_len, pad_value):
    # Right-pad each 1-D tensor to max_len, then stack into (batch, max_len)
    return torch.stack(
        [torch.cat([s, s.new_full((max_len - len(s),), pad_value)]) for s in seqs]
    )

def collate(samples):
    max_len = max(len(s['input_ids']) for s in samples)
    # Pad text fields to the longest sequence in the batch
    input_ids = pad([s['input_ids'] for s in samples], max_len, pad_value=0)
    labels = pad([s['labels'] for s in samples], max_len, pad_value=-100)
    attention_mask = pad([s['attention_mask'] for s in samples], max_len, pad_value=0)
    # Concatenate variable-count image tiles along the tile dimension
    pixel_values = torch.cat([s['pixel_values'] for s in samples], dim=0)
    image_flags = torch.cat([s['image_flags'] for s in samples], dim=0)
    return {'input_ids': input_ids, 'labels': labels, 'attention_mask': attention_mask,
            'pixel_values': pixel_values, 'image_flags': image_flags}
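The tile-concatenation side can be exercised in isolation with dummy tensors (a sketch; the 3×448×448 tile shape follows InternVL's default input resolution, an assumption here, and `image_flags` is taken to be a 0/1 marker per tile with 1 meaning a real tile):

```python
import torch

# Sample A contributes 2 tiles, sample B contributes 5 (dynamic resolution)
tiles_a = torch.randn(2, 3, 448, 448)
tiles_b = torch.randn(5, 3, 448, 448)
flags_a = torch.ones(2, dtype=torch.long)
flags_b = torch.ones(5, dtype=torch.long)

pixel_values = torch.cat([tiles_a, tiles_b], dim=0)  # (7, 3, 448, 448)
image_flags = torch.cat([flags_a, flags_b], dim=0)   # (7,)

# Per-sample tile counts let the model map concatenated tiles back to samples
tile_counts = [2, 5]
per_sample = torch.split(pixel_values, tile_counts, dim=0)
```

Because tiles are concatenated rather than padded, no pixel memory is wasted on samples with few tiles; ownership is recovered from the tile counts (or the flags) rather than from a fixed per-sample slot.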