Principle: OpenGVLab InternVL DPO Data Collation
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Data_Engineering, Vision_Language |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A specialized batching strategy for preference optimization training that handles chosen and rejected response pairs alongside multimodal image data.
Description
DPO training requires paired examples: for each input, both a chosen (preferred) and a rejected (dispreferred) response must be present. The DPO data collator extends the standard multimodal collator to:
- Pad chosen and rejected text sequences independently (they may have different lengths)
- Concatenate pixel_values and image_flags across all samples in the batch
- Maintain separate chosen/rejected fields for the DPO loss computation
Unlike the standard collator, which handles a single response per sample, the DPO collator manages four text sequences per sample: chosen_input_ids, chosen_labels, rejected_input_ids, and rejected_labels.
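The independent-padding point can be illustrated with a minimal, framework-free sketch (pad_batch is a hypothetical helper for illustration; the real collator operates on tensors): each group is padded only to its own longest member, so the chosen and rejected batches can end up with different widths.

```python
def pad_batch(seqs, pad_value=0):
    """Right-pad each sequence to the longest length within this group only."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_value] * (max_len - len(s)) for s in seqs]

# Toy token ids: chosen sequences have lengths 3 and 5, rejected 2 and 4.
chosen = pad_batch([[1, 2, 3], [4, 5, 6, 7, 8]])
rejected = pad_batch([[9, 10], [11, 12, 13, 14]])

# Padded widths differ (5 vs 4) because each group is padded independently.
assert len(chosen[0]) == 5 and len(rejected[0]) == 4
```

Padding each group independently avoids wasting compute: padding both to a shared maximum would inflate the shorter group with extra pad tokens.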
Usage
Use this collator for DPO/MPO preference optimization training. It replaces the standard concat_pad_data_collator when training with MultimodalDPOTrainer.
Theoretical Basis
```python
# Simplified collation sketch; the real implementation operates on tensors
def pad(seqs, pad_value=0):
    """Right-pad each sequence to the longest length in its group."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]

def cat(batches):
    """Concatenate per-sample rows along the batch dimension."""
    return [row for batch in batches for row in batch]

def dpo_collate(samples):
    # Pad chosen and rejected independently: they may have different lengths
    return {
        'chosen_input_ids': pad([s['chosen_input_ids'] for s in samples]),
        'chosen_labels': pad([s['chosen_labels'] for s in samples], pad_value=-100),
        'rejected_input_ids': pad([s['rejected_input_ids'] for s in samples]),
        'rejected_labels': pad([s['rejected_labels'] for s in samples], pad_value=-100),
        # Concatenate multimodal data across the batch
        'pixel_values': cat([s['pixel_values'] for s in samples]),
        'image_flags': cat([s['image_flags'] for s in samples]),
    }
```
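For reference, the loss that consumes these paired fields is the standard DPO objective, where $y_w$ corresponds to the chosen sequence and $y_l$ to the rejected one, $\pi_\theta$ is the policy being trained, and $\pi_{\mathrm{ref}}$ is the frozen reference model:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Because the loss is a per-sample difference of chosen and rejected log-probabilities, both sequences for a given input must travel together in the same batch, which is exactly what this collator guarantees.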