Principle: OpenGVLab InternVL DPO Data Collation
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Data_Engineering, Vision_Language |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A specialized batching strategy for preference optimization training that handles chosen and rejected response pairs alongside multimodal image data.
Description
DPO training requires paired examples: for each input, both a chosen (preferred) and a rejected (dispreferred) response must be present. The DPO data collator extends the standard multimodal collator to:
- Pad chosen and rejected text sequences independently (they may have different lengths)
- Concatenate pixel_values and image_flags across all samples in the batch
- Maintain separate chosen/rejected fields for the DPO loss computation
Unlike the standard collator, which handles a single response per sample, the DPO collator manages four text sequences per sample: chosen_input_ids, chosen_labels, rejected_input_ids, and rejected_labels.
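The independent-padding point can be illustrated with a minimal, framework-free sketch (pad_batch is a hypothetical helper for illustration; the real collator operates on tensors): each group is padded only to its own longest member, so the chosen and rejected batches can end up with different widths.

```python
def pad_batch(seqs, pad_value=0):
    """Right-pad each sequence to the longest length within this group only."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_value] * (max_len - len(s)) for s in seqs]

# Toy token ids: chosen sequences have lengths 3 and 5, rejected 2 and 4.
chosen = pad_batch([[1, 2, 3], [4, 5, 6, 7, 8]])
rejected = pad_batch([[9, 10], [11, 12, 13, 14]])

# Padded widths differ (5 vs 4) because each group is padded independently.
assert len(chosen[0]) == 5 and len(rejected[0]) == 4
```

Padding each group independently avoids wasting compute: padding both to a shared maximum would inflate the shorter group with extra pad tokens.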
Usage
Use this collator for DPO/MPO preference optimization training. It replaces the standard concat_pad_data_collator when training with MultimodalDPOTrainer.
Theoretical Basis
```python
# Simplified collation sketch; the real implementation operates on tensors
def pad(seqs, pad_value=0):
    """Right-pad each sequence to the longest length in its group."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]

def cat(batches):
    """Concatenate per-sample rows along the batch dimension."""
    return [row for batch in batches for row in batch]

def dpo_collate(samples):
    # Pad chosen and rejected independently: they may have different lengths
    return {
        'chosen_input_ids': pad([s['chosen_input_ids'] for s in samples]),
        'chosen_labels': pad([s['chosen_labels'] for s in samples], pad_value=-100),
        'rejected_input_ids': pad([s['rejected_input_ids'] for s in samples]),
        'rejected_labels': pad([s['rejected_labels'] for s in samples], pad_value=-100),
        # Concatenate multimodal data across the batch
        'pixel_values': cat([s['pixel_values'] for s in samples]),
        'image_flags': cat([s['image_flags'] for s in samples]),
    }
```
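For reference, the loss that consumes these paired fields is the standard DPO objective, where $y_w$ corresponds to the chosen sequence and $y_l$ to the rejected one, $\pi_\theta$ is the policy being trained, and $\pi_{\mathrm{ref}}$ is the frozen reference model:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Because the loss is a per-sample difference of chosen and rejected log-probabilities, both sequences for a given input must travel together in the same batch, which is exactly what this collator guarantees.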