Workflow:OpenGVLab InternVL Preference Optimization
| Knowledge Sources | |
|---|---|
| Domains | VLMs, RLHF, Preference_Optimization, Multimodal |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
End-to-end process for aligning InternVL models using Mixed Preference Optimization (MPO), combining sigmoid and BCO pair losses for preference-based training.
Description
This workflow implements preference optimization for InternVL models, a post-training alignment technique that improves model outputs by learning from preference pairs (chosen vs. rejected responses). The approach uses a custom MultimodalDPOTrainer that supports multiple loss functions, including sigmoid (standard DPO), BCO pair, hinge, and IPO losses. Training keeps a frozen reference model alongside the trainable policy model; the losses are computed from the log-probability ratios between the two models on the chosen and rejected responses. MPO combines several of these loss types with configurable weights.
Usage
Run this workflow after supervised fine-tuning to improve response quality by training on preference data. It requires paired data in which each example has a question with both a chosen (preferred) and a rejected (less preferred) response. The MMPR multimodal preference dataset format is used. This step is optional, but it can improve output quality, reduce hallucinations, and improve instruction following.
Execution Steps
Step 1: Prepare Preference Data
Construct or obtain preference pair datasets in the required format. Each example consists of a multimodal input (image/video + question) with a chosen response and a rejected response. The data can be constructed automatically using correctness-based sampling or VisualPRM-based methods provided in the repository.
Key considerations:
- Each sample needs: image/video, question, chosen response, and rejected response
- The MMPR-v1.1 dataset provides pre-built preference pairs for InternVL
- Correctness-based construction: sample multiple model outputs and use ground truth to determine chosen/rejected
- VisualPRM construction: use step-level reasoning quality to build preference pairs
- Data is stored in JSONL format with question, chosen, and rejected fields
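The correctness-based construction above can be sketched as follows. This is a minimal illustration, not the repository's pipeline: the `image` field name, the `build_preference_pairs` helper, and the `last_token` answer extractor are assumptions introduced for the example; only the question, chosen, and rejected fields come from the description above.

```python
import itertools
import json


def build_preference_pairs(question, image, responses, ground_truth, extract_answer):
    """Pair every correct sampled response with every incorrect one."""
    correct = [r for r in responses if extract_answer(r) == ground_truth]
    wrong = [r for r in responses if extract_answer(r) != ground_truth]
    return [
        {"image": image, "question": question, "chosen": c, "rejected": w}
        for c, w in itertools.product(correct, wrong)
    ]


def last_token(response):
    # Hypothetical extractor: treat the final whitespace-separated token as the answer.
    return response.strip().split()[-1].rstrip(".")


pairs = build_preference_pairs(
    question="How many apples are in the image?",
    image="images/0001.jpg",  # hypothetical path
    responses=["I count three, so the answer is 3", "The answer is 4", "There are 3"],
    ground_truth="3",
    extract_answer=last_token,
)

# One JSON object per line, as in a JSONL file.
jsonl_lines = [json.dumps(p) for p in pairs]
for line in jsonl_lines:
    print(line)
```

Questions for which every sampled response is correct (or every one is wrong) yield no pairs, which is the expected behavior for this style of construction.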
Step 2: Configure MPO Training
Set up the preference optimization configuration including loss type weights, learning rate, and reference model settings. The standard MPO configuration uses a weighted combination of sigmoid loss (weight 0.8) and BCO pair loss (weight 0.2).
Key considerations:
- Loss type is set to "sigmoid,bco_pair", which combines two preference losses
- Very low learning rate (1e-6) to preserve capabilities from supervised training
- All model components are unfrozen (vision, MLP, LLM)
- Requires 256 GPUs for the standard 8B model configuration
- Per-device batch size is 1; gradient accumulation is set so that the effective batch size (per-device batch size × number of GPUs × accumulation steps) reaches the target of 256
- Liger kernel optimization is enabled for efficiency
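As a rough sketch, the settings listed above could be collected in a small configuration object like the one below. The values come from this step; the class name and field names are illustrative assumptions and may not match the argument names used by the actual training scripts.

```python
from dataclasses import dataclass


@dataclass
class MPOConfig:
    # Values follow the considerations above; names are hypothetical.
    loss_type: str = "sigmoid,bco_pair"
    sigmoid_loss_weight: float = 0.8
    bco_pair_loss_weight: float = 0.2
    learning_rate: float = 1e-6
    freeze_backbone: bool = False  # vision encoder stays trainable
    freeze_mlp: bool = False       # MLP projector stays trainable
    freeze_llm: bool = False       # language model stays trainable
    per_device_batch_size: int = 1
    use_liger: bool = True

    def loss_weights(self):
        """Map each loss name in loss_type to its configured weight."""
        names = [name.strip() for name in self.loss_type.split(",")]
        return {name: getattr(self, f"{name}_loss_weight") for name in names}


cfg = MPOConfig()
print(cfg.loss_weights())
```

Keeping the weights next to the comma-separated loss list makes it easy to check that every listed loss has a weight before a long training run starts.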
Step 3: Load Model and Reference
Load the supervised fine-tuned InternVL model as both the trainable model and the reference model. The reference model is set to evaluation mode and its parameters are frozen. During training, both models process the same inputs, and the loss is computed from the difference in their log-probabilities for chosen versus rejected responses.
Key considerations:
- Both models are initialized from the same supervised fine-tuned checkpoint
- The reference model stays frozen throughout training (eval mode)
- DeepSpeed ZeRO Stage 1 is used for distributed training
- Holding two model copies roughly doubles the memory used for weights; the frozen reference model needs no gradients or optimizer states
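The policy/reference setup can be sketched in PyTorch as below. This is a toy illustration under stated assumptions: a small `nn.Linear` stands in for the supervised fine-tuned InternVL checkpoint, and `make_reference` is a hypothetical helper that copies the policy instead of loading the checkpoint a second time.

```python
import copy

import torch
from torch import nn


def make_reference(policy: nn.Module) -> nn.Module:
    """Return a frozen, eval-mode copy of the policy to use as the reference model."""
    reference = copy.deepcopy(policy)
    reference.eval()  # disable dropout and similar train-time behavior
    for param in reference.parameters():
        param.requires_grad_(False)  # no gradients, so no optimizer state either
    return reference


policy = nn.Linear(4, 2)  # stand-in for the SFT checkpoint
ref_model = make_reference(policy)

# Both models see the same inputs; only the policy builds an autograd graph.
inputs = torch.randn(3, 4)
policy_out = policy(inputs)
with torch.no_grad():
    ref_out = ref_model(inputs)
```

At initialization the two outputs are identical, so the preference loss starts from a neutral point and moves only as the policy diverges from the reference.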
Step 4: Train with MPO Loss
Run the preference optimization training loop using the MultimodalDPOTrainer. For each batch, the trainer computes forward passes through both the policy model and the reference model, then calculates the combined preference loss to update only the policy model's weights.
Key considerations:
- Each sample carries 6 text tensors: input_ids, labels, and attention_mask for both the chosen and the rejected response
- Loss is computed as a weighted sum of sigmoid and BCO pair losses
- Only the policy model is updated; the reference model remains unchanged
- Training typically runs for 1 epoch over the preference dataset
- Per-sample data collation handles the dual chosen/rejected format
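The weighted sum of the sigmoid and BCO pair losses can be sketched for a single preference pair as follows. This is a simplified scalar version under stated assumptions: the inputs are summed sequence log-probabilities, `beta=0.1` is an illustrative default, and `delta` (the BCO reward baseline, usually tracked as a running mean during training) is passed in as a fixed number.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def mpo_preference_loss(policy_chosen_logp, policy_rejected_logp,
                        ref_chosen_logp, ref_rejected_logp,
                        beta=0.1, delta=0.0,
                        sigmoid_weight=0.8, bco_pair_weight=0.2):
    """Weighted sum of the sigmoid (DPO) and BCO pair losses for one pair."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Sigmoid (standard DPO) loss: push the chosen log-ratio above the rejected one.
    sigmoid_loss = -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

    # BCO pair loss: score chosen and rejected separately against the baseline delta.
    bco_pair_loss = (-log_sigmoid(beta * chosen_logratio - delta)
                     - log_sigmoid(-(beta * rejected_logratio - delta)))

    return sigmoid_weight * sigmoid_loss + bco_pair_weight * bco_pair_loss


# Policy identical to the reference: both log-ratios are zero.
neutral = mpo_preference_loss(-2.0, -3.0, -2.0, -3.0)
# Policy raises the chosen response and lowers the rejected one.
improved = mpo_preference_loss(-1.0, -5.0, -2.0, -3.0)
print(neutral, improved)
```

Because the reference log-probabilities enter only as constants, gradients of this loss flow to the policy model alone, which matches the update rule described in this step.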
Step 5: Export Aligned Model
Save the aligned model checkpoint. The trainer copies necessary model configuration files to the output directory alongside the trained weights. The aligned model can be used directly for inference.
Key considerations:
- Model files are copied to the output directory for self-contained deployment
- The reference model is discarded after training
- The aligned model replaces the supervised fine-tuned checkpoint for downstream use
- Verify alignment quality on representative examples before deployment
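Copying the model files into the output directory might look like the sketch below. The `copy_model_files` helper and the file patterns are assumptions made for illustration; the actual trainer may select a different set of files.

```python
import shutil
import tempfile
from pathlib import Path


def copy_model_files(src_dir, out_dir, patterns=("*.json", "*.py")):
    """Copy configuration and code files matching patterns; return the copied names."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for path in sorted(src.glob(pattern)):
            if path.is_file():
                shutil.copy2(path, out / path.name)
                copied.append(path.name)
    return copied


# Demonstration on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sft_checkpoint"
    src.mkdir()
    for name in ("config.json", "modeling_demo.py", "weights.bin"):
        (src / name).write_text("placeholder")
    out = Path(tmp) / "mpo_output"
    copied = copy_model_files(src, out)
    output_names = sorted(p.name for p in out.iterdir())
    print(copied)
```

The weight file is left out on purpose in this sketch, since the trained weights are written by the trainer itself rather than copied from the starting checkpoint.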