Principle: OpenGVLab InternVL Mixed Preference Optimization
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Reinforcement_Learning, Vision_Language |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A composite preference optimization technique that combines multiple loss functions (sigmoid DPO, BCO pair, and NLL) to align vision-language models with human preferences.
Description
Mixed Preference Optimization (MPO) extends Direct Preference Optimization (DPO) for multimodal models by combining multiple loss objectives into a single training signal. Standard DPO trains models to prefer chosen responses over rejected ones using a sigmoid loss. MPO augments this with:
- Sigmoid DPO loss: Standard preference optimization based on implicit reward differences
- BCO pair loss: Binary classifier optimization that provides an additional preference signal
- NLL loss (via rpo_alpha): Standard next-token prediction loss on chosen responses, ensuring the model maintains generation quality
The composite loss is a weighted sum of the three terms:

$$\mathcal{L}_{\text{MPO}} = w_{p}\,\mathcal{L}_{\text{DPO}} + w_{q}\,\mathcal{L}_{\text{BCO}} + w_{g}\,\mathcal{L}_{\text{NLL}}$$

where $w_{p}$, $w_{q}$, and $w_{g}$ weight the preference, quality, and generation losses respectively.
MPO also requires a frozen reference model — an identical copy of the policy model that provides the baseline log-probabilities for the KL divergence term in the DPO objective.
Usage
Use MPO after supervised fine-tuning to align the model with human preferences for response quality, correctness, and safety. It requires preference data with chosen/rejected response pairs.
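As an illustration, a single preference record with a chosen/rejected pair might look like the following. The field names here are an assumption for illustration, not a fixed schema from the InternVL training data:

```python
# Illustrative preference record; field names are an assumption, not a
# schema taken from the InternVL preference datasets.
record = {
    "image": "charts/rainfall.png",  # visual input for the multimodal prompt
    "question": "What unit does the y-axis use?",
    "chosen": "The y-axis is labeled in millimeters of rainfall.",  # preferred
    "rejected": "The chart does not label its axes.",  # dispreferred
}

# A pair is only usable if both responses answer the same prompt.
assert set(record) >= {"question", "chosen", "rejected"}
```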
Theoretical Basis
The DPO objective avoids explicit reward modeling by reparameterizing the reward in terms of the policy and a frozen reference model:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Where:
- $\pi_\theta$ is the policy model (trainable)
- $\pi_{\text{ref}}$ is the reference model (frozen)
- $y_w$ and $y_l$ are the chosen and rejected responses
- $\beta$ is the temperature parameter controlling deviation from the reference
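The sigmoid DPO term can be sketched in plain Python on summed per-response log-probabilities. This is a minimal single-pair sketch; real implementations operate on batched token-level log-probs:

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                     ref_chosen_logp: float, ref_rejected_logp: float,
                     beta: float = 0.1) -> float:
    """Sigmoid DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    # Implicit rewards are beta-scaled log-ratios against the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as log(1 + e^(-margin)).
    return math.log1p(math.exp(-margin))
```

When the policy agrees with the reference the margin is zero and the loss is log 2; it falls as the policy assigns relatively more probability to the chosen response.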
The BCO pair loss adds a binary classification term: the implicit reward of each chosen response is pushed above, and that of each rejected response below, a shared baseline. The NLL loss anchors the policy to next-token prediction on the chosen responses, so generation quality does not degrade while optimizing for preferences.
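Putting the three terms together, a minimal numeric sketch of the composite loss follows. The weights, the BCO baseline `delta`, and the length normalization of the NLL term are illustrative assumptions, not InternVL's exact implementation:

```python
import math

def softplus(x: float) -> float:
    """log(1 + e^x), computed stably."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             chosen_num_tokens: int, beta: float = 0.1, delta: float = 0.0,
             w_p: float = 0.8, w_q: float = 0.2, w_g: float = 1.0) -> float:
    """Composite MPO loss for one preference pair (weights illustrative)."""
    r_c = beta * (policy_chosen_logp - ref_chosen_logp)
    r_r = beta * (policy_rejected_logp - ref_rejected_logp)
    # Sigmoid DPO: prefer chosen over rejected. -log(sigmoid(m)) = softplus(-m).
    l_dpo = softplus(-(r_c - r_r))
    # BCO pair: push the chosen reward above and the rejected reward
    # below the baseline delta.
    l_bco = softplus(-(r_c - delta)) + softplus(r_r - delta)
    # NLL on the chosen response, normalized by its token count.
    l_nll = -policy_chosen_logp / chosen_num_tokens
    return w_p * l_dpo + w_q * l_bco + w_g * l_nll
```

Raising the policy's log-probability on the chosen response lowers all three terms at once, which is the intended joint training signal.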
InternVL's default MPO configuration:
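The original configuration values are not reproduced here. As an illustration only, a comparable mixed-loss setup can be expressed with TRL's `DPOConfig`; the weights and temperature below are assumptions, not InternVL's published defaults:

```python
# Hypothetical sketch of an MPO-style configuration using TRL's DPOConfig.
# All numeric values are illustrative assumptions.
from trl import DPOConfig

config = DPOConfig(
    loss_type=["sigmoid", "bco_pair"],  # mixed preference objectives
    loss_weights=[0.8, 0.2],            # relative term weights (illustrative)
    rpo_alpha=1.0,                      # adds the NLL term on chosen responses
    beta=0.1,                           # DPO temperature (illustrative)
)
```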