Principle: OpenGVLab InternVL Mixed Preference Optimization
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Reinforcement_Learning, Vision_Language |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A composite preference optimization technique that combines multiple loss functions (sigmoid DPO, BCO pair, and NLL) to align vision-language models with human preferences.
Description
Mixed Preference Optimization (MPO) extends Direct Preference Optimization (DPO) for multimodal models by combining multiple loss objectives into a single training signal. Standard DPO trains models to prefer chosen responses over rejected ones using a sigmoid loss. MPO augments this with:
- Sigmoid DPO loss: Standard preference optimization based on implicit reward differences
- BCO pair loss: Binary classifier optimization that provides an additional preference signal
- NLL loss (via rpo_alpha): Standard next-token prediction loss on chosen responses, ensuring the model maintains generation quality
The composite loss is a weighted sum of the three terms:

$$\mathcal{L}_{\text{MPO}} = w_{p}\,\mathcal{L}_{\text{DPO}} + w_{q}\,\mathcal{L}_{\text{BCO}} + w_{g}\,\mathcal{L}_{\text{NLL}}$$

where $w_{p}$, $w_{q}$, and $w_{g}$ weight the preference, quality, and generation losses respectively.
MPO also requires a frozen reference model — an identical copy of the policy model that provides the baseline log-probabilities for the KL divergence term in the DPO objective.
Usage
Use MPO after supervised fine-tuning to align the model with human preferences for response quality, correctness, and safety. It requires preference data with chosen/rejected response pairs.
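As an illustration, a single preference record with a chosen/rejected pair might look like the following. The field names here are an assumption for illustration, not a fixed schema from the InternVL training data:

```python
# Illustrative preference record; field names are an assumption, not a
# schema taken from the InternVL preference datasets.
record = {
    "image": "charts/rainfall.png",  # visual input for the multimodal prompt
    "question": "What unit does the y-axis use?",
    "chosen": "The y-axis is labeled in millimeters of rainfall.",  # preferred
    "rejected": "The chart does not label its axes.",  # dispreferred
}

# A pair is only usable if both responses answer the same prompt.
assert set(record) >= {"question", "chosen", "rejected"}
```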
Theoretical Basis
The DPO objective avoids explicit reward modeling by reparameterizing the reward in terms of the policy and a frozen reference model:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Where:
- $\pi_\theta$ is the policy model (trainable)
- $\pi_{\text{ref}}$ is the reference model (frozen)
- $y_w$ and $y_l$ are the chosen and rejected responses
- $\beta$ is the temperature parameter controlling deviation from the reference
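The sigmoid DPO term can be sketched in plain Python on summed per-response log-probabilities. This is a minimal single-pair sketch; real implementations operate on batched token-level log-probs:

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                     ref_chosen_logp: float, ref_rejected_logp: float,
                     beta: float = 0.1) -> float:
    """Sigmoid DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    # Implicit rewards are beta-scaled log-ratios against the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as log(1 + e^(-margin)).
    return math.log1p(math.exp(-margin))
```

When the policy agrees with the reference the margin is zero and the loss is log 2; it falls as the policy assigns relatively more probability to the chosen response.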
The BCO pair loss adds a binary classification term: the implicit reward of each chosen response is pushed above, and that of each rejected response below, a shared baseline. The NLL loss anchors the policy to next-token prediction on the chosen responses, so generation quality does not degrade while optimizing for preferences.
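Putting the three terms together, a minimal numeric sketch of the composite loss follows. The weights, the BCO baseline `delta`, and the length normalization of the NLL term are illustrative assumptions, not InternVL's exact implementation:

```python
import math

def softplus(x: float) -> float:
    """log(1 + e^x), computed stably."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             chosen_num_tokens: int, beta: float = 0.1, delta: float = 0.0,
             w_p: float = 0.8, w_q: float = 0.2, w_g: float = 1.0) -> float:
    """Composite MPO loss for one preference pair (weights illustrative)."""
    r_c = beta * (policy_chosen_logp - ref_chosen_logp)
    r_r = beta * (policy_rejected_logp - ref_rejected_logp)
    # Sigmoid DPO: prefer chosen over rejected. -log(sigmoid(m)) = softplus(-m).
    l_dpo = softplus(-(r_c - r_r))
    # BCO pair: push the chosen reward above and the rejected reward
    # below the baseline delta.
    l_bco = softplus(-(r_c - delta)) + softplus(r_r - delta)
    # NLL on the chosen response, normalized by its token count.
    l_nll = -policy_chosen_logp / chosen_num_tokens
    return w_p * l_dpo + w_q * l_bco + w_g * l_nll
```

Raising the policy's log-probability on the chosen response lowers all three terms at once, which is the intended joint training signal.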
InternVL's default MPO configuration:
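The original configuration values are not reproduced here. As an illustration only, a comparable mixed-loss setup can be expressed with TRL's `DPOConfig`; the weights and temperature below are assumptions, not InternVL's published defaults:

```python
# Hypothetical sketch of an MPO-style configuration using TRL's DPOConfig.
# All numeric values are illustrative assumptions.
from trl import DPOConfig

config = DPOConfig(
    loss_type=["sigmoid", "bco_pair"],  # mixed preference objectives
    loss_weights=[0.8, 0.2],            # relative term weights (illustrative)
    rpo_alpha=1.0,                      # adds the NLL term on chosen responses
    beta=0.1,                           # DPO temperature (illustrative)
)
```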