
Principle:OpenGVLab InternVL Mixed Preference Optimization

From Leeroopedia


Knowledge Sources
Domains Alignment, Reinforcement_Learning, Vision_Language
Last Updated 2026-02-07 00:00 GMT

Overview

A composite preference optimization technique that combines multiple loss functions (sigmoid DPO, BCO pair, and NLL) to align vision-language models with human preferences.

Description

Mixed Preference Optimization (MPO) extends Direct Preference Optimization (DPO) for multimodal models by combining multiple loss objectives into a single training signal. Standard DPO trains models to prefer chosen responses over rejected ones using a sigmoid loss. MPO augments this with:

  • Sigmoid DPO loss: Standard preference optimization based on implicit reward differences
  • BCO pair loss: Binary classifier optimization that provides an additional preference signal
  • NLL loss (via rpo_alpha): Standard next-token prediction loss on chosen responses, ensuring the model maintains generation quality

The composite loss is: L = w_sigmoid · L_DPO + w_bco · L_BCO + α_rpo · L_NLL

MPO also requires a frozen reference model — an identical copy of the policy model that provides the baseline log-probabilities for the KL divergence term in the DPO objective.
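The combination above can be sketched for a single preference pair. This is a minimal, dependency-free illustration assuming sequence-level log-probabilities are already computed; the fixed `delta` in the BCO term is a simplification (real implementations track it as a running statistic of the implicit rewards):

```python
import math

def logsigmoid(x):
    # Numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def mpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
             beta=0.1, w_sigmoid=0.8, w_bco=0.2, alpha_rpo=1.0, delta=0.0):
    """Composite MPO loss for one chosen/rejected pair.

    Inputs are sequence-level log-probs under the policy (pi_*) and the
    frozen reference model (ref_*). delta is the BCO reward shift; a fixed
    value here is a simplifying assumption.
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model
    r_w = beta * (pi_chosen - ref_chosen)
    r_l = beta * (pi_rejected - ref_rejected)

    # Sigmoid DPO loss: prefer the chosen response over the rejected one
    dpo = -logsigmoid(r_w - r_l)

    # BCO pair loss: classify chosen as positive, rejected as negative
    bco = -logsigmoid(r_w - delta) - logsigmoid(-(r_l - delta))

    # NLL term: negative log-prob of the chosen response (generation quality)
    nll = -pi_chosen

    return w_sigmoid * dpo + w_bco * bco + alpha_rpo * nll
```

Widening the chosen-over-rejected margin lowers the loss, while the NLL term keeps the absolute likelihood of the chosen response from collapsing.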

Usage

Use MPO after supervised fine-tuning to align the model with human preferences for response quality, correctness, and safety. It requires preference data with chosen/rejected response pairs.
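Such preference pairs are typically stored as simple records. A sketch of one multimodal example follows; the field names use the common TRL-style convention ("prompt"/"chosen"/"rejected") and are an assumption here, not InternVL's exact schema:

```python
# One preference record: a shared prompt (with an image placeholder token)
# plus a chosen and a rejected response. Field names are illustrative.
pair = {
    "prompt": "<image>\nHow many birds are in the picture?",
    "chosen": "There are three birds perched on the branch.",
    "rejected": "There are five birds.",
}
```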

Theoretical Basis

The DPO objective avoids explicit reward modeling by reparameterizing the reward function:

L_DPO(θ) = −𝔼_(x, y_w, y_l) [ log σ( β log ( π_θ(y_w|x) / π_ref(y_w|x) ) − β log ( π_θ(y_l|x) / π_ref(y_l|x) ) ) ]

Where:

  • π_θ is the policy model (trainable)
  • π_ref is the reference model (frozen)
  • y_w, y_l are the chosen and rejected responses
  • β is the temperature parameter controlling how strongly the policy is penalized for deviating from the reference model
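The sequence-level log-probabilities π_θ(y|x) in this objective are obtained by summing per-token log-probs over the response positions only. A minimal sketch, assuming a tokenizer and forward pass have already produced per-token log-probs and a prompt/response mask:

```python
import math

def sequence_logprob(token_logprobs, response_mask):
    """log pi(y|x): sum of per-token log-probs over response positions.

    Prompt tokens are masked out (mask = 0) so only the response y
    contributes, matching the pi(y_w|x) and pi(y_l|x) terms above.
    """
    return sum(lp for lp, m in zip(token_logprobs, response_mask) if m)

def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """-log sigmoid of the beta-scaled implicit-reward margin."""
    margin = beta * (policy_w - ref_w) - beta * (policy_l - ref_l)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy assigns a larger margin to the chosen response than the reference does, the argument of σ is positive and the loss falls below log 2.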

The BCO pair loss adds a binary classification term, and the NLL loss ensures the model does not degrade in generation quality while optimizing for preferences.
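The shift term used by the BCO classifier is typically a running mean of the implicit rewards observed during training. A sketch of that bookkeeping, using a simple incremental mean (the exact statistic a given implementation tracks may differ):

```python
class RunningDelta:
    """Running mean of implicit rewards, used as the BCO shift term.

    A sketch of the idea: rewards for chosen and rejected responses are
    streamed in, and their running mean centers the binary classifier.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, rewards):
        # Incremental mean update over a batch of scalar rewards
        for r in rewards:
            self.count += 1
            self.mean += (r - self.mean) / self.count
        return self.mean
```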

InternVL's default MPO configuration:

  • w_sigmoid = 0.8
  • w_bco = 0.2
  • α_rpo = 1.0
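For reference, recent versions of Hugging Face TRL expose this loss combination through the DPO trainer. The snippet below mirrors the defaults above; the parameter names (`loss_type` as a list, `loss_weights`, `rpo_alpha`) follow TRL's documented MPO recipe and are an assumption if your version differs:

```python
from trl import DPOConfig

# Mixing sigmoid DPO and BCO pair losses with InternVL's default weights,
# plus the NLL term via rpo_alpha. Requires a TRL version that accepts a
# list of loss types; treat the exact names as version-dependent.
config = DPOConfig(
    loss_type=["sigmoid", "bco_pair"],
    loss_weights=[0.8, 0.2],
    rpo_alpha=1.0,
    beta=0.1,  # DPO temperature; 0.1 is a common default, not InternVL-specific
)
```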
