Workflow:OpenGVLab InternVL Preference Optimization

Knowledge Sources	InternVL MPO README, InternVL 2.5
Domains	VLMs, RLHF, Preference_Optimization, Multimodal
Last Updated	2026-02-07 14:00 GMT

Overview

End-to-end process for aligning InternVL models with Mixed Preference Optimization (MPO), which trains on preference pairs using a weighted combination of sigmoid (DPO) and BCO pair losses.

Description

This workflow implements preference optimization for InternVL models, a post-training alignment technique that improves model outputs by learning from preference pairs (chosen vs. rejected responses). The approach uses a custom MultimodalDPOTrainer that supports multiple loss functions, including sigmoid (standard DPO), BCO pair, hinge, and IPO losses. Training keeps a frozen reference model alongside the trainable policy model; the losses are computed from the difference between the two models' log-probabilities on the chosen and rejected responses. MPO combines several of these loss types as a weighted sum with configurable weights.

Usage

Run this workflow after supervised fine-tuning when you want to improve the model's response quality by training on preference data. This requires paired data where each example has a question with both a chosen (preferred) and a rejected (less preferred) response. The data follows the format of the MMPR multimodal preference dataset. This step is optional but can improve output quality, reduce hallucinations, and improve instruction following.

Execution Steps

Step 1: Prepare Preference Data

Construct or obtain preference pair datasets in the required format. Each example consists of a multimodal input (image/video + question) with a chosen response and a rejected response. The data can be constructed automatically using correctness-based sampling or VisualPRM-based methods provided in the repository.

Key considerations:

Each sample needs: image/video, question, chosen response, and rejected response
The MMPR-v1.1 dataset provides pre-built preference pairs for InternVL
Correctness-based construction: sample multiple model outputs and use ground truth to determine chosen/rejected
VisualPRM construction: use step-level reasoning quality to build preference pairs
Data is stored in JSONL format with question, chosen, and rejected fields
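
The correctness-based construction described above can be sketched as follows. This is a minimal illustration, not the repository's pipeline: the helper names, the `image` field, the `final_answer` key, and the exact-match check are all assumptions made for the example. Only the `question`, `chosen`, and `rejected` fields come from the format described here.

```python
import json
from itertools import product


def build_preference_pairs(question, image, sampled_answers, ground_truth, max_pairs=2):
    """Pair correct (chosen) with incorrect (rejected) sampled answers."""
    # Exact string match stands in for a real answer checker (assumption).
    correct = [a for a in sampled_answers if a["final_answer"] == ground_truth]
    wrong = [a for a in sampled_answers if a["final_answer"] != ground_truth]
    pairs = []
    for good, bad in product(correct, wrong):
        pairs.append({
            "image": image,  # hypothetical field name for the visual input
            "question": question,
            "chosen": good["response"],
            "rejected": bad["response"],
        })
        if len(pairs) >= max_pairs:
            break
    return pairs


def to_jsonl(records):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Toy model samples for a single question (made-up data).
samples = [
    {"response": "There are 3 apples, so the answer is 3.", "final_answer": "3"},
    {"response": "I count 4 apples, so the answer is 4.", "final_answer": "4"},
    {"response": "Three apples are visible. Answer: 3.", "final_answer": "3"},
]
pairs = build_preference_pairs(
    "How many apples are in the image?", "images/apples.jpg", samples, "3"
)
jsonl_text = to_jsonl(pairs)
```

A question whose samples are all correct or all incorrect yields no pairs in this sketch, since there is nothing to contrast.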

Step 2: Configure MPO Training

Set up the preference optimization configuration including loss type weights, learning rate, and reference model settings. The standard MPO configuration uses a weighted combination of sigmoid loss (weight 0.8) and BCO pair loss (weight 0.2).

Key considerations:

Loss type is set to sigmoid,bco_pair, which combines two preference losses
A very low learning rate (1e-6) is used to preserve capabilities learned during supervised training
All model components are unfrozen (vision, MLP, LLM)
Requires 256 GPUs for the standard 8B model configuration
A gradient accumulation of 256 is used to compensate for the per-device batch size of 1
Liger kernel optimization is enabled for efficiency
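
The settings above can be collected in a small configuration object. The sketch below is illustrative only: the class and field names are hypothetical and are not the training script's actual arguments. The numeric values are the ones stated in this step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MPOConfig:
    # Field names are hypothetical; values are taken from the step description.
    loss_type: str = "sigmoid,bco_pair"
    sigmoid_loss_weight: float = 0.8
    bco_pair_loss_weight: float = 0.2
    learning_rate: float = 1e-6
    per_device_batch_size: int = 1
    gradient_accumulation_steps: int = 256
    num_gpus: int = 256
    freeze_vision: bool = False  # all components are trained
    freeze_mlp: bool = False
    freeze_llm: bool = False
    use_liger: bool = True

    def loss_weights(self):
        """Map each configured loss name to its weight."""
        weights = {
            "sigmoid": self.sigmoid_loss_weight,
            "bco_pair": self.bco_pair_loss_weight,
        }
        names = self.loss_type.split(",")
        unknown = [n for n in names if n not in weights]
        if unknown:
            raise ValueError(f"no weight configured for: {unknown}")
        return {n: weights[n] for n in names}

    @property
    def effective_batch_size(self):
        """Sequences consumed per optimizer step across all devices."""
        return (
            self.per_device_batch_size
            * self.gradient_accumulation_steps
            * self.num_gpus
        )


cfg = MPOConfig()
weights = cfg.loss_weights()
```

Taken literally, the listed values multiply out to 65,536 sequences per optimizer step. If 256 is instead the intended global batch size, the accumulation count would be that number divided by the GPU count and the per-device batch size, so it is worth checking against the launch script.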

Step 3: Load Model and Reference

Load the supervised fine-tuned InternVL model as both the trainable model and the reference model. The reference model is set to evaluation mode and its parameters are frozen. During training, both models process the same inputs, and the loss is computed from the difference in their log-probabilities for chosen versus rejected responses.

Key considerations:

Both models are initialized from the same supervised fine-tuned checkpoint
The reference model stays frozen throughout training (eval mode)
DeepSpeed ZeRO Stage 1 is used for distributed training
Weight memory is approximately doubled because two model copies are held; the frozen reference model needs no gradients or optimizer state
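
A minimal PyTorch sketch of the policy/reference setup follows. It assumes a tiny stand-in network in place of InternVL and uses `copy.deepcopy` where the real workflow would load the same checkpoint twice; the helper name `make_reference` is hypothetical.

```python
import copy

import torch
from torch import nn


def make_reference(policy: nn.Module) -> nn.Module:
    """Return a frozen, eval-mode copy of the policy to act as the reference."""
    reference = copy.deepcopy(policy)  # stand-in for loading the checkpoint again
    reference.eval()                   # disables dropout and similar train-time behavior
    reference.requires_grad_(False)    # no gradients, so no optimizer state either
    return reference


# Tiny stand-in for the supervised fine-tuned model (assumption).
policy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
reference = make_reference(policy)

x = torch.randn(2, 8)
with torch.no_grad():        # the reference forward pass never builds a graph
    ref_out = reference(x)
policy_out = policy(x)       # the policy forward pass keeps gradients
```

Before any update the two models produce identical outputs, which is why the preference losses start from a neutral value.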

Step 4: Train with MPO Loss

Run the preference optimization training loop using the MultimodalDPOTrainer. For each batch, the trainer computes forward passes through both the policy model and the reference model, then calculates the combined preference loss to update only the policy model's weights.
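
The combined loss can be sketched for a single preference pair in plain Python. The formulas follow the commonly used sigmoid (DPO) and BCO pair forms; `beta`, `delta` (a running mean of rewards in typical BCO implementations), and the function names are assumptions for illustration. The actual MultimodalDPOTrainer may include further terms or options not shown here.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def sigmoid_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """Standard DPO loss on sequence log-probabilities."""
    logits = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return -log_sigmoid(beta * logits)


def bco_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta, delta):
    """BCO pair loss: push chosen reward above delta and rejected reward below it."""
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    return -log_sigmoid(chosen_reward - delta) - log_sigmoid(-(rejected_reward - delta))


def mpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta=0.1, delta=0.0, w_sigmoid=0.8, w_bco=0.2):
    """Weighted sum of the two preference losses (weights from the MPO config)."""
    args = (policy_chosen, policy_rejected, ref_chosen, ref_rejected)
    return (w_sigmoid * sigmoid_loss(*args, beta)
            + w_bco * bco_pair_loss(*args, beta, delta))


# Policy identical to the reference: the loss sits at its neutral value.
neutral = mpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has raised the chosen and lowered the rejected log-probability.
improved = mpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference does, which is the signal that drives the update.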

Key considerations:

Each sample carries 6 text tensors: input_ids, labels, and attention_mask for both the chosen and the rejected response
Loss is computed as a weighted sum of sigmoid and BCO pair losses
Only the policy model is updated; the reference model remains unchanged
Training typically runs for 1 epoch over the preference dataset
Per-sample data collation handles the dual chosen/rejected format
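
The dual chosen/rejected layout can be sketched with plain lists. This is a simplified illustration: the key names, the pad token id, and the helper functions are assumptions, and image tensors are omitted. The value -100 is the label index that PyTorch's cross-entropy ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def pad(seq, length, value):
    """Right-pad a list to the given length."""
    return list(seq) + [value] * (length - len(seq))


def make_sample(prompt, chosen, rejected):
    """Build the six per-sample sequences from token id lists."""
    sample = {}
    for side, response in (("chosen", chosen), ("rejected", rejected)):
        ids = prompt + response
        sample[f"{side}_input_ids"] = ids
        # Only response tokens contribute to the log-probability.
        sample[f"{side}_labels"] = [IGNORE_INDEX] * len(prompt) + response
        sample[f"{side}_attention_mask"] = [1] * len(ids)
    return sample


def collate_preference_batch(samples, pad_token_id=0):
    """Pad chosen and rejected sequences independently to their own max length."""
    fill = {"input_ids": pad_token_id, "labels": IGNORE_INDEX, "attention_mask": 0}
    batch = {}
    for side in ("chosen", "rejected"):
        max_len = max(len(s[f"{side}_input_ids"]) for s in samples)
        for field, value in fill.items():
            key = f"{side}_{field}"
            batch[key] = [pad(s[key], max_len, value) for s in samples]
    return batch


samples = [make_sample([5, 6], [7, 8, 9], [10]), make_sample([5], [11], [12, 13])]
batch = collate_preference_batch(samples)
```

With a per-device batch size of 1 the padding is a no-op, but the same structure extends to larger batches.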

Step 5: Export Aligned Model

Save the aligned model checkpoint. The trainer copies necessary model configuration files to the output directory alongside the trained weights. The aligned model can be used directly for inference.

Key considerations:

Model files are copied to the output directory for self-contained deployment
The reference model is discarded after training
The aligned model replaces the base model for downstream use
Verify alignment quality on representative examples before deployment
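
Copying supporting files next to the trained weights can be sketched with the standard library. The helper name, the file patterns, and the directory and file names in the demo are all assumptions; which files the actual trainer copies is not specified here.

```python
import shutil
import tempfile
from pathlib import Path


def copy_model_files(src_dir, out_dir, patterns=("*.json", "*.py")):
    """Copy config and code files into the output directory without overwriting."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for path in sorted(src.glob(pattern)):
            target = out / path.name
            # Keep anything the trainer already wrote (for example its own config).
            if path.is_file() and not target.exists():
                shutil.copy2(path, target)
                copied.append(path.name)
    return copied


# Demo on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sft_checkpoint"
    out = Path(tmp) / "mpo_output"
    src.mkdir()
    (src / "config.json").write_text("{}")
    (src / "modeling_example.py").write_text("# custom model code\n")
    (src / "model.safetensors").write_text("weights placeholder")
    copied = copy_model_files(src, out)
    exported = sorted(p.name for p in out.iterdir())
```

Weight files are deliberately left out of the patterns, since the trained weights are saved by the trainer rather than copied from the starting checkpoint.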
