Workflow:OpenGVLab InternVL Multi Stage Pretraining

Knowledge Sources	InternVL InternVL Chat README InternVL 2.5
Domains	VLMs, Pretraining, Multimodal, Distributed_Training
Last Updated	2026-02-07 14:00 GMT

Overview

End-to-end multi-stage pretraining pipeline for building InternVL vision-language models from separate vision encoder and language model components.

Description

This workflow covers the complete multi-stage training pipeline used to produce InternVL models from scratch. The process follows a progressive unfreezing strategy across three stages: Stage 1 trains only the MLP projector to align vision and language representations, Stage 1.5 unfreezes all components for intermediate alignment with vision-centric data, and Stage 2 performs full supervised instruction tuning on diverse multimodal tasks. Each stage uses packed sequence training for GPU efficiency and DeepSpeed ZeRO for distributed training across hundreds of GPUs.

Usage

Execute this workflow when you need to train a new InternVL model from pre-existing vision encoder and language model components, or when you want to reproduce the official InternVL training pipeline. This requires significant compute resources (512+ GPUs) and large-scale training datasets. Most users should use the 2nd finetune or LoRA workflows instead, which start from an already-trained InternVL checkpoint.

Execution Steps

Step 1: Prepare Component Models

Obtain the pre-trained vision encoder (InternViT-300M-448px or InternViT-6B) and language model (InternLM2.5, Qwen2.5, etc.) checkpoints. These serve as the frozen initialization for the composite InternVL model. The MLP projector is initialized randomly and will be trained from scratch.

Key considerations:

Vision encoder and LLM are loaded from separate checkpoint directories
The MLP projector connects the vision encoder output space to the LLM input space
Pixel shuffle downsampling reduces vision tokens by 4:1 spatially
Select component sizes to match your target model (e.g., InternViT-300M + InternLM2.5-7B for the 8B model)

Step 2: Prepare Multi-Stage Datasets

Assemble the training data mixtures for each stage. Stage 1 uses image-caption pairs for vision-language alignment. Stage 1.5 uses vision-centric data with richer visual understanding tasks. Stage 2 uses diverse instruction-following multimodal data including conversations, VQA, OCR, and reasoning tasks.

Key considerations:

Each stage requires a separate dataset meta-file defining the mixture
Stage 1 focuses on caption-style data for basic alignment
Stage 2 includes instruction-following data across many multimodal tasks
Data is stored in JSONL format with the same conversation schema used for fine-tuning
Video data support with configurable frame sampling (8-32 frames)

Step 3: Stage 1 MLP Warmup

Train only the MLP projector while keeping the vision encoder and LLM completely frozen. This stage learns the projection from vision feature space to language embedding space. Uses packed sequence training with sequences up to 16384 tokens for efficient GPU utilization.

Key considerations:

Only MLP projector parameters are trainable (vision and LLM frozen)
Learning rate of 2e-4 with cosine scheduler
Trained for 100,000 steps with packed sequences
Uses 512 GPUs with per-device batch size of 1 (packed training makes effective batch larger)
DeepSpeed ZeRO Stage 1 is sufficient since few parameters are updated

Step 4: Stage 1.5 ViT Incremental Learning

Unfreeze all components (vision encoder, MLP, and LLM) for intermediate alignment. This stage uses vision-centric data to improve the model's visual understanding capabilities while maintaining language abilities. Applied only to certain model sizes (8B, 26B).

Key considerations:

All three components (ViT, MLP, LLM) are now trainable
Uses a lower learning rate than Stage 1 for stable joint training
This stage is optional and not applied to all model configurations
The checkpoint from this stage serves as initialization for Stage 2

Step 5: Stage 2 Full Instruction Tuning

Unfreeze all components and train on diverse instruction-following multimodal data. This is the final pretraining stage that gives the model its general multimodal conversation abilities. Drop path regularization (0.1) is applied to prevent overfitting.

Key considerations:

All parameters are trainable with a lower learning rate (4e-5)
Drop path rate of 0.1 is applied for regularization
Trained for 5,500 steps with packed sequences
Uses 512 GPUs for the 8B model size
Packed sequence training packs multiple samples into 16384-token sequences
Produces the final instruction-tuned model checkpoint

Step 6: Validate Trained Model

Run the trained model through the evaluation benchmark suite to verify quality. Compare results against published baselines for the target model size. Convert checkpoint format if needed for deployment.

Key considerations:

Run evaluation on standard benchmarks (MMBench, MMMU, MathVista, etc.)
Compare against published results for the corresponding model size
Use format conversion tools to export to HuggingFace Hub format

Execution Diagram

GitHub URL

Workflow Repository