Workflow:OpenGVLab InternVL Multi Stage Pretraining
| Knowledge Sources | |
|---|---|
| Domains | VLMs, Pretraining, Multimodal, Distributed_Training |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
End-to-end multi-stage pretraining pipeline for building InternVL vision-language models from separate vision encoder and language model components.
Description
This workflow covers the complete multi-stage training pipeline used to produce InternVL models from scratch. The process follows a progressive unfreezing strategy across three stages: Stage 1 trains only the MLP projector to align vision and language representations, Stage 1.5 unfreezes all components for intermediate alignment with vision-centric data, and Stage 2 performs full supervised instruction tuning on diverse multimodal tasks. Each stage uses packed sequence training for GPU efficiency and DeepSpeed ZeRO for distributed training across hundreds of GPUs.
Usage
Execute this workflow when you need to train a new InternVL model from pre-existing vision encoder and language model components, or when you want to reproduce the official InternVL training pipeline. This requires significant compute resources (512+ GPUs) and large-scale training datasets. Most users should use the 2nd finetune or LoRA workflows instead, which start from an already-trained InternVL checkpoint.
Execution Steps
Step 1: Prepare Component Models
Obtain the pre-trained vision encoder (InternViT-300M-448px or InternViT-6B) and language model (InternLM2.5, Qwen2.5, etc.) checkpoints. These serve as the frozen initialization for the composite InternVL model. The MLP projector is initialized randomly and will be trained from scratch.
Key considerations:
- Vision encoder and LLM are loaded from separate checkpoint directories
- The MLP projector connects the vision encoder output space to the LLM input space
- Pixel shuffle downsampling reduces vision tokens by 4:1 spatially
- Select component sizes to match your target model (e.g., InternViT-300M + InternLM2.5-7B for the 8B model)
Step 2: Prepare Multi-Stage Datasets
Assemble the training data mixtures for each stage. Stage 1 uses image-caption pairs for vision-language alignment. Stage 1.5 uses vision-centric data with richer visual understanding tasks. Stage 2 uses diverse instruction-following multimodal data including conversations, VQA, OCR, and reasoning tasks.
Key considerations:
- Each stage requires a separate dataset meta-file defining the mixture
- Stage 1 focuses on caption-style data for basic alignment
- Stage 2 includes instruction-following data across many multimodal tasks
- Data is stored in JSONL format with the same conversation schema used for fine-tuning
- Video data support with configurable frame sampling (8-32 frames)
Step 3: Stage 1 MLP Warmup
Train only the MLP projector while keeping the vision encoder and LLM completely frozen. This stage learns the projection from vision feature space to language embedding space. Uses packed sequence training with sequences up to 16384 tokens for efficient GPU utilization.
Key considerations:
- Only MLP projector parameters are trainable (vision and LLM frozen)
- Learning rate of 2e-4 with cosine scheduler
- Trained for 100,000 steps with packed sequences
- Uses 512 GPUs with per-device batch size of 1 (packed training makes effective batch larger)
- DeepSpeed ZeRO Stage 1 is sufficient since few parameters are updated
Step 4: Stage 1.5 ViT Incremental Learning
Unfreeze all components (vision encoder, MLP, and LLM) for intermediate alignment. This stage uses vision-centric data to improve the model's visual understanding capabilities while maintaining language abilities. Applied only to certain model sizes (8B, 26B).
Key considerations:
- All three components (ViT, MLP, LLM) are now trainable
- Uses a lower learning rate than Stage 1 for stable joint training
- This stage is optional and not applied to all model configurations
- The checkpoint from this stage serves as initialization for Stage 2
Step 5: Stage 2 Full Instruction Tuning
Unfreeze all components and train on diverse instruction-following multimodal data. This is the final pretraining stage that gives the model its general multimodal conversation abilities. Drop path regularization (0.1) is applied to prevent overfitting.
Key considerations:
- All parameters are trainable with a lower learning rate (4e-5)
- Drop path rate of 0.1 is applied for regularization
- Trained for 5,500 steps with packed sequences
- Uses 512 GPUs for the 8B model size
- Packed sequence training packs multiple samples into 16384-token sequences
- Produces the final instruction-tuned model checkpoint
Step 6: Validate Trained Model
Run the trained model through the evaluation benchmark suite to verify quality. Compare results against published baselines for the target model size. Convert checkpoint format if needed for deployment.
Key considerations:
- Run evaluation on standard benchmarks (MMBench, MMMU, MathVista, etc.)
- Compare against published results for the corresponding model size
- Use format conversion tools to export to HuggingFace Hub format