
Principle:OpenGVLab InternVL Multi Stage Pretraining Pipeline

From Leeroopedia


Knowledge Sources
Domains Pretraining, Vision_Language, Curriculum_Learning
Last Updated 2026-02-07 00:00 GMT

Overview

A curriculum-based pretraining strategy that progressively unfreezes model components across three stages to bridge vision and language representations.

Description

Multi-stage pretraining follows a curriculum in which progressively more model parameters become trainable:

  • Stage 1 (MLP Warmup): Only the MLP projector is trainable. The vision encoder and LLM are frozen. This stage trains the projector to map visual features to the LLM embedding space. Uses the highest learning rate (2e-4) and runs for 100K steps.
  • Stage 1.5 (ViT Incremental): The vision encoder is unfrozen alongside the MLP. The LLM remains frozen. This adapts the visual representations while preserving language capabilities. Uses a lower learning rate (1e-5) and stochastic depth (drop_path=0.1).
  • Stage 2 (Instruction Tuning): All components are unfrozen. The model learns to follow instructions and generate appropriate responses conditioned on visual input. Uses a moderate learning rate (4e-5) and switches to instruction-tuning data.
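The per-stage freezing described above is typically implemented by toggling `requires_grad` on each component's parameters. A minimal PyTorch sketch of the Stage 1 setup, using small `nn.Linear` stand-ins for the real submodules (the attribute names `vision_model`, `mlp1`, and `language_model` follow the document's pseudo-code; the layers themselves are placeholders, not the actual InternVL architecture):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Toy stand-ins for the three InternVL components.
model = nn.ModuleDict({
    "vision_model": nn.Linear(32, 16),   # stands in for the ViT
    "mlp1": nn.Linear(16, 16),           # stands in for the MLP projector
    "language_model": nn.Linear(16, 8),  # stands in for the LLM
})

# Stage 1 (MLP warmup): only the projector trains.
set_trainable(model["vision_model"], False)
set_trainable(model["language_model"], False)
set_trainable(model["mlp1"], True)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
# Only the mlp1.* parameters remain trainable in Stage 1.
```

Stage 1.5 and Stage 2 follow the same pattern, calling `set_trainable(..., True)` on the vision encoder and then on the language model.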

Usage

Use multi-stage pretraining when creating a new InternVL model from separately pretrained vision and language components. This is the standard recipe for training InternVL from scratch.
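The recipe can be captured as a small per-stage configuration table in code. The dictionary below mirrors the stage schedule in this page; the key names and schema are illustrative, not the official InternVL configuration format:

```python
# Hypothetical per-stage training configuration mirroring the recipe above.
STAGES = {
    "stage1":   {"trainable": ["mlp1"],
                 "lr": 2e-4, "steps": 100_000, "data": "pretrain_mixture"},
    "stage1_5": {"trainable": ["vision_model", "mlp1"],
                 "lr": 1e-5, "steps": 100_000, "data": "pretrain_mixture"},
    "stage2":   {"trainable": ["vision_model", "mlp1", "language_model"],
                 "lr": 4e-5, "steps": 5_500, "data": "finetune_mixture"},
}

def is_trainable(stage: str, component: str) -> bool:
    """Whether a component is unfrozen in the given stage."""
    return component in STAGES[stage]["trainable"]
```

A training driver can then iterate over `STAGES` in order and apply each entry's learning rate, step budget, and frozen set.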

Theoretical Basis

The progressive unfreezing follows curriculum learning principles:

Stage      Trainable         Frozen     LR    Steps  Data
Stage 1    MLP               ViT + LLM  2e-4  100K   Pretrain mixture
Stage 1.5  ViT + MLP         LLM        1e-5  100K   Pretrain mixture
Stage 2    ViT + MLP + LLM   None       4e-5  5.5K   Finetune mixture
# Pseudo-code: progressive unfreezing across the three stages
for stage in [1, 1.5, 2]:
    # Stage 1 starts from the separately pretrained ViT and LLM with a
    # freshly initialized MLP projector; later stages resume from the
    # previous stage's checkpoint.
    model = load_checkpoint(previous_stage_output)

    if stage == 1:        # MLP warmup, lr 2e-4
        freeze(model.vision_model)
        freeze(model.language_model)
        # Only model.mlp1 trains
    elif stage == 1.5:    # ViT incremental, lr 1e-5, drop_path 0.1
        unfreeze(model.vision_model)
        freeze(model.language_model)
    elif stage == 2:      # instruction tuning, lr 4e-5
        unfreeze_all(model)

    trainer.train()

