
Principle:OpenGVLab InternVL Multi Stage Pretraining Pipeline

From Leeroopedia


Knowledge Sources
Domains Pretraining, Vision_Language, Curriculum_Learning
Last Updated 2026-02-07 00:00 GMT

Overview

A curriculum-based pretraining strategy that progressively unfreezes model components across three stages to bridge vision and language representations.

Description

Multi-stage pretraining follows a curriculum in which progressively more model parameters become trainable:

  • Stage 1 (MLP Warmup): Only the MLP projector is trainable. The vision encoder and LLM are frozen. This stage trains the projector to map visual features to the LLM embedding space. Uses the highest learning rate (2e-4) and runs for 100K steps.
  • Stage 1.5 (ViT Incremental): The vision encoder is unfrozen alongside the MLP. The LLM remains frozen. This adapts the visual representations while preserving language capabilities. Uses a lower learning rate (1e-5) and stochastic depth (drop_path=0.1).
  • Stage 2 (Instruction Tuning): All components are unfrozen. The model learns to follow instructions and generate appropriate responses conditioned on visual input. Uses a moderate learning rate (4e-5) and switches to instruction-tuning data.
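The per-stage freezing described above is typically implemented by toggling `requires_grad` on each component's parameters. A minimal PyTorch sketch of the Stage 1 setup, using small `nn.Linear` stand-ins for the real submodules (the attribute names `vision_model`, `mlp1`, and `language_model` follow the document's pseudo-code; the layers themselves are placeholders, not the actual InternVL architecture):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Toy stand-ins for the three InternVL components.
model = nn.ModuleDict({
    "vision_model": nn.Linear(32, 16),   # stands in for the ViT
    "mlp1": nn.Linear(16, 16),           # stands in for the MLP projector
    "language_model": nn.Linear(16, 8),  # stands in for the LLM
})

# Stage 1 (MLP warmup): only the projector trains.
set_trainable(model["vision_model"], False)
set_trainable(model["language_model"], False)
set_trainable(model["mlp1"], True)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
# Only the mlp1.* parameters remain trainable in Stage 1.
```

Stage 1.5 and Stage 2 follow the same pattern, calling `set_trainable(..., True)` on the vision encoder and then on the language model.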

Usage

Use multi-stage pretraining when creating a new InternVL model from separately pretrained vision and language components. This is the standard recipe for training InternVL from scratch.
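The recipe can be captured as a small per-stage configuration table in code. The dictionary below mirrors the stage schedule in this page; the key names and schema are illustrative, not the official InternVL configuration format:

```python
# Hypothetical per-stage training configuration mirroring the recipe above.
STAGES = {
    "stage1":   {"trainable": ["mlp1"],
                 "lr": 2e-4, "steps": 100_000, "data": "pretrain_mixture"},
    "stage1_5": {"trainable": ["vision_model", "mlp1"],
                 "lr": 1e-5, "steps": 100_000, "data": "pretrain_mixture"},
    "stage2":   {"trainable": ["vision_model", "mlp1", "language_model"],
                 "lr": 4e-5, "steps": 5_500, "data": "finetune_mixture"},
}

def is_trainable(stage: str, component: str) -> bool:
    """Whether a component is unfrozen in the given stage."""
    return component in STAGES[stage]["trainable"]
```

A training driver can then iterate over `STAGES` in order and apply each entry's learning rate, step budget, and frozen set.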

Theoretical Basis

The progressive unfreezing follows curriculum learning principles:

Stage      Trainable         Frozen     LR    Steps  Data
Stage 1    MLP               ViT + LLM  2e-4  100K   Pretrain mixture
Stage 1.5  ViT + MLP         LLM        1e-5  100K   Pretrain mixture
Stage 2    ViT + MLP + LLM   None       4e-5  5.5K   Finetune mixture
# Pseudo-code: progressive unfreezing across the three stages
for stage in [1, 1.5, 2]:
    # Stage 1 starts from the separately pretrained ViT and LLM with a
    # freshly initialized MLP projector; later stages resume from the
    # previous stage's checkpoint.
    model = load_checkpoint(previous_stage_output)

    if stage == 1:        # MLP warmup, lr 2e-4
        freeze(model.vision_model)
        freeze(model.language_model)
        # Only model.mlp1 trains
    elif stage == 1.5:    # ViT incremental, lr 1e-5, drop_path 0.1
        unfreeze(model.vision_model)
        freeze(model.language_model)
    elif stage == 2:      # instruction tuning, lr 4e-5
        unfreeze_all(model)

    trainer.train()

