Principle: Haotian Liu LLaVA Feature Alignment Pretraining
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Training strategy that aligns visual features from a frozen vision encoder with a frozen language model by training only a lightweight projection layer. This is Stage 1 of LLaVA's two-stage training pipeline, focused on learning a mapping between the CLIP visual feature space and the LLM's text embedding space.
Description
Feature alignment pretraining (Stage 1 of LLaVA training) trains only the multimodal projector while keeping both the vision encoder (CLIP ViT-L/14-336) and the language model (e.g., Vicuna-13b-v1.5) completely frozen. The projector is a 2-layer MLP with GELU activation (mlp2x_gelu), mapping from CLIP's hidden dimension to the LLM's hidden dimension.
This stage uses 558K image-caption pairs from a filtered CC3M subset (blip_laion_cc_sbu_558k.json) with the plain conversation format, where each sample is a simple image-caption pair:
- User: <image>
- Assistant: {caption text}
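As an illustration, a record in the 558K JSON file can be sketched as below. The field names follow LLaVA's conversation JSON layout; the specific id, image path, and caption are made-up placeholders.

```python
# Illustrative sample in LLaVA's conversation JSON layout; the id, image path,
# and caption text are invented placeholders, not real dataset entries.
sample = {
    "id": "000000001",
    "image": "00000/000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>"},
        {"from": "gpt", "value": "a photo of a dog on a beach"},
    ],
}

# In the 'plain' format there is no system prompt or chat template: the human
# turn reduces to the image token and the assistant turn is the raw caption.
prompt = sample["conversations"][0]["value"] + "\n" + sample["conversations"][1]["value"]
print(prompt)
```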
The freezing mechanism operates as follows:
model.requires_grad_(False)   # freezes all parameters (LLM + vision encoder + projector)
for p in model.get_model().mm_projector.parameters():
    p.requires_grad = True    # selectively unfreezes only the projector parameters
This is triggered by the CLI argument --tune_mm_mlp_adapter True, which sets model_args.tune_mm_mlp_adapter = True.
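The freeze-all-then-unfreeze-the-projector pattern can be sketched with a toy stand-in model. Only the attribute name `mm_projector` mirrors LLaVA; the module shapes and the other attribute names are illustrative.

```python
import torch.nn as nn

# Toy stand-in for the LLaVA model; only 'mm_projector' mirrors the real
# attribute name, the submodules and dimensions are illustrative.
class ToyLlavaModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.language_model = nn.Linear(64, 64)   # stand-in for the frozen LLM
        self.vision_tower = nn.Linear(32, 64)     # stand-in for the frozen CLIP encoder
        self.mm_projector = nn.Sequential(        # the only trainable part in Stage 1
            nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 64)
        )

model = ToyLlavaModel()
model.requires_grad_(False)                # 1) freeze every parameter
for p in model.mm_projector.parameters():  # 2) unfreeze only the projector
    p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only mm_projector.* parameters remain trainable
```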
Usage
Use this as the first stage when training a LLaVA model from scratch. Run before visual instruction tuning (Stage 2). Key characteristics:
- Parameters trained: ~31.5M (projector only, out of ~13B total)
- Training speed: Fast due to minimal gradient computation -- only the projector's small parameter set generates gradients
- Memory efficiency: ZeRO-2 is sufficient since optimizer states are small
- Training duration: 1 epoch over the 558K dataset
- Learning rate: 1e-3 (relatively high, appropriate for a randomly initialized projector)
- Batch size: 32 per GPU
The output of this stage is a mm_projector.bin file containing only the trained projector weights, which is loaded in Stage 2 via --pretrain_mm_mlp_adapter.
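A projector-only checkpoint like mm_projector.bin can be produced by filtering the full state dict on the `mm_projector` key prefix. This is a minimal sketch with a toy model; LLaVA's own saving code differs in detail.

```python
import torch
import torch.nn as nn

# Toy model: 'backbone' stands in for the frozen LLM/vision weights we do NOT
# want to save; 'mm_projector' mirrors the real attribute name.
model = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),
    "mm_projector": nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 8)),
})

# Keep only the projector entries from the full state_dict.
projector_state = {k: v for k, v in model.state_dict().items() if "mm_projector" in k}
torch.save(projector_state, "mm_projector.bin")

# Stage 2 would reload only these weights into a freshly built projector.
restored = torch.load("mm_projector.bin")
print(sorted(restored.keys()))
```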
Theoretical Basis
The multimodal projector learns a function f: R^(N x d_v) -> R^(N x d_l) that maps N visual tokens from the CLIP feature dimension d_v (1024 for ViT-L/14) to the LLM's embedding dimension d_l (5120 for Vicuna-13B). For LLaVA v1.5 with mlp2x_gelu, the architecture is:
Projector Architecture (mlp2x_gelu):
Linear(d_v, d_l) # 1024 -> 5120
GELU()
Linear(d_l, d_l) # 5120 -> 5120
Total Parameters: d_v * d_l + d_l + d_l * d_l + d_l
= 1024 * 5120 + 5120 + 5120 * 5120 + 5120
= 5,242,880 + 5,120 + 26,214,400 + 5,120
≈ 31.5M parameters
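The arithmetic above can be sanity-checked by building the mlp2x_gelu projector with PyTorch and counting its parameters:

```python
import torch.nn as nn

# Build the mlp2x_gelu projector with d_v = 1024 (CLIP ViT-L/14) and
# d_l = 5120 (Vicuna-13B), then count trainable parameters.
d_v, d_l = 1024, 5120
projector = nn.Sequential(
    nn.Linear(d_v, d_l),  # 1024*5120 weights + 5120 biases
    nn.GELU(),            # no parameters
    nn.Linear(d_l, d_l),  # 5120*5120 weights + 5120 biases
)
n_params = sum(p.numel() for p in projector.parameters())
print(n_params)  # 31,467,520 ≈ 31.5M
```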
The projector is constructed by build_vision_projector() in llava/model/multimodal_projector/builder.py:
# build_vision_projector() for 'mlp2x_gelu'
mlp_gelu_match = re.match(r'^mlp(\d+)x_gelu$', projector_type)
if mlp_gelu_match:
    mlp_depth = int(mlp_gelu_match.group(1))  # 2 for 'mlp2x_gelu'
    modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
    for _ in range(1, mlp_depth):
        modules.append(nn.GELU())
        modules.append(nn.Linear(config.hidden_size, config.hidden_size))
    return nn.Sequential(*modules)
During pretraining, the frozen vision encoder produces visual features that the projector must learn to translate into the language model's semantic space. Because both the encoder and LLM are frozen, the projector must adapt entirely -- learning to map CLIP's visual representations into tokens that the LLM interprets as meaningful visual descriptions.
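The shape of this mapping can be traced end to end. For the 336px input, CLIP ViT-L/14 produces a 24x24 patch grid, i.e. N = 576 visual tokens of dimension d_v = 1024, which the projector maps to d_l = 5120 so they can be spliced into the LLM's embedding sequence. The random tensor below merely stands in for frozen CLIP features.

```python
import torch
import torch.nn as nn

# mlp2x_gelu projector: 1024 (CLIP ViT-L/14) -> 5120 (Vicuna-13B)
projector = nn.Sequential(nn.Linear(1024, 5120), nn.GELU(), nn.Linear(5120, 5120))

# Stand-in for frozen CLIP output: (batch, N=576 patch tokens, d_v=1024)
visual_features = torch.randn(1, 576, 1024)
visual_tokens = projector(visual_features)  # (batch, 576, d_l=5120) for the LLM
print(visual_tokens.shape)  # torch.Size([1, 576, 5120])
```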
The training configuration for Stage 1:
| Parameter | Value |
|---|---|
| Vision encoder | openai/clip-vit-large-patch14-336 (frozen) |
| Language model | lmsys/vicuna-13b-v1.5 (frozen) |
| Projector type | mlp2x_gelu (trained) |
| Dataset | 558K image-caption pairs |
| Epochs | 1 |
| Batch size | 32 per GPU |
| Learning rate | 1e-3 (cosine schedule, 3% warmup) |
| DeepSpeed config | ZeRO-2 (scripts/zero2.json) |
| Precision | BF16 |
| Max sequence length | 2048 |