
Principle:Haotian Liu LLaVA Feature Alignment Pretraining

From Leeroopedia
Last Updated 2026-02-13 00:00 GMT

Overview

A training strategy that aligns visual features from a frozen vision encoder with the embedding space of a frozen language model by training only a lightweight projection layer. This is Stage 1 of LLaVA's two-stage training pipeline, focused on learning a mapping between the CLIP visual feature space and the LLM's text embedding space.

Description

Feature alignment pretraining (Stage 1 of LLaVA training) trains only the multimodal projector while keeping both the vision encoder (CLIP ViT-L/14-336) and the language model (e.g., Vicuna-13b-v1.5) completely frozen. The projector is a 2-layer MLP with GELU activation (mlp2x_gelu), mapping from CLIP's hidden dimension to the LLM's hidden dimension.

This stage uses 558K image-caption pairs from a filtered CC3M subset (blip_laion_cc_sbu_558k.json) with the plain conversation format, where each sample is a simple image-caption pair:

  • User: <image>
  • Assistant: {caption text}
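In the underlying JSON (blip_laion_cc_sbu_558k.json), a record roughly takes the following shape. The id, image path, and caption below are hypothetical placeholders; under the plain template the human turn is reduced to the <image> token:

# Schematic Stage 1 sample (hypothetical values; layout follows LLaVA's conversation JSON)
sample = {
    "id": "000000001",
    "image": "00000/000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>"},
        {"from": "gpt", "value": "a wooden bench beside a lake at sunset"},
    ],
}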

The freezing mechanism operates as follows:

  1. model.requires_grad_(False) -- Freezes all parameters in the entire model (LLM + vision encoder + projector)
  2. for p in model.get_model().mm_projector.parameters(): p.requires_grad = True -- Selectively unfreezes only the projector parameters

This is triggered by the CLI argument --tune_mm_mlp_adapter True, which sets model_args.tune_mm_mlp_adapter = True.
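Put together, the Stage 1 freezing logic amounts to the following sketch (model is assumed to be a loaded LLaVA model with its vision tower attached, and model_args carries the parsed CLI arguments):

if model_args.tune_mm_mlp_adapter:
    # Step 1: freeze everything -- LLM, vision encoder, and projector
    model.requires_grad_(False)
    # Step 2: unfreeze only the 2-layer MLP projector
    for p in model.get_model().mm_projector.parameters():
        p.requires_grad = True

# Only the projector's ~31.5M parameters (of ~13B total) now receive gradients
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)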

Usage

Use this as the first stage when training a LLaVA model from scratch. Run before visual instruction tuning (Stage 2). Key characteristics:

  • Parameters trained: ~31.5M (projector only, out of ~13B total)
  • Training speed: Fast due to minimal gradient computation -- only the projector's small parameter set generates gradients
  • Memory efficiency: ZeRO-2 is sufficient since optimizer states are small
  • Training duration: 1 epoch over the 558K dataset
  • Learning rate: 1e-3 (relatively high, appropriate for randomly initialized projector)
  • Batch size: 32 per GPU

The output of this stage is a mm_projector.bin file containing only the trained projector weights, which is loaded in Stage 2 via --pretrain_mm_mlp_adapter.
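Conceptually, the saved artifact is just the projector slice of the model's state dict. The following is a minimal sketch (the key filter and filename mirror the mm_projector.bin convention; the exact saving code in the training script differs):

import torch

# Extract only the projector weights from a trained Stage 1 model
projector_weights = {
    k: v.cpu() for k, v in model.state_dict().items() if "mm_projector" in k
}
torch.save(projector_weights, "checkpoints/llava-pretrain/mm_projector.bin")

# In Stage 2, this file is referenced via --pretrain_mm_mlp_adapter and loaded
# back into the projector before visual instruction tuning begins.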

Theoretical Basis

The multimodal projector learns a function f: R^(N x d_v) -> R^(N x d_l) that maps N visual tokens from the CLIP feature dimension d_v (1024 for ViT-L/14) to the LLM's embedding dimension d_l (5120 for Vicuna-13B). For LLaVA v1.5 with mlp2x_gelu, the architecture is:

Projector Architecture (mlp2x_gelu):
    Linear(d_v, d_l)    # 1024 -> 5120
    GELU()
    Linear(d_l, d_l)    # 5120 -> 5120

Total Parameters: d_v * d_l + d_l + d_l * d_l + d_l
                = 1024 * 5120 + 5120 + 5120 * 5120 + 5120
                = 5,242,880 + 5,120 + 26,214,400 + 5,120
                ≈ 31.5M parameters
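The count can be verified directly in PyTorch:

import torch.nn as nn

# mlp2x_gelu projector with d_v = 1024 (CLIP ViT-L/14) and d_l = 5120 (Vicuna-13B)
projector = nn.Sequential(
    nn.Linear(1024, 5120),   # 1024*5120 + 5120 = 5,248,000
    nn.GELU(),
    nn.Linear(5120, 5120),   # 5120*5120 + 5120 = 26,219,520
)
print(sum(p.numel() for p in projector.parameters()))  # 31,467,520 ≈ 31.5M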

The projector is constructed by build_vision_projector() in llava/model/multimodal_projector/builder.py:

# build_vision_projector() for 'mlp2x_gelu'
mlp_gelu_match = re.match(r'^mlp(\d+)x_gelu$', projector_type)
if mlp_gelu_match:
    mlp_depth = int(mlp_gelu_match.group(1))  # 2 for 'mlp2x_gelu'
    modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
    for _ in range(1, mlp_depth):
        modules.append(nn.GELU())
        modules.append(nn.Linear(config.hidden_size, config.hidden_size))
    return nn.Sequential(*modules)
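A usage sketch for Stage 1: the builder reads the projector type and dimensions from the model config (the SimpleNamespace below is a stand-in for that config, an assumption for illustration), and the resulting module maps each image's 24 x 24 = 576 CLIP patch tokens into the LLM's embedding width:

import torch
from types import SimpleNamespace

# Stand-in config with the fields the builder reads
config = SimpleNamespace(mm_projector_type="mlp2x_gelu",
                         mm_hidden_size=1024, hidden_size=5120)
projector = build_vision_projector(config)

# Dummy stand-in for frozen CLIP ViT-L/14-336 output: 576 patch tokens per image
visual_features = torch.randn(1, 576, 1024)   # [batch, N, d_v]
visual_tokens = projector(visual_features)    # [1, 576, 5120], spliced into the LLM input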

During pretraining, the frozen vision encoder produces visual features that the projector must learn to translate into the language model's semantic space. Because both the encoder and LLM are frozen, the projector must adapt entirely -- learning to map CLIP's visual representations into tokens that the LLM interprets as meaningful visual descriptions.

The training configuration for Stage 1:

Stage 1 Pretraining Hyperparameters

  Parameter             Value
  -------------------   ------------------------------------------
  Vision encoder        openai/clip-vit-large-patch14-336 (frozen)
  Language model        lmsys/vicuna-13b-v1.5 (frozen)
  Projector type        mlp2x_gelu (trained)
  Dataset               558K image-caption pairs
  Epochs                1
  Batch size            32 per GPU
  Learning rate         1e-3 (cosine schedule, 3% warmup)
  DeepSpeed config      ZeRO-2 (scripts/zero2.json)
  Precision             BF16
  Max sequence length   2048
