Principle: Haotian Liu LLaVA Feature Alignment Pretraining
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Training strategy that aligns visual features from a frozen vision encoder with a frozen language model by training only a lightweight projection layer. This is Stage 1 of LLaVA's two-stage training pipeline, focused on learning a mapping between the CLIP visual feature space and the LLM's text embedding space.
Description
Feature alignment pretraining (Stage 1 of LLaVA training) trains only the multimodal projector while keeping both the vision encoder (CLIP ViT-L/14-336) and the language model (e.g., Vicuna-13b-v1.5) completely frozen. The projector is a 2-layer MLP with GELU activation (mlp2x_gelu), mapping from CLIP's hidden dimension to the LLM's hidden dimension.
This stage uses 558K image-caption pairs from a filtered CC3M subset (blip_laion_cc_sbu_558k.json) with the plain conversation format, where each sample is a simple image-caption pair:
- User: <image>
- Assistant: {caption text}
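As an illustration, a record in the 558K JSON file can be sketched as below. The field names follow LLaVA's conversation JSON layout; the specific id, image path, and caption are made-up placeholders.

```python
# Illustrative sample in LLaVA's conversation JSON layout; the id, image path,
# and caption text are invented placeholders, not real dataset entries.
sample = {
    "id": "000000001",
    "image": "00000/000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>"},
        {"from": "gpt", "value": "a photo of a dog on a beach"},
    ],
}

# In the 'plain' format there is no system prompt or chat template: the human
# turn reduces to the image token and the assistant turn is the raw caption.
prompt = sample["conversations"][0]["value"] + "\n" + sample["conversations"][1]["value"]
print(prompt)
```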
The freezing mechanism operates as follows:
model.requires_grad_(False)   # freezes all parameters (LLM + vision encoder + projector)
for p in model.get_model().mm_projector.parameters():
    p.requires_grad = True    # selectively unfreezes only the projector parameters
This is triggered by the CLI argument --tune_mm_mlp_adapter True, which sets model_args.tune_mm_mlp_adapter = True.
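The freeze-all-then-unfreeze-the-projector pattern can be sketched with a toy stand-in model. Only the attribute name `mm_projector` mirrors LLaVA; the module shapes and the other attribute names are illustrative.

```python
import torch.nn as nn

# Toy stand-in for the LLaVA model; only 'mm_projector' mirrors the real
# attribute name, the submodules and dimensions are illustrative.
class ToyLlavaModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.language_model = nn.Linear(64, 64)   # stand-in for the frozen LLM
        self.vision_tower = nn.Linear(32, 64)     # stand-in for the frozen CLIP encoder
        self.mm_projector = nn.Sequential(        # the only trainable part in Stage 1
            nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 64)
        )

model = ToyLlavaModel()
model.requires_grad_(False)                # 1) freeze every parameter
for p in model.mm_projector.parameters():  # 2) unfreeze only the projector
    p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only mm_projector.* parameters remain trainable
```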
Usage
Use this as the first stage when training a LLaVA model from scratch. Run before visual instruction tuning (Stage 2). Key characteristics:
- Parameters trained: ~31.5M (projector only, out of ~13B total)
- Training speed: Fast due to minimal gradient computation -- only the projector's small parameter set generates gradients
- Memory efficiency: ZeRO-2 is sufficient since optimizer states are small
- Training duration: 1 epoch over the 558K dataset
- Learning rate: 1e-3 (relatively high, appropriate for a randomly initialized projector)
- Batch size: 32 per GPU
The output of this stage is a mm_projector.bin file containing only the trained projector weights, which is loaded in Stage 2 via --pretrain_mm_mlp_adapter.
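A projector-only checkpoint like mm_projector.bin can be produced by filtering the full state dict on the `mm_projector` key prefix. This is a minimal sketch with a toy model; LLaVA's own saving code differs in detail.

```python
import torch
import torch.nn as nn

# Toy model: 'backbone' stands in for the frozen LLM/vision weights we do NOT
# want to save; 'mm_projector' mirrors the real attribute name.
model = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),
    "mm_projector": nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 8)),
})

# Keep only the projector entries from the full state_dict.
projector_state = {k: v for k, v in model.state_dict().items() if "mm_projector" in k}
torch.save(projector_state, "mm_projector.bin")

# Stage 2 would reload only these weights into a freshly built projector.
restored = torch.load("mm_projector.bin")
print(sorted(restored.keys()))
```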
Theoretical Basis
The multimodal projector learns a function f: R^(N x d_v) -> R^(N x d_l) that maps N visual tokens from the CLIP feature dimension d_v (1024 for ViT-L/14) to the LLM's embedding dimension d_l (5120 for Vicuna-13B). For LLaVA v1.5 with mlp2x_gelu, the architecture is:
Projector Architecture (mlp2x_gelu):
Linear(d_v, d_l) # 1024 -> 5120
GELU()
Linear(d_l, d_l) # 5120 -> 5120
Total Parameters: d_v * d_l + d_l + d_l * d_l + d_l
= 1024 * 5120 + 5120 + 5120 * 5120 + 5120
= 5,242,880 + 5,120 + 26,214,400 + 5,120
≈ 31.5M parameters
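The arithmetic above can be sanity-checked by building the mlp2x_gelu projector with PyTorch and counting its parameters:

```python
import torch.nn as nn

# Build the mlp2x_gelu projector with d_v = 1024 (CLIP ViT-L/14) and
# d_l = 5120 (Vicuna-13B), then count trainable parameters.
d_v, d_l = 1024, 5120
projector = nn.Sequential(
    nn.Linear(d_v, d_l),  # 1024*5120 weights + 5120 biases
    nn.GELU(),            # no parameters
    nn.Linear(d_l, d_l),  # 5120*5120 weights + 5120 biases
)
n_params = sum(p.numel() for p in projector.parameters())
print(n_params)  # 31,467,520 ≈ 31.5M
```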
The projector is constructed by build_vision_projector() in llava/model/multimodal_projector/builder.py:
# build_vision_projector() for 'mlp2x_gelu'
mlp_gelu_match = re.match(r'^mlp(\d+)x_gelu$', projector_type)
if mlp_gelu_match:
    mlp_depth = int(mlp_gelu_match.group(1))  # 2 for 'mlp2x_gelu'
    modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
    for _ in range(1, mlp_depth):
        modules.append(nn.GELU())
        modules.append(nn.Linear(config.hidden_size, config.hidden_size))
    return nn.Sequential(*modules)
During pretraining, the frozen vision encoder produces visual features that the projector must learn to translate into the language model's semantic space. Because both the encoder and LLM are frozen, the projector must adapt entirely -- learning to map CLIP's visual representations into tokens that the LLM interprets as meaningful visual descriptions.
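The shape of this mapping can be traced end to end. For the 336px input, CLIP ViT-L/14 produces a 24x24 patch grid, i.e. N = 576 visual tokens of dimension d_v = 1024, which the projector maps to d_l = 5120 so they can be spliced into the LLM's embedding sequence. The random tensor below merely stands in for frozen CLIP features.

```python
import torch
import torch.nn as nn

# mlp2x_gelu projector: 1024 (CLIP ViT-L/14) -> 5120 (Vicuna-13B)
projector = nn.Sequential(nn.Linear(1024, 5120), nn.GELU(), nn.Linear(5120, 5120))

# Stand-in for frozen CLIP output: (batch, N=576 patch tokens, d_v=1024)
visual_features = torch.randn(1, 576, 1024)
visual_tokens = projector(visual_features)  # (batch, 576, d_l=5120) for the LLM
print(visual_tokens.shape)  # torch.Size([1, 576, 5120])
```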
The training configuration for Stage 1:
| Parameter | Value |
|---|---|
| Vision encoder | openai/clip-vit-large-patch14-336 (frozen) |
| Language model | lmsys/vicuna-13b-v1.5 (frozen) |
| Projector type | mlp2x_gelu (trained) |
| Dataset | 558K image-caption pairs |
| Epochs | 1 |
| Batch size | 32 per GPU |
| Learning rate | 1e-3 (cosine schedule, 3% warmup) |
| DeepSpeed config | ZeRO-2 (scripts/zero2.json) |
| Precision | BF16 |
| Max sequence length | 2048 |