Principle: LLaVA Visual Instruction Tuning (Haotian Liu et al.)
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Training strategy that fine-tunes a vision-language model end-to-end on visual instruction-following data to enable conversational visual reasoning. This is Stage 2 of LLaVA's two-stage training pipeline, where the full language model is unfrozen and trained jointly with the pretrained multimodal projector on multi-turn visual conversations.
Description
Visual instruction tuning (Stage 2 of LLaVA training) unfreezes the language model and trains it jointly with the multimodal projector on 665K visual instruction-following conversations (`llava_v1_5_mix665k.json`). The pretrained projector weights from Stage 1 are loaded via `--pretrain_mm_mlp_adapter`, and the full LLM is trained with a lower learning rate (2e-5 vs. 1e-3 in Stage 1) to preserve the pre-existing language capabilities while adapting them for visual reasoning.
A custom LLaVATrainer extends HuggingFace's Trainer with three key modifications:
- Modality-length-grouped sampling -- The `_get_train_sampler()` method returns a custom `LengthGroupedSampler` that separates image-containing and text-only samples into distinct batches. This prevents mixing modalities within a batch, reducing padding waste when image samples (with visual tokens) are much longer than text-only samples.
- Separate projector learning rate -- The `create_optimizer()` method supports an optional `mm_projector_lr` parameter that allows the projector to train at a different learning rate than the LLM. When set, four optimizer parameter groups are created: LLM with/without weight decay, and projector with/without weight decay.
- Custom checkpoint saving -- The `_save_checkpoint()` and `_save()` methods are overridden to support selective weight saving. When `tune_mm_mlp_adapter` is True (Stage 1), only projector weights are saved. For Stage 2, the full model is saved via DeepSpeed's ZeRO-3 checkpoint mechanism.
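The four-group optimizer setup implied by `mm_projector_lr` can be sketched as follows. This is a simplified, hypothetical stand-in for `LLaVATrainer.create_optimizer`: it takes `(name, parameter)` pairs, routes anything under a module named `mm_projector` (LLaVA's module naming) to its own groups, and uses an illustrative bias/norm weight-decay heuristic; the `proj_lr` default is likewise illustrative.

```python
# Hypothetical sketch of the four optimizer parameter groups implied by
# `mm_projector_lr` -- not LLaVA's actual `create_optimizer`.
def build_param_groups(named_parameters, base_lr=2e-5, proj_lr=2e-6, weight_decay=0.0):
    decay, no_decay, proj_decay, proj_no_decay = [], [], [], []
    for name, p in named_parameters:
        if not getattr(p, "requires_grad", True):
            continue  # frozen weights (e.g. the vision encoder) are skipped
        is_proj = "mm_projector" in name  # LLaVA's projector module name
        no_wd = name.endswith(".bias") or "norm" in name.lower()  # illustrative heuristic
        bucket = (proj_no_decay if no_wd else proj_decay) if is_proj \
            else (no_decay if no_wd else decay)
        bucket.append(p)
    return [
        {"params": decay,         "lr": base_lr, "weight_decay": weight_decay},
        {"params": no_decay,      "lr": base_lr, "weight_decay": 0.0},
        {"params": proj_decay,    "lr": proj_lr, "weight_decay": weight_decay},
        {"params": proj_no_decay, "lr": proj_lr, "weight_decay": 0.0},
    ]

# Tiny illustration with stand-in "parameters" (any object with .requires_grad).
class _P:
    requires_grad = True

demo = build_param_groups([("llm.weight", _P()), ("llm.bias", _P()),
                           ("mm_projector.weight", _P()), ("mm_projector.bias", _P())])
```

Passing groups like these to an optimizer (e.g. `torch.optim.AdamW`) is what lets the LLM train at 2e-5 while the projector follows its own `mm_projector_lr`.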
Usage
Use this as the second stage after feature alignment pretraining. This stage produces the final LLaVA model capable of multi-turn visual conversations.
Key differences from Stage 1:
| Aspect | Stage 1 (Pretraining) | Stage 2 (Finetuning) |
|---|---|---|
| Frozen components | LLM + Vision Encoder | Vision Encoder only |
| Trained components | Projector only | LLM + Projector |
| Dataset | 558K image-caption pairs | 665K instruction-following conversations |
| Conversation format | `plain` | `v1` |
| Learning rate | 1e-3 | 2e-5 |
| Batch size per GPU | 32 | 16 |
| DeepSpeed config | ZeRO-2 | ZeRO-3 |
| Image aspect ratio | `square` | `pad` |
| Modality grouping | No | Yes (`--group_by_modality_length True`) |
Theoretical Basis
The training loss is standard autoregressive cross-entropy computed on assistant tokens only. User turns are masked with `IGNORE_INDEX = -100`, which is the default `ignore_index` value for PyTorch's `CrossEntropyLoss`:
Loss = -1/T * SUM_{t in assistant_tokens} log P(x_t | x_{<t}, image)
Where:
- T = number of assistant tokens (unmasked positions)
- x_t = token at position t
- x_{<t} = all preceding tokens (including image tokens)
- image = CLIP-encoded visual features projected into LLM space
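A minimal pure-Python rendering of the formula above (the helper `masked_nll` is illustrative; in actual training this quantity is computed by `CrossEntropyLoss` with `ignore_index=-100` over the model's logits):

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def masked_nll(token_logprobs, labels):
    """Average negative log-likelihood over assistant tokens only.

    token_logprobs[t] = log P(x_t | x_{<t}, image); labels[t] is the target
    token id, or IGNORE_INDEX for masked (user-turn) positions.
    """
    kept = [lp for lp, y in zip(token_logprobs, labels) if y != IGNORE_INDEX]
    return -sum(kept) / len(kept)

# Two user-turn positions masked out; only the last two tokens contribute:
# loss = -((-0.3) + (-0.4)) / 2 = 0.35
loss = masked_nll([-0.1, -0.2, -0.3, -0.4], [IGNORE_INDEX, IGNORE_INDEX, 5, 7])
```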
Modality-Length-Grouped Sampling
The `LengthGroupedSampler` with `group_by_modality=True` implements the following batching strategy:
1. SEPARATE by modality:
mm_indices = [i for i, l in enumerate(lengths) if l > 0] # image samples
lang_indices = [i for i, l in enumerate(lengths) if l < 0] # text-only samples
2. SORT within each modality by sequence length (descending)
3. FORM megabatches of size (world_size * batch_size) within each modality
4. SHUFFLE megabatches across modalities (but not within)
5. COMBINE: remaining partial batches from both modalities are merged last
This grouping reduces padding waste by ensuring that samples within a batch have similar lengths, and prevents the extreme length mismatch that occurs when mixing image samples (~2048 tokens with visual embeddings) with text-only samples (~200 tokens).
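The steps above can be sketched in pure Python. This is a simplified, hypothetical stand-in for LLaVA's `LengthGroupedSampler` (the real sampler also randomizes order within megabatches); it keeps the sign convention from step 1, where positive lengths mark image samples and negative lengths mark text-only samples.

```python
import random

# Simplified stand-in for LengthGroupedSampler with group_by_modality=True.
def modality_grouped_indices(lengths, batch_size, world_size, seed=0):
    rng = random.Random(seed)
    mega = world_size * batch_size
    # 1. separate by modality (positive length = image, negative = text-only)
    # 2. sort within each modality by sequence length, descending
    mm = sorted((i for i, l in enumerate(lengths) if l > 0),
                key=lambda i: lengths[i], reverse=True)
    lang = sorted((i for i, l in enumerate(lengths) if l < 0),
                  key=lambda i: abs(lengths[i]), reverse=True)
    # 3. form megabatches of size world_size * batch_size within each modality
    mm_batches = [mm[k:k + mega] for k in range(0, len(mm), mega)]
    lang_batches = [lang[k:k + mega] for k in range(0, len(lang), mega)]
    # 5. (prepared here) trailing partial megabatches are merged and placed last
    tail = []
    if mm_batches and len(mm_batches[-1]) < mega:
        tail += mm_batches.pop()
    if lang_batches and len(lang_batches[-1]) < mega:
        tail += lang_batches.pop()
    batches = mm_batches + lang_batches
    rng.shuffle(batches)  # 4. shuffle megabatches across modalities, not within
    if tail:
        batches.append(tail)
    return [i for batch in batches for i in batch]
```

Every full megabatch in the returned index order is single-modality, so a batch never mixes long image samples with short text-only ones.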
DeepSpeed ZeRO-3
Stage 2 uses ZeRO-3 because the full 13B-parameter LLM is now trainable. ZeRO-3 partitions parameters, gradients, and optimizer states across all GPUs:
- Parameter all-gather occurs before each forward/backward layer computation
- Gradient reduce-scatter occurs after backward pass
- `stage3_gather_16bit_weights_on_model_save=true` ensures full model weights are reconstructed on rank 0 during checkpoint saving
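A sketch of the relevant configuration, written as a Python dict (DeepSpeed accepts dict configs as well as JSON files). Only the gather-on-save flag and the per-GPU batch size come from this document; the surrounding keys and values are assumptions, and LLaVA ships its own ZeRO-3 JSON config.

```python
# Assumed shape of a ZeRO-3 DeepSpeed config; only the gather-on-save flag
# and per-GPU batch size are cited in the text above.
ds_config = {
    "zero_optimization": {
        "stage": 3,  # partition params, gradients, and optimizer states
        "overlap_comm": True,  # overlap all-gather/reduce-scatter with compute
        # reconstruct full 16-bit weights on rank 0 at checkpoint save:
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 16,  # Stage 2 per-GPU batch size
    "gradient_accumulation_steps": 1,
}
```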