
Principle: Haotian Liu LLaVA Visual Instruction Tuning

From Leeroopedia
Metadata
Last Updated 2026-02-13 00:00 GMT

Overview

Training strategy that fine-tunes a vision-language model end-to-end on visual instruction-following data to enable conversational visual reasoning. This is Stage 2 of LLaVA's two-stage training pipeline, where the full language model is unfrozen and trained jointly with the pretrained multimodal projector on multi-turn visual conversations.

Description

Visual instruction tuning (Stage 2 of LLaVA training) unfreezes the language model and trains it jointly with the multimodal projector on 665K visual instruction-following conversations (llava_v1_5_mix665k.json). The pretrained projector weights from Stage 1 are loaded via --pretrain_mm_mlp_adapter, and the full LLM is trained with a lower learning rate (2e-5 vs 1e-3 in Stage 1) to preserve the pre-existing language capabilities while adapting them for visual reasoning.

A custom LLaVATrainer extends HuggingFace's Trainer with three key modifications:

  1. Modality-length-grouped sampling -- The _get_train_sampler() method returns a custom LengthGroupedSampler that separates image-containing and text-only samples into distinct batches. This prevents mixing modalities within a batch, reducing padding waste when image samples (with visual tokens) are much longer than text-only samples.
  2. Separate projector learning rate -- The create_optimizer() method supports an optional mm_projector_lr parameter that allows the projector to train at a different learning rate than the LLM. When set, four optimizer parameter groups are created: LLM with/without weight decay, and projector with/without weight decay.
  3. Custom checkpoint saving -- The _save_checkpoint() and _save() methods are overridden to support selective weight saving. When tune_mm_mlp_adapter is True (Stage 1), only projector weights are saved. For Stage 2, the full model is saved via DeepSpeed's ZeRO-3 checkpoint mechanism.
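The separate projector learning rate (modification 2) can be sketched as optimizer parameter groups. The module name mm_projector follows LLaVA's naming; the helper below and its decay-exemption rule are illustrative, not the actual LLaVATrainer code.

```python
import torch
from torch import nn

def build_param_groups(model, lr=2e-5, mm_projector_lr=2e-6, weight_decay=0.0):
    """Split parameters into four optimizer groups, mirroring the
    create_optimizer() override: LLM with/without weight decay, and
    projector with/without weight decay. Illustrative sketch only."""
    decay_exempt = ("bias", "norm")  # assumption: no decay on biases/norms
    is_proj = lambda n: "mm_projector" in n
    exempt = lambda n: any(k in n for k in decay_exempt)
    groups = [
        {"params": [p for n, p in model.named_parameters()
                    if not is_proj(n) and not exempt(n)],
         "lr": lr, "weight_decay": weight_decay},
        {"params": [p for n, p in model.named_parameters()
                    if not is_proj(n) and exempt(n)],
         "lr": lr, "weight_decay": 0.0},
        {"params": [p for n, p in model.named_parameters()
                    if is_proj(n) and not exempt(n)],
         "lr": mm_projector_lr, "weight_decay": weight_decay},
        {"params": [p for n, p in model.named_parameters()
                    if is_proj(n) and exempt(n)],
         "lr": mm_projector_lr, "weight_decay": 0.0},
    ]
    return [g for g in groups if g["params"]]

# Tiny stand-in model: one "LLM" layer plus an mm_projector submodule.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.llm = nn.Linear(8, 8)
        self.mm_projector = nn.Linear(4, 8)

groups = build_param_groups(Toy())
optimizer = torch.optim.AdamW(groups)
```

When mm_projector_lr is unset, LLaVA falls back to a single learning rate for all trainable parameters.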

Usage

Use this as the second stage after feature alignment pretraining. This stage produces the final LLaVA model capable of multi-turn visual conversations.

Key differences from Stage 1:

Stage 1 vs Stage 2 Comparison
Aspect                  Stage 1 (Pretraining)       Stage 2 (Finetuning)
Frozen components       LLM + Vision Encoder        Vision Encoder only
Trained components      Projector only              LLM + Projector
Dataset                 558K image-caption pairs    665K instruction-following conversations
Conversation format     plain                       v1
Learning rate           1e-3                        2e-5
Batch size per GPU      32                          16
DeepSpeed config        ZeRO-2                      ZeRO-3
Image aspect ratio      square                      pad
Modality grouping       No                          Yes (--group_by_modality_length True)

Theoretical Basis

The training loss is standard autoregressive cross-entropy computed on assistant tokens only. User turns are masked with IGNORE_INDEX = -100, which is the default ignore_index value for PyTorch's CrossEntropyLoss:

Loss = -1/T * SUM_{t in assistant_tokens} log P(x_t | x_{<t}, image)

Where:
    T = number of assistant tokens (unmasked positions)
    x_t = token at position t
    x_{<t} = all preceding tokens (including image tokens)
    image = CLIP-encoded visual features projected into LLM space
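This masking falls out of PyTorch's CrossEntropyLoss directly: positions labelled -100 are skipped, and the loss is averaged over the remaining T assistant tokens. The toy tensors below are illustrative, not LLaVA's actual data pipeline.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index of CrossEntropyLoss

vocab_size, seq_len = 10, 6
logits = torch.randn(seq_len, vocab_size)
# User-turn positions are masked with IGNORE_INDEX; only the three
# assistant tokens (labels 3, 7, 2) contribute to the loss.
labels = torch.tensor([IGNORE_INDEX, IGNORE_INDEX, 3, 7, IGNORE_INDEX, 2])

loss = F.cross_entropy(logits, labels, ignore_index=IGNORE_INDEX)

# Equivalent manual computation over the unmasked positions only:
mask = labels != IGNORE_INDEX
manual = F.cross_entropy(logits[mask], labels[mask])
assert torch.allclose(loss, manual)
```

Because masked positions contribute neither loss nor gradient, the model is never penalized for failing to reproduce user turns.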

Modality-Length-Grouped Sampling

The LengthGroupedSampler with group_by_modality=True implements a two-phase batching strategy:

1. SEPARATE by modality:
   mm_indices  = [i for i, l in enumerate(lengths) if l > 0]   # image samples
   lang_indices = [i for i, l in enumerate(lengths) if l < 0]   # text-only samples

2. SORT within each modality by sequence length (descending)

3. FORM megabatches of size (world_size * batch_size) within each modality

4. SHUFFLE megabatches across modalities (but not within)

5. COMBINE: remaining partial batches from both modalities are merged last

This grouping reduces padding waste by ensuring that samples within a batch have similar lengths, and prevents the extreme length mismatch that occurs when mixing image samples (~2048 tokens with visual embeddings) with text-only samples (~200 tokens).
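The five steps above can be sketched as a standalone function. The signed-length convention (positive = image sample, negative = text-only) matches the separation rule shown earlier; the function itself is a simplified illustration, not LLaVA's LengthGroupedSampler.

```python
import random

def modality_grouped_indices(lengths, world_size, batch_size, seed=0):
    """Order sample indices into modality-pure megabatches, with samples
    of similar length grouped together. Simplified sketch."""
    rng = random.Random(seed)
    mega = world_size * batch_size
    # 1. Separate by modality (positive length = image sample).
    mm = [i for i, l in enumerate(lengths) if l > 0]
    lang = [i for i, l in enumerate(lengths) if l < 0]
    # 2. Sort within each modality by sequence length, descending.
    mm.sort(key=lambda i: abs(lengths[i]), reverse=True)
    lang.sort(key=lambda i: abs(lengths[i]), reverse=True)
    # 3. Form megabatches of size world_size * batch_size per modality.
    batches = [mm[i:i + mega] for i in range(0, len(mm), mega)]
    batches += [lang[i:i + mega] for i in range(0, len(lang), mega)]
    full = [b for b in batches if len(b) == mega]
    partial = [i for b in batches if len(b) < mega for i in b]
    # 4. Shuffle full megabatches across modalities (but not within).
    rng.shuffle(full)
    # 5. Merge any leftover partial batches at the end.
    return [i for b in full for i in b] + partial

# Four image samples (~2000 tokens) and four text-only samples (~200).
lengths = [2048, -200, 1900, -180, 2100, -220, -210, 1800]
order = modality_grouped_indices(lengths, world_size=2, batch_size=2)
```

With world_size=2 and batch_size=2 each megabatch of four indices is modality-pure, so no batch ever pads a ~200-token text sample up to image-sample length.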

DeepSpeed ZeRO-3

Stage 2 uses ZeRO-3 because the full 13B-parameter LLM is now trainable. ZeRO-3 partitions parameters, gradients, and optimizer states across all GPUs:

  • Parameter all-gather occurs before each forward/backward layer computation
  • Gradient reduce-scatter occurs after backward pass
  • stage3_gather_16bit_weights_on_model_save=true ensures full model weights are reconstructed on rank 0 during checkpoint saving
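A minimal ZeRO-3 configuration fragment covering these points might look like the following, shown here as a Python dict mirroring DeepSpeed's JSON schema. This is a hedged sketch: the LLaVA repo ships a full JSON config with additional fields (optimizer, batch sizes, scheduler) that are omitted here.

```python
# Sketch of the ZeRO-3 settings discussed above; not the repo's full config.
ds_config = {
    "zero_optimization": {
        "stage": 3,  # partition params, gradients, and optimizer states
        "overlap_comm": True,  # overlap all-gather / reduce-scatter with compute
        # Reconstruct full 16-bit weights on rank 0 at checkpoint time:
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},
}
```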
