Principle: Haotian Liu LLaVA LoRA Training
Overview
Training procedure that applies LoRA adapters to a pre-trained LLaVA model and trains them on task-specific data.
Description
LoRA training in LLaVA uses the same train() function as full finetuning but with lora_enable=True. The procedure follows these steps:
- Load the base LLaVA model (LlavaLlamaForCausalLM.from_pretrained())
- If using QLoRA (bits=4 or 8), quantize the base model via BitsAndBytesConfig and prepare it with prepare_model_for_kbit_training()
- Auto-detect target linear layers via find_all_linear_names()
- Create a LoraConfig and apply LoRA adapters via get_peft_model()
- Initialize vision modules, tokenizer, and data pipeline
- Train with standard cross-entropy loss using LLaVATrainer
- Save only LoRA adapter weights and non-LoRA trainables separately
At checkpoint time, only the LoRA adapter weights and the non-LoRA trainable parameters (chiefly the mm_projector) are saved. The custom LLaVATrainer handles adapter-aware checkpoint saving through get_peft_state_maybe_zero_3() and get_peft_state_non_lora_maybe_zero_3().
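The layer-targeting step above can be sketched as follows. This is an illustrative reimplementation of the find_all_linear_names() idea on a toy module, not the real LLaVA architecture: every nn.Linear in the LLM becomes a LoRA target, while vision/projector modules and the output head are excluded.

```python
# Sketch of find_all_linear_names(): collect nn.Linear layer names,
# skip multimodal modules, and drop lm_head (toy model, not real LLaVA).
import torch.nn as nn

def find_all_linear_names(model):
    multimodal_keywords = ["mm_projector", "vision_tower", "vision_resampler"]
    lora_module_names = set()
    for name, module in model.named_modules():
        # skip vision/projector modules -- LoRA targets the LLM only
        if any(kw in name for kw in multimodal_keywords):
            continue
        if isinstance(module, nn.Linear):
            lora_module_names.add(name.split(".")[-1])
    # the output head is excluded from LoRA targeting
    lora_module_names.discard("lm_head")
    return sorted(lora_module_names)

# toy stand-in: two attention projections, the projector, and the head
model = nn.ModuleDict({
    "q_proj": nn.Linear(8, 8),
    "v_proj": nn.Linear(8, 8),
    "mm_projector": nn.Linear(8, 8),
    "lm_head": nn.Linear(8, 8),
})
print(find_all_linear_names(model))  # ['q_proj', 'v_proj']
```

The returned names would then be passed as target_modules to a LoraConfig before get_peft_model() wraps the model.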
Usage
Use this when you want parameter-efficient finetuning of LLaVA on custom visual instruction data. Requires a pre-trained LLaVA checkpoint (or base LLM + pretrained mm_projector) as the starting point. This approach is recommended when:
- You have limited task-specific data
- GPU memory is constrained
- You want to maintain multiple task-specific adapters sharing one base model
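An abridged launch command, modeled on scripts/v1_5/finetune_lora.sh in the LLaVA repo, shows how these usage options map to flags; all data and checkpoint paths below are placeholders for your own:

```shell
# Illustrative, abridged LoRA launch (paths are placeholders)
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --model_name_or_path lmsys/vicuna-13b-v1.5 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-13b-pretrain/mm_projector.bin \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --data_path ./your_instruction_data.json \
    --image_folder ./your_images \
    --output_dir ./checkpoints/llava-v1.5-13b-lora
```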
Theoretical Basis
After get_peft_model() wraps the model, only LoRA parameters (A and B matrices) have requires_grad=True. The base model weights remain frozen, and gradients flow only through the low-rank adapters.
At save time:
- get_peft_state_maybe_zero_3() extracts only LoRA weights (parameters containing "lora_" in their name), handling DeepSpeed ZeRO-3 parameter gathering
- get_peft_state_non_lora_maybe_zero_3() extracts non-LoRA trainable parameters (primarily mm_projector weights) into non_lora_trainables.bin
- model.save_pretrained() saves the LoRA adapter configuration and weights (adapter_config.json, adapter_model.bin)
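The save-time split above can be sketched as a simple name-based filter over trainable state; this mirrors the "lora_" substring test used by get_peft_state_maybe_zero_3() and get_peft_state_non_lora_maybe_zero_3(), with DeepSpeed ZeRO-3 gathering omitted for brevity:

```python
# Illustrative checkpoint split: LoRA weights vs. other trainables.
def split_trainable_state(named_params):
    """named_params: iterable of (name, tensor, requires_grad) triples."""
    lora_state = {n: t for n, t, _ in named_params if "lora_" in n}
    non_lora_state = {n: t for n, t, g in named_params
                      if g and "lora_" not in n}
    return lora_state, non_lora_state

# toy listing: two LoRA factors, the projector, and a frozen base weight
params = [
    ("layers.0.q_proj.lora_A.weight", "A", True),
    ("layers.0.q_proj.lora_B.weight", "B", True),
    ("mm_projector.0.weight", "P", True),
    ("layers.0.q_proj.weight", "W", False),
]
lora, non_lora = split_trainable_state(params)
print(sorted(lora))      # LoRA weights -> adapter_model.bin
print(sorted(non_lora))  # everything else trainable -> non_lora_trainables.bin
```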
This enables efficient checkpoint storage: roughly 100 MB of LoRA adapter weights versus ~26 GB for a full 13B-parameter checkpoint in fp16.
The mm_projector_lr parameter allows training the multimodal projector at a different (typically lower) learning rate than the LoRA adapters, providing fine-grained control over the adaptation of different model components.
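A hedged sketch of how such a split learning rate can be wired up with optimizer parameter groups (LLaVATrainer builds similar groups internally when mm_projector_lr is set; the module layout and rates below are illustrative):

```python
# Separate learning rates for LoRA adapters and the mm_projector via
# torch.optim parameter groups (illustrative, not LLaVATrainer itself).
import torch
import torch.nn as nn

model = nn.ModuleDict({
    "lora_adapter": nn.Linear(8, 8),   # stand-in for LoRA parameters
    "mm_projector": nn.Linear(8, 8),   # multimodal projector
})
projector = [p for n, p in model.named_parameters() if "mm_projector" in n]
others = [p for n, p in model.named_parameters() if "mm_projector" not in n]
optimizer = torch.optim.AdamW([
    {"params": others, "lr": 2e-4},     # LoRA adapter learning rate
    {"params": projector, "lr": 2e-5},  # mm_projector_lr (lower)
])
print([g["lr"] for g in optimizer.param_groups])  # [0.0002, 2e-05]
```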
Knowledge Sources
- Paper -- LoRA: Low-Rank Adaptation of Large Language Models -- https://arxiv.org/abs/2106.09685
- Repo -- LLaVA -- https://github.com/haotian-liu/LLaVA
Domains
- Fine_Tuning
- Parameter_Efficient_Fine_Tuning
Metadata
| Field | Value |
|---|---|
| last_updated | 2026-02-13 14:00 GMT |
| source_repo | Haotian_liu_LLaVA |
| commit | 799f5f207c89 |
| type | Principle |