Principle: Haotian Liu LLaVA Model Loading
Overview
Unified procedure for loading pre-trained vision-language models with automatic detection of model type, architecture, and adapter configuration.
Description
Model loading in LLaVA handles multiple model variants through a single entry point. The loading procedure automatically detects and handles the following dimensions:
- Multimodal vs. plain language model -- Detects whether the model is a LLaVA multimodal model (contains 'llava' in the name) or a plain language model.
- Language model architecture -- Identifies the underlying LLM architecture from the model name:
- LLaMA -- Default architecture
- Mistral -- Detected by 'mistral' in model name
- MPT -- Detected by 'mpt' in model name
- LoRA adapter detection -- If 'lora' is present in the model name, loads the base model first, then applies and merges LoRA adapter weights.
- Projector-only checkpoint -- If a `model_base` is provided without LoRA indicators, loads the base model and replaces the multimodal projector weights.
- Quantization settings -- Supports 4-bit NF4 quantization and 8-bit quantization via `BitsAndBytesConfig`.
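The detection dimensions above can be sketched as a single name-based dispatch. This is a schematic of the string-matching logic, not the repository's actual builder code; the function and field names are illustrative:

```python
def detect_variant(model_name, model_base=None):
    """Classify a checkpoint by substring matching on its (lowercased) name."""
    name = model_name.lower()
    return {
        "multimodal": "llava" in name,
        # LoRA adapter mode needs a base model to merge into.
        "lora": "lora" in name and model_base is not None,
        # Projector-only: a base is given but no LoRA indicator in the name.
        "projector_only": model_base is not None and "lora" not in name,
        "architecture": ("mpt" if "mpt" in name
                         else "mistral" if "mistral" in name
                         else "llama"),  # LLaMA is the default
    }

print(detect_variant("llava-v1.5-13b-lora", model_base="vicuna-13b"))
```

Because dispatch is purely name-based, renaming a checkpoint directory (e.g. dropping 'llava' from it) changes how it is loaded.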
After loading the language model and any adapters, the procedure:
- Initializes the vision tower (CLIP ViT-L/14) if not already loaded
- Returns the complete inference stack: tokenizer, model, image_processor, context_len
Usage
Use as the primary model loading function for any LLaVA inference or evaluation task. This single function handles all model variants automatically, eliminating the need for variant-specific loading code.
Common scenarios:
- Standard model -- Provide only `model_path`
- LoRA model -- Provide `model_path` (adapter) + `model_base` (base model)
- Quantized inference -- Add `load_4bit=True` or `load_8bit=True`
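The scenarios above map onto argument sets as sketched below. The helper function is illustrative (not part of the repository); it assumes the usual entry point `load_pretrained_model` from `llava.model.builder`, which derives detection from the model name:

```python
def build_load_kwargs(model_path, model_base=None, load_4bit=False, load_8bit=False):
    """Assemble the argument set for the loader; model_name is taken from the
    last path component, since the name-based detection keys off it."""
    model_name = model_path.rstrip("/").split("/")[-1]
    return {
        "model_path": model_path,
        "model_base": model_base,   # required for LoRA / projector-only
        "model_name": model_name,   # drives 'llava'/'lora'/architecture detection
        "load_4bit": load_4bit,
        "load_8bit": load_8bit,
    }

# Standard model
kw = build_load_kwargs("liuhaotian/llava-v1.5-7b")
# LoRA model: adapter path plus base model
kw_lora = build_load_kwargs("liuhaotian/llava-v1.5-13b-lora",
                            model_base="lmsys/vicuna-13b-v1.5")
# Quantized inference
kw_4bit = build_load_kwargs("liuhaotian/llava-v1.5-7b", load_4bit=True)

# Each dict is then unpacked into the loader, e.g.:
# tokenizer, model, image_processor, context_len = load_pretrained_model(**kw)
```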
Theoretical Basis
Model type detection uses string matching on `model_name`:
- `'llava'` in name -- multimodal model path
- `'lora'` in name -- LoRA adapter mode
- `'mpt'` in name -- MPT architecture selection
- `'mistral'` in name -- Mistral architecture selection
LoRA loading sequence:
- Load the base model with full precision (or quantized)
- Load `non_lora_trainables.bin` (additional non-LoRA weights like the projector)
- Apply LoRA adapters via `PeftModel.from_pretrained()`
- Merge and unload LoRA weights via `merge_and_unload()` for inference efficiency
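The order of operations in this sequence can be illustrated on plain weight dictionaries. This is a toy numerical model of the merge semantics, not the actual PEFT machinery:

```python
def load_lora_merged(base_weights, non_lora_trainables, lora_delta):
    """Toy illustration of the LoRA loading order on dicts of floats."""
    weights = dict(base_weights)                # 1. load the base model
    weights.update(non_lora_trainables)         # 2. non-LoRA weights (projector) override base
    merged = {k: v + lora_delta.get(k, 0.0)     # 3.+4. apply adapter deltas and fold
              for k, v in weights.items()}      #       them into the weights ("merge")
    return merged                               # no adapter indirection left at inference

base = {"lm.q_proj": 1.0, "mm_projector": 0.0}
extra = {"mm_projector": 0.5}        # non_lora_trainables.bin analogue
delta = {"lm.q_proj": 0.25}          # merged LoRA update (scaled B @ A)
print(load_lora_merged(base, extra, delta))
# → {'lm.q_proj': 1.25, 'mm_projector': 0.5}
```

Merging the deltas up front is why `merge_and_unload()` helps inference: the forward pass runs on ordinary dense weights with no per-layer adapter lookup.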
Quantization: 4-bit uses NF4 (Normal Float 4) with double quantization and bfloat16 compute type. 8-bit uses the default BitsAndBytesConfig 8-bit configuration.
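A quantization configuration matching this description would look like the following, using the Hugging Face `transformers` `BitsAndBytesConfig` (a sketch; exact kwargs may vary with your installed version):

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit: NF4 quant type, double quantization, bfloat16 compute dtype
bnb_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 8-bit: default 8-bit settings
bnb_8bit = BitsAndBytesConfig(load_in_8bit=True)
```

The config object is passed to the model's `from_pretrained()` call as `quantization_config`.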
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Repo - LLaVA - https://github.com/haotian-liu/LLaVA |
| Domains | Model_Management, Inference |
| Last Updated | 2026-02-13 14:00 GMT |