
Implementation:Haotian Liu LLaVA Load Pretrained Model

From Leeroopedia

Overview

Unified model loading function that handles all LLaVA model variants, architectures, and adapter configurations through a single entry point.

Source

  • File: llava/model/builder.py
  • Lines: L26-167

Signature

def load_pretrained_model(
    model_path: str,
    model_base: Optional[str],
    model_name: str,
    load_8bit: bool = False,
    load_4bit: bool = False,
    device_map: str = "auto",
    device: str = "cuda",
    use_flash_attn: bool = False,
    **kwargs
) -> Tuple[AutoTokenizer, LlavaLlamaForCausalLM, CLIPImageProcessor, int]:
    """
    Load a LLaVA model with automatic variant detection.

    Returns:
        Tuple of (tokenizer, model, image_processor, context_len)
    """

Import

from llava.model.builder import load_pretrained_model

Inputs

  • model_path (str, required) -- HuggingFace model ID or local checkpoint path
  • model_base (str, required for LoRA/projector checkpoints, otherwise None) -- Base model path; needed when model_path holds LoRA adapters or a projector-only checkpoint
  • model_name (str, required) -- Model name string used for architecture detection (e.g., 'llava-v1.5-13b')
  • load_8bit (bool, default False) -- Enable 8-bit quantization via BitsAndBytes
  • load_4bit (bool, default False) -- Enable 4-bit NF4 quantization via BitsAndBytes
  • device_map (str, default "auto") -- Device mapping strategy for model parallelism
  • device (str, default "cuda") -- Target device for model loading
  • use_flash_attn (bool, default False) -- Enable Flash Attention 2 for faster inference

Outputs

  • tokenizer (AutoTokenizer) -- Tokenizer for the loaded model
  • model (LlavaLlamaForCausalLM) -- Loaded model (or LlavaMistralForCausalLM / LlavaMptForCausalLM, depending on the detected architecture)
  • image_processor (CLIPImageProcessor) -- CLIP image preprocessor from the vision tower
  • context_len (int) -- Maximum context length for the model
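The context_len output is typically read from the loaded model's config. A minimal sketch of that fallback logic (the attribute name max_sequence_length and the 2048 default are assumptions about the builder's internals, not guaranteed by this page):

```python
from types import SimpleNamespace


def get_context_len(config, default=2048):
    # Hedged sketch: read the context window from the model config if the
    # attribute exists, otherwise fall back to a conservative default.
    return getattr(config, "max_sequence_length", default)


# Usage with a stand-in config object:
cfg = SimpleNamespace(max_sequence_length=4096)
print(get_context_len(cfg))  # 4096
```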

Usage Examples

Standard model

from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name="llava-v1.5-13b"
)

LoRA model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="/path/to/llava-v1.5-13b-lora",
    model_base="liuhaotian/llava-v1.5-13b",
    model_name="llava-v1.5-13b-lora"
)

4-bit quantized

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name="llava-v1.5-13b",
    load_4bit=True
)
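Inside the builder, these flags are translated into BitsAndBytes options on the underlying from_pretrained() call. A plain-dict sketch of that mapping (the function name and exact option values here are illustrative assumptions; the real code constructs a transformers.BitsAndBytesConfig for the 4-bit case):

```python
def build_quantization_kwargs(load_8bit=False, load_4bit=False):
    # Illustrative sketch only: map the loader flags to the kind of keyword
    # arguments passed to from_pretrained(). The real builder uses a
    # transformers.BitsAndBytesConfig object rather than a plain dict.
    kwargs = {}
    if load_8bit:
        kwargs["load_in_8bit"] = True
    elif load_4bit:
        kwargs["load_in_4bit"] = True
        kwargs["quantization_config"] = {
            "bnb_4bit_compute_dtype": "float16",
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_quant_type": "nf4",  # NF4, as noted in the Inputs list
        }
    else:
        # No quantization: load in half precision.
        kwargs["torch_dtype"] = "float16"
    return kwargs
```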

Description

load_pretrained_model() is the single entry point for loading any LLaVA model variant. It follows a decision tree based on the model name:

Decision flow:

  1. Check for quantization -- If load_4bit or load_8bit, configure BitsAndBytesConfig.
  2. Detect model type:
    • If 'llava' in model name and model_base provided:
      • If 'lora' in model name -- Load base model, apply LoRA adapters, merge and unload
      • Otherwise -- Load base model, replace projector weights from checkpoint
    • If 'llava' in model name and no model_base -- Load full LLaVA model directly
    • Otherwise -- Load as plain language model (no vision components)
  3. Detect architecture from model name: LLaMA (default), Mistral ('mistral'), MPT ('mpt')
  4. Initialize vision tower -- Call model.get_vision_tower() and load CLIP weights if not already loaded
  5. Set model to eval mode and return the full inference stack
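The variant-detection branches above can be sketched as pure string dispatch (the helper names and returned strategy labels are hypothetical; the real builder performs the loading inline rather than returning a label):

```python
def detect_load_strategy(model_name, model_base):
    # Hypothetical sketch of the decision flow described above (step 2).
    name = model_name.lower()
    if "llava" in name:
        if model_base is not None:
            if "lora" in name:
                return "load base, apply LoRA adapters, merge and unload"
            return "load base, replace projector weights"
        return "load full LLaVA checkpoint directly"
    return "load as plain language model"


def detect_architecture(model_name):
    # Architecture keywords checked against the model name (step 3).
    name = model_name.lower()
    if "mpt" in name:
        return "LlavaMptForCausalLM"
    if "mistral" in name:
        return "LlavaMistralForCausalLM"
    return "LlavaLlamaForCausalLM"  # LLaMA is the default
```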

LoRA merge process:

  • Loads non_lora_trainables.bin from the adapter checkpoint (contains projector weights)
  • Applies LoRA adapters via PeftModel.from_pretrained()
  • Calls merge_and_unload() to fold LoRA weights into the base model for efficient inference
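Before the adapters are applied, the projector weights from non_lora_trainables.bin need their checkpoint prefixes normalized so they match the base model's state-dict keys. A minimal sketch of that key cleanup (the prefix strings follow the pattern used in LLaVA LoRA checkpoints; treat them as an assumption when adapting to other checkpoints):

```python
def strip_lora_prefixes(state_dict):
    # Drop the "base_model." wrapper that PEFT prepends to every key.
    sd = {(k[len("base_model."):] if k.startswith("base_model.") else k): v
          for k, v in state_dict.items()}
    # Some checkpoints carry an extra "model." nesting; strip one level.
    if any(k.startswith("model.model.") for k in sd):
        sd = {(k[len("model."):] if k.startswith("model.") else k): v
              for k, v in sd.items()}
    return sd  # then: model.load_state_dict(sd, strict=False)
```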

Metadata

  • Knowledge Sources: LLaVA repository -- https://github.com/haotian-liu/LLaVA
  • Domains: Model_Management, Inference
  • Last Updated: 2026-02-13 14:00 GMT
