Workflow: Haotian Liu LLaVA Single-Image Inference
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Inference, Multimodal, Vision_Language |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
Run single-image visual question answering inference using a pre-trained LLaVA model from the command line.
Description
This workflow demonstrates the minimal inference pathway for querying a LLaVA model with an image and a text prompt. It covers loading a pre-trained or fine-tuned LLaVA checkpoint, preprocessing an image through the CLIP vision pipeline, constructing the appropriate conversation template, and generating a text response. The workflow handles multiple model backends (LLaMA, Vicuna, Mistral, MPT) with automatic conversation mode detection.
This is the simplest way to programmatically use LLaVA for inference and serves as the foundation for building custom inference applications.
Usage
Execute this workflow when you want to:
- Run a quick inference test on a single image with a text query
- Integrate LLaVA inference into a custom application or script
- Verify that a trained model checkpoint produces correct visual understanding responses
Execution Steps
Step 1: Load Model and Processor
Initialize the LLaVA model using the unified model builder, which handles the complete loading pipeline: tokenizer initialization, language model instantiation (with optional LoRA or quantization), vision tower loading, and image processor extraction. The builder automatically detects the model variant (LLaMA, Mistral, MPT) from the model name.
What happens:
- load_pretrained_model() is called with the model path and optional base model
- The tokenizer is loaded with special image tokens registered
- The language model is instantiated with the multimodal architecture
- The CLIP vision tower is loaded and its image processor extracted
- For LoRA models, adapter weights are loaded and optionally merged
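The variant auto-detection described above amounts to a substring check on the checkpoint name. The sketch below is an illustrative re-implementation, not the actual `llava.model.builder` source; the function name `detect_variant` is hypothetical:

```python
def detect_variant(model_name: str) -> str:
    """Infer the language-model backbone from a checkpoint name
    (illustrative sketch of the builder's auto-detection)."""
    name = model_name.lower()
    if "mpt" in name:
        return "mpt"
    if "mistral" in name:
        return "mistral"
    # LLaMA and Vicuna checkpoints both load through the LLaMA path
    return "llama"

print(detect_variant("liuhaotian/llava-v1.6-mistral-7b"))  # → mistral
```

Because Vicuna is a fine-tuned LLaMA, both fall through to the same default branch, which mirrors why the builder needs no separate Vicuna case.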
Step 2: Preprocess Image
Load the input image (from a local file or URL) and process it through the CLIP image processor. The processor applies resizing, center cropping, normalization, and optional aspect-ratio-preserving padding to produce the tensor format expected by the vision encoder.
Key considerations:
- Images can be loaded from local paths or HTTP/HTTPS URLs
- The processor handles PIL Image conversion to RGB
- The image tensor is cast to float16 and moved to the model's device
- Multiple images can be processed for multi-image queries (comma-separated paths)
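The normalization step can be illustrated in isolation. The mean/std constants below are the standard OpenAI CLIP normalization values; the function itself is a dependency-free sketch (not the processor's actual code) operating on a single RGB pixel already scaled to [0, 1]:

```python
# Standard OpenAI CLIP normalization constants
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb):
    """Normalize one (r, g, b) pixel in [0, 1] the way the CLIP
    image processor does, channel by channel: (x - mean) / std."""
    return tuple((c - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))
```

A pixel exactly equal to the channel means normalizes to (0, 0, 0); the real processor applies the same arithmetic across the whole resized, center-cropped tensor.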
Step 3: Construct Conversation Prompt
Build the input prompt using the appropriate conversation template for the model variant. The template inserts the image token placeholder at the correct position and applies the model-specific formatting (system message, role prefixes, separators).
What happens:
- Conversation mode is auto-detected from model name (llava_v1, llava_llama_2, mistral_instruct, mpt)
- The <image> token is prepended to the user query
- The conversation template formats the prompt with appropriate role markers
- tokenizer_image_token() converts the prompt to token IDs with the IMAGE_TOKEN_INDEX placeholder
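The splicing done by `tokenizer_image_token()` can be sketched as follows. This is a simplified re-implementation for illustration (the real function works with a Hugging Face tokenizer and handles BOS tokens); the toy whitespace tokenizer is purely for the demo:

```python
IMAGE_TOKEN_INDEX = -200  # sentinel LLaVA uses for the <image> placeholder

def tokenize_with_image_token(prompt, tokenize, image_token="<image>"):
    """Split the prompt on the image placeholder, tokenize each text
    chunk, and splice IMAGE_TOKEN_INDEX between the chunks."""
    chunks = [tokenize(c) for c in prompt.split(image_token)]
    ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            ids.append(IMAGE_TOKEN_INDEX)
        ids.extend(chunk)
    return ids

# Toy whitespace "tokenizer" just for demonstration
vocab = {}
toy_tokenize = lambda s: [vocab.setdefault(w, len(vocab)) for w in s.split()]
print(tokenize_with_image_token("<image> What is in this picture?", toy_tokenize))
# → [-200, 0, 1, 2, 3, 4]
```

At generation time the multimodal forward pass looks for the -200 sentinel positions and substitutes the projected CLIP features there, which is why the placeholder must survive tokenization as a distinct index.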
Step 4: Generate Response
Run the model's generate method with the tokenized prompt and image tensor. The generation uses the multimodal forward pass which replaces image token positions with CLIP visual embeddings projected through the MLP, then autoregressively generates text tokens.
What happens:
- The model's prepare_inputs_labels_for_multimodal() fuses image features into the token sequence
- Autoregressive generation produces output tokens using the configured sampling strategy
- Temperature, top-p, beam search, and max token parameters control generation behavior
- The output is decoded and stripped of special tokens to produce the final text response
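How temperature and top-p shape each decoding step can be shown with a generic sampling sketch. This is not LLaVA's internal code (generation is delegated to Hugging Face `generate`); it is a minimal re-implementation of the two knobs over a raw logit list:

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick the next token id from raw logits, illustrating how
    temperature scaling and top-p (nucleus) filtering interact."""
    if temperature == 0:  # temperature 0 -> greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    # Sort token probabilities descending for nucleus filtering
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Keep the smallest prefix whose cumulative probability reaches top_p
    kept, cum = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cum += p
        if cum >= top_p:
            break
    # Sample proportionally from the surviving tokens
    r = rng.random() * sum(p for p, _ in kept)
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]
```

Lower temperature sharpens the distribution toward the top logit, while a small top-p prunes the tail outright; with temperature 0 or a very small top-p, sampling collapses to greedy decoding of the argmax token.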