Workflow: Haotian Liu LLaVA Single-Image Inference
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Inference, Multimodal, Vision_Language |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
Run single-image visual question answering inference using a pre-trained LLaVA model from the command line.
Description
This workflow demonstrates the minimal inference pathway for querying a LLaVA model with an image and a text prompt. It covers loading a pre-trained or fine-tuned LLaVA checkpoint, preprocessing an image through the CLIP vision pipeline, constructing the appropriate conversation template, and generating a text response. The workflow handles multiple model backends (LLaMA, Vicuna, Mistral, MPT) with automatic conversation mode detection.
This is the simplest way to programmatically use LLaVA for inference and serves as the foundation for building custom inference applications.
Usage
Execute this workflow when you want to:
- Run a quick inference test on a single image with a text query
- Integrate LLaVA inference into a custom application or script
- Verify that a trained model checkpoint produces correct visual understanding responses
Execution Steps
Step 1: Load Model and Processor
Initialize the LLaVA model using the unified model builder, which handles the complete loading pipeline: tokenizer initialization, language model instantiation (with optional LoRA or quantization), vision tower loading, and image processor extraction. The builder automatically detects the model variant (LLaMA, Mistral, MPT) from the model name.
What happens:
- load_pretrained_model() is called with the model path and optional base model
- The tokenizer is loaded with special image tokens registered
- The language model is instantiated with the multimodal architecture
- The CLIP vision tower is loaded and its image processor extracted
- For LoRA models, adapter weights are loaded and optionally merged
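The variant auto-detection described above amounts to a substring check on the checkpoint name. The sketch below is an illustrative re-implementation, not the actual `llava.model.builder` source; the function name `detect_variant` is hypothetical:

```python
def detect_variant(model_name: str) -> str:
    """Infer the language-model backbone from a checkpoint name
    (illustrative sketch of the builder's auto-detection)."""
    name = model_name.lower()
    if "mpt" in name:
        return "mpt"
    if "mistral" in name:
        return "mistral"
    # LLaMA and Vicuna checkpoints both load through the LLaMA path
    return "llama"

print(detect_variant("liuhaotian/llava-v1.6-mistral-7b"))  # → mistral
```

Because Vicuna is a fine-tuned LLaMA, both fall through to the same default branch, which mirrors why the builder needs no separate Vicuna case.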
Step 2: Preprocess Image
Load the input image (from a local file or URL) and process it through the CLIP image processor. The processor applies resizing, center cropping, normalization, and optional aspect-ratio-preserving padding to produce the tensor format expected by the vision encoder.
Key considerations:
- Images can be loaded from local paths or HTTP/HTTPS URLs
- The processor handles PIL Image conversion to RGB
- The image tensor is cast to float16 and moved to the model's device
- Multiple images can be processed for multi-image queries (comma-separated paths)
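The normalization step can be illustrated in isolation. The mean/std constants below are the standard OpenAI CLIP normalization values; the function itself is a dependency-free sketch (not the processor's actual code) operating on a single RGB pixel already scaled to [0, 1]:

```python
# Standard OpenAI CLIP normalization constants
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb):
    """Normalize one (r, g, b) pixel in [0, 1] the way the CLIP
    image processor does, channel by channel: (x - mean) / std."""
    return tuple((c - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))
```

A pixel exactly equal to the channel means normalizes to (0, 0, 0); the real processor applies the same arithmetic across the whole resized, center-cropped tensor.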
Step 3: Construct Conversation Prompt
Build the input prompt using the appropriate conversation template for the model variant. The template inserts the image token placeholder at the correct position and applies the model-specific formatting (system message, role prefixes, separators).
What happens:
- Conversation mode is auto-detected from model name (llava_v1, llava_llama_2, mistral_instruct, mpt)
- The <image> token is prepended to the user query
- The conversation template formats the prompt with appropriate role markers
- tokenizer_image_token() converts the prompt to token IDs with the IMAGE_TOKEN_INDEX placeholder
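The splicing done by `tokenizer_image_token()` can be sketched as follows. This is a simplified re-implementation for illustration (the real function works with a Hugging Face tokenizer and handles BOS tokens); the toy whitespace tokenizer is purely for the demo:

```python
IMAGE_TOKEN_INDEX = -200  # sentinel LLaVA uses for the <image> placeholder

def tokenize_with_image_token(prompt, tokenize, image_token="<image>"):
    """Split the prompt on the image placeholder, tokenize each text
    chunk, and splice IMAGE_TOKEN_INDEX between the chunks."""
    chunks = [tokenize(c) for c in prompt.split(image_token)]
    ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            ids.append(IMAGE_TOKEN_INDEX)
        ids.extend(chunk)
    return ids

# Toy whitespace "tokenizer" just for demonstration
vocab = {}
toy_tokenize = lambda s: [vocab.setdefault(w, len(vocab)) for w in s.split()]
print(tokenize_with_image_token("<image> What is in this picture?", toy_tokenize))
# → [-200, 0, 1, 2, 3, 4]
```

At generation time the multimodal forward pass looks for the -200 sentinel positions and substitutes the projected CLIP features there, which is why the placeholder must survive tokenization as a distinct index.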
Step 4: Generate Response
Run the model's generate method with the tokenized prompt and image tensor. The generation uses the multimodal forward pass which replaces image token positions with CLIP visual embeddings projected through the MLP, then autoregressively generates text tokens.
What happens:
- The model's prepare_inputs_labels_for_multimodal() fuses image features into the token sequence
- Autoregressive generation produces output tokens using the configured sampling strategy
- Temperature, top-p, beam search, and max token parameters control generation behavior
- The output is decoded and stripped of special tokens to produce the final text response
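How temperature and top-p shape each decoding step can be shown with a generic sampling sketch. This is not LLaVA's internal code (generation is delegated to Hugging Face `generate`); it is a minimal re-implementation of the two knobs over a raw logit list:

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick the next token id from raw logits, illustrating how
    temperature scaling and top-p (nucleus) filtering interact."""
    if temperature == 0:  # temperature 0 -> greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    # Sort token probabilities descending for nucleus filtering
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Keep the smallest prefix whose cumulative probability reaches top_p
    kept, cum = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cum += p
        if cum >= top_p:
            break
    # Sample proportionally from the surviving tokens
    r = rng.random() * sum(p for p, _ in kept)
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]
```

Lower temperature sharpens the distribution toward the top logit, while a small top-p prunes the tail outright; with temperature 0 or a very small top-p, sampling collapses to greedy decoding of the argmax token.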