

Workflow:Haotian Liu LLaVA Single Image Inference

From Leeroopedia
Knowledge Sources
Domains LLMs, Inference, Multimodal, Vision_Language
Last Updated 2026-02-13 23:00 GMT

Overview

Run single-image visual question answering inference using a pre-trained LLaVA model from the command line.

Description

This workflow demonstrates the minimal inference pathway for querying a LLaVA model with an image and a text prompt. It covers loading a pre-trained or fine-tuned LLaVA checkpoint, preprocessing an image through the CLIP vision pipeline, constructing the appropriate conversation template, and generating a text response. The workflow handles multiple model backends (LLaMA, Vicuna, Mistral, MPT) with automatic conversation mode detection.

This is the simplest way to programmatically use LLaVA for inference and serves as the foundation for building custom inference applications.

Usage

Execute this workflow when you want to:

  • Run a quick inference test on a single image with a text query
  • Integrate LLaVA inference into a custom application or script
  • Verify that a trained model checkpoint produces correct visual understanding responses

Execution Steps

Step 1: Load Model and Processor

Initialize the LLaVA model using the unified model builder, which handles the complete loading pipeline: tokenizer initialization, language model instantiation (with optional LoRA or quantization), vision tower loading, and image processor extraction. The builder automatically detects the model variant (LLaMA, Mistral, MPT) from the model name.

What happens:

  • load_pretrained_model() is called with the model path and optional base model
  • The tokenizer is loaded with special image tokens registered
  • The language model is instantiated with the multimodal architecture
  • The CLIP vision tower is loaded and its image processor extracted
  • For LoRA models, adapter weights are loaded and optionally merged

Step 2: Preprocess Image

Load the input image (from a local file or URL) and process it through the CLIP image processor. The processor applies resizing, center cropping, normalization, and optional aspect-ratio-preserving padding to produce the tensor format expected by the vision encoder.

Key considerations:

  • Images can be loaded from local paths or HTTP/HTTPS URLs
  • The processor handles PIL Image conversion to RGB
  • The image tensor is cast to float16 and moved to the model's device
  • Multiple images can be processed for multi-image queries (comma-separated paths)
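
A small helper in the spirit of the loaders used by LLaVA's CLI scripts covers the first two bullets; it is a sketch that assumes Pillow is available (and `requests` for the URL branch, imported lazily).

```python
from io import BytesIO

from PIL import Image


def load_image(image_file: str) -> Image.Image:
    """Load an image from a local path or an HTTP/HTTPS URL as RGB."""
    if image_file.startswith(("http://", "https://")):
        import requests  # only needed for remote images

        response = requests.get(image_file, timeout=30)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content))
    else:
        image = Image.open(image_file)
    # CLIP preprocessing expects 3-channel RGB input.
    return image.convert("RGB")
```

The resulting PIL image is then run through the extracted processor, e.g. `image_processor.preprocess(image, return_tensors="pt")["pixel_values"]`, before the float16 cast and device move described above.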

Step 3: Construct Conversation Prompt

Build the input prompt using the appropriate conversation template for the model variant. The template inserts the image token placeholder at the correct position and applies the model-specific formatting (system message, role prefixes, separators).

What happens:

  • Conversation mode is auto-detected from model name (llava_v1, llava_llama_2, mistral_instruct, mpt)
  • The <image> token is prepended to the user query
  • The conversation template formats the prompt with appropriate role markers
  • tokenizer_image_token() converts the prompt to token IDs with the IMAGE_TOKEN_INDEX placeholder
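
The splicing done by tokenizer_image_token() can be illustrated with a simplified stand-in (it ignores the BOS-token bookkeeping the real helper performs): the prompt is split on the <image> placeholder, each text chunk is tokenized, and the IMAGE_TOKEN_INDEX sentinel (-200 in llava.constants) is inserted between chunks.

```python
IMAGE_TOKEN_INDEX = -200  # sentinel value, as in llava.constants
DEFAULT_IMAGE_TOKEN = "<image>"


def insert_image_tokens(prompt: str, tokenize) -> list[int]:
    """Simplified stand-in for tokenizer_image_token().

    Tokenizes the text around each <image> placeholder and splices in the
    IMAGE_TOKEN_INDEX sentinel, which the multimodal forward pass later
    replaces with projected CLIP visual embeddings.
    """
    token_ids: list[int] = []
    for i, chunk in enumerate(prompt.split(DEFAULT_IMAGE_TOKEN)):
        if i > 0:
            token_ids.append(IMAGE_TOKEN_INDEX)
        token_ids.extend(tokenize(chunk))
    return token_ids
```

With any real tokenizer's encode function passed as `tokenize`, the sentinel ends up exactly where <image> sat in the formatted prompt.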

Step 4: Generate Response

Run the model's generate method with the tokenized prompt and image tensor. The generation uses the multimodal forward pass which replaces image token positions with CLIP visual embeddings projected through the MLP, then autoregressively generates text tokens.

What happens:

  • The model's prepare_inputs_labels_for_multimodal() fuses image features into the token sequence
  • Autoregressive generation produces output tokens using the configured sampling strategy
  • Temperature, top-p, beam search, and max token parameters control generation behavior
  • The output is decoded and stripped of special tokens to produce the final text response
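
Putting the pieces together, the generation step might look like the sketch below, patterned after LLaVA's own CLI scripts. Imports are deferred so the function can be defined without `llava` or `torch` installed; parameter names follow the Hugging Face-style generate call, and newer checkpoints may expect additional arguments (treat this as illustrative, not the canonical implementation).

```python
def answer(model, tokenizer, image_processor, prompt: str, image,
           temperature: float = 0.2, max_new_tokens: int = 512) -> str:
    """Generate a text response for one preprocessed prompt + image (sketch)."""
    import torch
    from llava.constants import IMAGE_TOKEN_INDEX
    from llava.mm_utils import tokenizer_image_token

    # Token IDs with the IMAGE_TOKEN_INDEX sentinel at the <image> position.
    input_ids = tokenizer_image_token(
        prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
    ).unsqueeze(0).to(model.device)

    # CLIP preprocessing, then float16 cast and device move (Step 2).
    image_tensor = image_processor.preprocess(
        image, return_tensors="pt"
    )["pixel_values"].to(model.device, dtype=torch.float16)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor,
            do_sample=temperature > 0,
            temperature=temperature,
            max_new_tokens=max_new_tokens,
        )

    # Decode and strip special tokens to obtain the final text response.
    return tokenizer.batch_decode(
        output_ids, skip_special_tokens=True
    )[0].strip()
```

Greedy decoding falls out of `temperature=0` (via `do_sample=False`); beam search and top-p sampling are enabled through the corresponding `generate` keyword arguments.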

Execution Diagram

GitHub URL

Workflow Repository