Implementation: Haotian Liu LLaVA CLI Main
Overview
Concrete tool for running interactive multi-turn visual chat from the command line. The main() function loads a LLaVA model, processes an image, and enters an interactive conversation loop.
Source
- File: llava/serve/cli.py
- Lines: L27-111
Signature
def main(args) -> None:
"""
Run interactive CLI chat with a LLaVA model.
Args (via argparse namespace):
args.model_path: str # HuggingFace ID or local path to model
args.model_base: str # Base model path (for LoRA adapters)
args.image_file: str # URL or local path to image
args.device: str = 'cuda' # Device to load model on
args.conv_mode: str # Conversation template (auto-detected from model name)
args.temperature: float = 0.2 # Sampling temperature
args.max_new_tokens: int = 512 # Maximum tokens to generate
args.load_8bit: bool # Enable 8-bit quantization
args.load_4bit: bool # Enable 4-bit quantization
"""
CLI Usage
python -m llava.serve.cli \
--model-path liuhaotian/llava-v1.5-13b \
--image-file image.jpg
With 4-bit quantization:
python -m llava.serve.cli \
--model-path liuhaotian/llava-v1.5-13b \
--image-file image.jpg \
--load-4bit
With LoRA adapter:
python -m llava.serve.cli \
--model-path /path/to/lora-adapter \
--model-base liuhaotian/llava-v1.5-13b \
--image-file image.jpg
Import
from llava.serve.cli import main
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | HuggingFace model ID or local checkpoint path |
| model_base | str | For LoRA | Base model path when using LoRA adapters |
| image_file | str | Yes | Path to local image file or HTTP URL |
| device | str | No | Device for model loading (default: cuda) |
| conv_mode | str | No | Conversation template (auto-detected from model name) |
| temperature | float | No | Sampling temperature (default: 0.2) |
| max_new_tokens | int | No | Max tokens to generate (default: 512) |
| load_8bit | bool | No | Enable 8-bit quantization |
| load_4bit | bool | No | Enable 4-bit NF4 quantization |
Outputs
Interactive streaming text responses printed to the terminal. The user interacts via stdin, and the model's responses are streamed token-by-token to stdout.
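The terminal behavior can be imitated with a small stand-in that prints tokens incrementally, flushing after each one, the way a TextStreamer-driven generation appears to the user. This is illustrative only; in the real CLI the tokens come from model.generate().

```python
import sys

def stream_tokens(tokens):
    # Print each token as it "arrives", flushing so the terminal updates
    # immediately; transformers' TextStreamer produces output analogously.
    pieces = []
    for tok in tokens:
        sys.stdout.write(tok)
        sys.stdout.flush()
        pieces.append(tok)
    sys.stdout.write("\n")
    return "".join(pieces)

text = stream_tokens(["The ", "image ", "shows ", "a ", "cat."])
```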
Description
The main() function executes the following sequence:
- Load model -- calls load_pretrained_model() to load the tokenizer, model, and image processor, and to determine the context length.
- Load and process image -- loads the image from a file path or URL, converts it to RGB, and preprocesses it with process_images().
- Auto-detect conversation mode -- selects the appropriate conversation template based on the model name (e.g., llava_v1, llava_llama_2, mistral_instruct).
- Enter interactive loop:
  - Read user input from stdin
  - On the first turn, prepend <image>\n to the user message
  - Append the user message to the Conversation object
  - Build the full prompt via conv.get_prompt()
  - Tokenize with tokenizer_image_token()
  - Call model.generate() with TextStreamer for streaming output
  - Append the assistant response to the conversation
  - Repeat until the user exits
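The interactive loop above can be sketched in isolation. Everything model-specific (load_pretrained_model, process_images, tokenizer_image_token, model.generate with TextStreamer) is replaced here by stubs, so only the conversation bookkeeping and first-turn image-token handling are shown; the stand-in Conversation class approximates llava.conversation.Conversation.

```python
class Conversation:
    """Minimal stand-in for llava.conversation.Conversation."""
    def __init__(self):
        self.messages = []

    def append_message(self, role, text):
        self.messages.append((role, text))

    def get_prompt(self):
        # Skip the None placeholder so the model sees an open assistant turn.
        return "\n".join(f"{r}: {t}" for r, t in self.messages if t is not None)

def fake_generate(prompt):
    # Stub for model.generate() + TextStreamer; returns a canned reply.
    return f"(reply to a {len(prompt)}-char prompt)"

conv = Conversation()
first_turn = True
for user_input in ["What is in the image?", "Describe the colors."]:
    if first_turn:
        user_input = "<image>\n" + user_input  # image token only on turn 1
        first_turn = False
    conv.append_message("USER", user_input)
    conv.append_message("ASSISTANT", None)    # placeholder for the reply
    prompt = conv.get_prompt()
    reply = fake_generate(prompt)
    conv.messages[-1] = ("ASSISTANT", reply)  # fill in the generated reply
    print(reply)
```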
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Repo - LLaVA - https://github.com/haotian-liu/LLaVA |
| Domains | User_Interface, Interactive_Inference |
| Last Updated | 2026-02-13 14:00 GMT |