Implementation: Haotian Liu LLaVA CLI Main
Overview
Concrete tool for running interactive multi-turn visual chat from the command line. The main() function loads a LLaVA model, processes an image, and enters an interactive conversation loop.
Source
- File: llava/serve/cli.py
- Lines: L27-111
Signature
def main(args) -> None:
"""
Run interactive CLI chat with a LLaVA model.
Args (via argparse namespace):
args.model_path: str # HuggingFace ID or local path to model
args.model_base: str # Base model path (for LoRA adapters)
args.image_file: str # URL or local path to image
args.device: str = 'cuda' # Device to load model on
args.conv_mode: str # Conversation template (auto-detected from model name)
args.temperature: float = 0.2 # Sampling temperature
args.max_new_tokens: int = 512 # Maximum tokens to generate
args.load_8bit: bool # Enable 8-bit quantization
args.load_4bit: bool # Enable 4-bit quantization
"""
CLI Usage
python -m llava.serve.cli \
--model-path liuhaotian/llava-v1.5-13b \
--image-file image.jpg
With 4-bit quantization:
python -m llava.serve.cli \
--model-path liuhaotian/llava-v1.5-13b \
--image-file image.jpg \
--load-4bit
With LoRA adapter:
python -m llava.serve.cli \
--model-path /path/to/lora-adapter \
--model-base liuhaotian/llava-v1.5-13b \
--image-file image.jpg
Import
from llava.serve.cli import main
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | HuggingFace model ID or local checkpoint path |
| model_base | str | For LoRA | Base model path when using LoRA adapters |
| image_file | str | Yes | Path to local image file or HTTP URL |
| device | str | No | Device for model loading (default: cuda) |
| conv_mode | str | No | Conversation template (auto-detected from model name) |
| temperature | float | No | Sampling temperature (default: 0.2) |
| max_new_tokens | int | No | Max tokens to generate (default: 512) |
| load_8bit | bool | No | Enable 8-bit quantization |
| load_4bit | bool | No | Enable 4-bit NF4 quantization |
Outputs
Interactive streaming text responses printed to the terminal. The user interacts via stdin, and the model's responses are streamed token-by-token to stdout.
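The terminal behavior can be imitated with a small stand-in that prints tokens incrementally, flushing after each one, the way a TextStreamer-driven generation appears to the user. This is illustrative only; in the real CLI the tokens come from model.generate().

```python
import sys

def stream_tokens(tokens):
    # Print each token as it "arrives", flushing so the terminal updates
    # immediately; transformers' TextStreamer produces output analogously.
    pieces = []
    for tok in tokens:
        sys.stdout.write(tok)
        sys.stdout.flush()
        pieces.append(tok)
    sys.stdout.write("\n")
    return "".join(pieces)

text = stream_tokens(["The ", "image ", "shows ", "a ", "cat."])
```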
Description
The main() function executes the following sequence:
- Load model -- calls load_pretrained_model() to load the tokenizer, model, and image processor, and to determine the context length.
- Load and process image -- loads the image from a file path or URL, converts it to RGB, and preprocesses it with process_images().
- Auto-detect conversation mode -- selects the appropriate conversation template based on the model name (e.g., llava_v1, llava_llama_2, mistral_instruct).
- Enter interactive loop:
  - Read user input from stdin
  - On the first turn, prepend <image>\n to the user message
  - Append the user message to the Conversation object
  - Build the full prompt via conv.get_prompt()
  - Tokenize with tokenizer_image_token()
  - Call model.generate() with TextStreamer for streaming output
  - Append the assistant response to the conversation
  - Repeat until the user exits
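The interactive loop above can be sketched in isolation. Everything model-specific (load_pretrained_model, process_images, tokenizer_image_token, model.generate with TextStreamer) is replaced here by stubs, so only the conversation bookkeeping and first-turn image-token handling are shown; the stand-in Conversation class approximates llava.conversation.Conversation.

```python
class Conversation:
    """Minimal stand-in for llava.conversation.Conversation."""
    def __init__(self):
        self.messages = []

    def append_message(self, role, text):
        self.messages.append((role, text))

    def get_prompt(self):
        # Skip the None placeholder so the model sees an open assistant turn.
        return "\n".join(f"{r}: {t}" for r, t in self.messages if t is not None)

def fake_generate(prompt):
    # Stub for model.generate() + TextStreamer; returns a canned reply.
    return f"(reply to a {len(prompt)}-char prompt)"

conv = Conversation()
first_turn = True
for user_input in ["What is in the image?", "Describe the colors."]:
    if first_turn:
        user_input = "<image>\n" + user_input  # image token only on turn 1
        first_turn = False
    conv.append_message("USER", user_input)
    conv.append_message("ASSISTANT", None)    # placeholder for the reply
    prompt = conv.get_prompt()
    reply = fake_generate(prompt)
    conv.messages[-1] = ("ASSISTANT", reply)  # fill in the generated reply
    print(reply)
```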
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Repo - LLaVA - https://github.com/haotian-liu/LLaVA |
| Domains | User_Interface, Interactive_Inference |
| Last Updated | 2026-02-13 14:00 GMT |