Implementation:OpenGVLab InternVL LLaVA CLI
| Knowledge Sources | |
|---|---|
| Domains | Inference, CLI, LLaVA |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Interactive command-line chat interface for LLaVA multimodal models, enabling conversational image understanding from the terminal.
Description
This script provides a terminal-based interactive demo for LLaVA models. It loads a pretrained model via load_pretrained_model(), optionally with 8-bit or 4-bit quantization, then enters an interactive loop. The input image is read by the load_image() helper, which accepts either a local file path or an HTTP/HTTPS URL.

The conversation mode is auto-detected from the model name and resolves to one of the llava_llama_2, llava_v1, mpt, or llava_v0 templates; --conv-mode overrides the detected value. On the first user message, the image is prepended to the prompt using DEFAULT_IMAGE_TOKEN, wrapped in image start/end tokens when the model config calls for them. Subsequent messages are text-only but stay within the same conversation context, so the history carries across turns.

Generation runs under torch.inference_mode() and samples with the configured temperature and max_new_tokens. Hugging Face's TextStreamer prints tokens as they are produced, and a KeywordsStoppingCriteria halts generation at the conversation separator.
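The template selection and first-turn prompt construction described above can be sketched in plain Python. This is a minimal illustration, not the script's actual code: the function names detect_conv_mode and build_user_message are hypothetical, the substring rules used for detection are assumed, and the token strings are stand-ins for the constants the script imports.

```python
# Stand-ins for the constants the script imports; the values are assumptions.
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"


def detect_conv_mode(model_name, override=None):
    """Pick a conversation template name; an explicit override wins."""
    if override is not None:
        return override
    name = model_name.lower()
    # Assumed substring rules, checked from most to least specific.
    if "llama-2" in name:
        return "llava_llama_2"
    if "v1" in name:
        return "llava_v1"
    if "mpt" in name:
        return "mpt"
    return "llava_v0"


def build_user_message(text, is_first_turn, use_im_start_end=False):
    """Prepend the image placeholder on the first turn only."""
    if not is_first_turn:
        return text
    if use_im_start_end:
        image_part = (
            DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
        )
    else:
        image_part = DEFAULT_IMAGE_TOKEN
    return image_part + "\n" + text


if __name__ == "__main__":
    print(detect_conv_mode("llava-v1.5-7b"))
    print(repr(build_user_message("What do you see?", is_first_turn=True)))
    print(repr(build_user_message("What color is it?", is_first_turn=False)))
```

Keeping the image placeholder out of later turns means the image is referenced once in the history, while follow-up questions still see it through the accumulated context.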
Usage
Use this script for quick terminal-based testing of LLaVA models with a single image and multi-turn text conversation, without needing to set up a web server.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/serve/cli.py
- Lines: 1-125
Signature
def load_image(image_file): ...
def main(args): ...
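Since only the signatures are listed here, the following is one possible shape for load_image(), assuming it dispatches on an http:// or https:// prefix and decodes to an RGB image with Pillow. The helpers is_remote and read_image_bytes are hypothetical names introduced for illustration, and the use of urllib rather than any particular HTTP client is a choice made to keep the sketch dependency-light.

```python
from io import BytesIO
from pathlib import Path
from urllib.request import urlopen


def is_remote(image_file):
    """Return True when the argument looks like an HTTP/HTTPS URL."""
    return image_file.startswith(("http://", "https://"))


def read_image_bytes(image_file, timeout=10.0):
    """Fetch raw bytes from a URL or read them from a local path."""
    if is_remote(image_file):
        with urlopen(image_file, timeout=timeout) as response:
            return response.read()
    return Path(image_file).read_bytes()


def load_image(image_file):
    """Decode the image into an RGB PIL image (requires Pillow)."""
    # Imported lazily so the helpers above stay standard-library only.
    from PIL import Image

    return Image.open(BytesIO(read_image_bytes(image_file))).convert("RGB")
```

Converting to RGB up front normalizes grayscale or RGBA inputs before they reach the model's image processor.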
Import
# Standalone script, run directly:
# python -m llava.serve.cli --model-path <path> --image-file <path>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model-path | str | Yes | Path to the pretrained LLaVA model |
| --image-file | str | Yes | Path or URL to the input image |
| --model-base | str | No | Base model path for LoRA models |
| --device | str | No | Device to run on (default: "cuda") |
| --conv-mode | str | No | Conversation template override |
| --temperature | float | No | Sampling temperature (default: 0.2) |
| --max-new-tokens | int | No | Maximum generated tokens (default: 512) |
| --load-8bit | flag | No | Enable 8-bit quantization |
| --load-4bit | flag | No | Enable 4-bit quantization |
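The table above maps directly onto an argparse parser. The sketch below mirrors the listed names, types, and defaults; the None defaults for --model-base and --conv-mode are assumptions, since the table does not state them, and build_parser is a hypothetical helper name.

```python
import argparse


def build_parser():
    """Build a parser matching the documented inputs (illustrative only)."""
    parser = argparse.ArgumentParser(description="LLaVA CLI chat (sketch)")
    parser.add_argument("--model-path", type=str, required=True)
    parser.add_argument("--image-file", type=str, required=True)
    parser.add_argument("--model-base", type=str, default=None)  # assumed default
    parser.add_argument("--device", type=str, default="cuda")
    parser.add_argument("--conv-mode", type=str, default=None)  # assumed default
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--max-new-tokens", type=int, default=512)
    parser.add_argument("--load-8bit", action="store_true")
    parser.add_argument("--load-4bit", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--model-path", "liuhaotian/llava-v1.5-7b", "--image-file", "cat.jpg"]
)
print(args.device, args.temperature, args.max_new_tokens)  # cuda 0.2 512
```

argparse turns the dashes in each flag into underscores, so the values are read as args.max_new_tokens, args.load_8bit, and so on.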
Outputs
| Name | Type | Description |
|---|---|---|
| stdout | text | Streamed model responses printed to terminal |
Usage Examples
Basic Usage
# From command line:
# python -m llava.serve.cli \
# --model-path liuhaotian/llava-v1.5-7b \
# --image-file https://example.com/image.jpg \
# --temperature 0.2
# Interactive session:
# user: What do you see in this image?
# assistant: I see a cat sitting on a windowsill...