Principle:Haotian liu LLaVA CLI Interactive Chat
Overview
Interactive command-line interface pattern for multi-turn visual conversation with streaming text output.
Description
The CLI chat provides a terminal-based REPL (Read-Eval-Print Loop) for multi-turn conversation with a LLaVA model about a single image. Unlike the distributed controller-worker-Gradio stack, the CLI loads the model directly into the process.
Key characteristics:
- Direct model loading -- No controller or worker architecture; the model is loaded directly into the CLI process.
- Single-image context -- The image is processed once at startup and reused across all conversation turns.
- Multi-turn conversation -- The user can ask multiple sequential questions, with full conversation history maintained across turns.
- Streaming output -- Responses stream token-by-token using TextStreamer for real-time terminal output.
- Conversation history -- Messages are appended to a Conversation object, and the full prompt is regenerated each turn.
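The loop structure described by these characteristics can be sketched as follows. This is a minimal illustration, not the repository's code: `chat_loop`, `read_line`, and `respond` are hypothetical names, the model call is replaced by a stub, and ending the session on an empty line is an assumption of the sketch.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


def chat_loop(read_line: Callable[[], str],
              respond: Callable[[List[Turn]], str]) -> List[Turn]:
    """Read-eval-print loop that keeps the whole history in one process."""
    history: List[Turn] = []
    while True:
        try:
            user_text = read_line()
        except EOFError:
            break  # stdin closed: leave the loop
        if not user_text.strip():
            break  # assumption: an empty line ends the session
        history.append(("USER", user_text))
        # `respond` stands in for the model call; it sees the full history.
        reply = respond(history)
        history.append(("ASSISTANT", reply))
    return history


# Scripted demo: two questions, then an empty line to exit.
scripted = iter(["What is in the image?", "What colour is it?", ""])
demo_history = chat_loop(
    lambda: next(scripted),
    lambda h: f"(reply to turn {len(h) // 2 + 1})",
)
```

In an interactive terminal, `read_line` could be `lambda: input("USER: ")`, with `respond` wrapping the actual generation call.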
Usage
Use for quick testing and debugging of LLaVA models without the overhead of deploying the controller-worker-Gradio stack.
Supported configurations:
- LoRA models -- Use --model-base to specify the base model when loading a LoRA adapter.
- Quantized inference -- Use --load-4bit or --load-8bit for inference on smaller GPUs.
- Custom conversation modes -- Use --conv-mode to override the auto-detected conversation template.
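As a rough illustration of how these flags might be declared, the sketch below uses Python's standard argparse module. Only the four flag names come from the text above; the defaults, help strings, and the example path `base-model-path` are assumptions, and the actual CLI may define these options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the flags named in this section."""
    parser = argparse.ArgumentParser(description="Illustrative CLI chat options")
    # Base model to combine with a LoRA adapter (default is an assumption).
    parser.add_argument("--model-base", type=str, default=None,
                        help="base model path when loading a LoRA adapter")
    # Quantized loading switches; both are off unless requested.
    parser.add_argument("--load-4bit", action="store_true",
                        help="load the model with 4-bit quantization")
    parser.add_argument("--load-8bit", action="store_true",
                        help="load the model with 8-bit quantization")
    # None here means "use the auto-detected template" in this sketch.
    parser.add_argument("--conv-mode", type=str, default=None,
                        help="override the auto-detected conversation template")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--model-base", "base-model-path", "--load-4bit"])
```

argparse converts dashes to underscores, so the values are read as `args.model_base`, `args.load_4bit`, `args.load_8bit`, and `args.conv_mode`.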
Theoretical Basis
Multi-turn conversation is implemented by appending messages to a Conversation object and regenerating the full prompt each turn. This means the full conversation history is re-tokenized on every turn, which is acceptable for interactive use but not optimal for high-throughput serving.
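The append-then-regenerate pattern can be sketched with a small stand-in class. `SketchConversation`, its fields, and the "ROLE: text" layout are hypothetical; the real Conversation object and its templates may differ. The point is only that each turn rebuilds the prompt from the whole message list.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SketchConversation:
    """Hypothetical stand-in for a conversation template object."""
    system: str
    sep: str = " "
    messages: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def append_message(self, role: str, text: Optional[str]) -> None:
        # A text of None marks the slot the model is about to fill.
        self.messages.append((role, text))

    def get_prompt(self) -> str:
        # Rebuild the prompt from scratch out of the full history.
        parts = [self.system]
        for role, text in self.messages:
            parts.append(f"{role}:" if text is None else f"{role}: {text}")
        return self.sep.join(parts)


conv = SketchConversation(system="A chat between a user and an assistant.")

# Turn 1: add the question and an open assistant slot, then build the prompt.
conv.append_message("USER", "What is shown?")
conv.append_message("ASSISTANT", None)
prompt_turn_1 = conv.get_prompt()

# Store the generated answer, then repeat for turn 2.
conv.messages[-1] = ("ASSISTANT", "A cat.")
conv.append_message("USER", "What colour?")
conv.append_message("ASSISTANT", None)
prompt_turn_2 = conv.get_prompt()
```

Because `prompt_turn_2` contains everything in `prompt_turn_1` plus the new exchange, the text to tokenize grows with every turn, which is the cost noted above.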
Token streaming uses TextStreamer, which hooks into model.generate() to print tokens to stdout as they are produced. This provides immediate visual feedback in the terminal.
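The streaming hook can be illustrated without loading a model. The sketch below is a simplified stand-in, not the Transformers class: `SketchStreamer` and `fake_generate` are hypothetical, and the streamer receives ready-made text pieces, whereas the real TextStreamer receives token ids and decodes them with a tokenizer. What it keeps is the shape of the interaction: the generation loop calls `put` for each new piece and `end` once at the finish.

```python
import io
import sys
from typing import Iterable, TextIO


class SketchStreamer:
    """Simplified stand-in: writes each piece as soon as it arrives."""

    def __init__(self, out: TextIO = sys.stdout) -> None:
        self.out = out

    def put(self, piece: str) -> None:
        self.out.write(piece)
        self.out.flush()  # flush so the terminal updates immediately

    def end(self) -> None:
        self.out.write("\n")
        self.out.flush()


def fake_generate(pieces: Iterable[str], streamer: SketchStreamer) -> str:
    """Toy generation loop that notifies the streamer at every step."""
    produced = []
    for piece in pieces:
        streamer.put(piece)  # shown to the user before generation finishes
        produced.append(piece)
    streamer.end()
    return "".join(produced)


# Capture the stream in a buffer here; the default target is sys.stdout.
buffer = io.StringIO()
full_text = fake_generate(["A ", "small ", "cat."], SketchStreamer(buffer))
```

With the real library, the corresponding step is to pass a streamer object to model.generate() through its `streamer` argument.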
Image token placement: The <image> token is prepended to the first user message only. On subsequent turns, the image context is carried implicitly through the conversation history and the cached image tensor.
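A minimal sketch of this placement rule follows. `prepare_user_text` and the `image_pending` flag are hypothetical names, and the newline between the token and the message is an assumption of the sketch rather than something stated above.

```python
from typing import Tuple

IMAGE_TOKEN = "<image>"


def prepare_user_text(user_text: str, image_pending: bool) -> Tuple[str, bool]:
    """Prepend the image token only while the image is still pending.

    Returns the text to append to the conversation and the updated flag.
    """
    if image_pending:
        # First user message: attach the token and clear the flag.
        return f"{IMAGE_TOKEN}\n{user_text}", False
    # Later turns: the token is already present in the history.
    return user_text, image_pending


pending = True
first, pending = prepare_user_text("What is shown?", pending)
second, pending = prepare_user_text("What colour?", pending)
```

Since the prompt is rebuilt from the full history each turn, the token placed in the first message stays in every later prompt without being added again.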
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Repo - LLaVA - https://github.com/haotian-liu/LLaVA |
| Domains | User_Interface, Interactive_Inference |
| Last Updated | 2026-02-13 14:00 GMT |