Implementation:Mit han lab Llm awq VILA10 Demo
| Knowledge Sources | |
|---|---|
| Domains | Demo, Multimodal |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Interactive multimodal chat demo for VILA 1.0 models with W16A16 and W4A16 precision support, device warmup, image processing via LLaVA utilities, and streaming text generation.
Description
This script provides an interactive command-line chat interface for VILA 1.0 (LLaVA-based) vision-language models. It supports two inference modes: unquantized FP16 (W16A16) and AWQ-quantized (W4A16).
image_parser splits the --image-file argument string on the separator (--im-sep, default comma) to produce a list of image file paths, enabling multiple images to be specified in a single argument.
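The splitting behavior described above can be sketched in a few lines. This is a minimal illustration based only on the description here, not a copy of the script; the attribute names image_file and im_sep follow the I/O contract on this page.

```python
from argparse import Namespace


def image_parser(args):
    """Split the --image-file string on the --im-sep separator."""
    return args.image_file.split(args.im_sep)


# Two paths passed as a single comma-separated argument.
args = Namespace(image_file="image1.jpg,image2.jpg", im_sep=",")
paths = image_parser(args)
print(paths)  # ['image1.jpg', 'image2.jpg']
```

Because the separator is configurable, paths or URLs that themselves contain commas can still be passed by choosing a different --im-sep value.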
main handles the complete demo lifecycle:
Model Loading: The LlavaLlamaForCausalLM model is instantiated from an AutoConfig with accelerated initialization, meaning the parameter init functions are disabled so that no time is spent initializing weights that the checkpoint will overwrite. The tokenizer is loaded separately and used to convert LLAVA_DEFAULT_IMAGE_PATCH_TOKEN into its token ID. The vision tower is loaded if it has not already been initialized, and the image_processor is taken from it. Both the model and the vision tower are cast to half precision.
Precision Modes: Two modes are supported. W16A16 loads the unquantized FP16 checkpoint using load_checkpoint_and_dispatch from the Accelerate library with module-level sharding (supporting OPT, LLaMA, Bloom, MPT, and CLIP decoder layers). W4A16 loads AWQ-quantized weights via load_awq_model (4-bit, group size 128) and then applies two fused kernel replacements, make_quant_attn and make_quant_norm. Any other precision value raises NotImplementedError.
Image Preparation: Images are loaded via load_images, optionally visualized in the terminal with vis_images, and preprocessed using process_images (which applies square padding and the model's image processor). The resulting tensor is moved to the target device in float16 format. device_warmup and tune_llava_patch_embedding optimize CUDA execution for the vision tower.
Chat Loop: An interactive loop reads user prompts, manages image token insertion (it detects <image> placeholders in the prompt and otherwise prepends the image tokens), and generates streaming responses via LlavaStreamGenerator. Conversation history is carried across turns via model_prompter.update_template when --single_round is not set and the KV cache size (--max_seq_len) exceeds 512 tokens. TimeStats collects timing information during the session. Submitting an empty prompt ends the session and prints the collected statistics.
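The precision dispatch and the per-turn prompt handling described above can be sketched as plain Python. This is an illustrative sketch under stated assumptions, not the script's code: select_loader, prepare_prompt, and keep_history are hypothetical helper names, the two loaders are caller-supplied stand-ins for the real checkpoint loading, the newline used to join image tokens is a guess, and reading "KV cache" as the size set by --max_seq_len is an interpretation of the description.

```python
IMAGE_TOKEN = "<image>"    # placeholder named in the description
HISTORY_MIN_SEQ_LEN = 512  # threshold named in the description


def select_loader(precision, load_fp16, load_awq):
    """Pick a loader for the precision mode (both loaders are stand-ins)."""
    if precision == "W16A16":
        return load_fp16
    if precision == "W4A16":
        return load_awq
    raise NotImplementedError(f"Precision {precision} is not supported.")


def prepare_prompt(user_text, num_images, first_turn):
    """Ensure image placeholders are present on the first turn.

    Assumption: tokens are only prepended on the first turn, and only
    when the user did not place <image> in the prompt themselves.
    """
    if not first_turn or IMAGE_TOKEN in user_text:
        return user_text
    return "\n".join([IMAGE_TOKEN] * num_images + [user_text])


def keep_history(single_round, max_seq_len):
    """Carry history only in multi-turn mode with a large enough cache."""
    return (not single_round) and max_seq_len > HISTORY_MIN_SEQ_LEN


loader = select_loader("W4A16", lambda: "fp16 model", lambda: "awq model")
print(loader())
print(repr(prepare_prompt("Describe both images", 2, first_turn=True)))
print(keep_history(single_round=False, max_seq_len=2048))
```

Keeping the history decision in one predicate makes the trade-off explicit: with a small cache, earlier turns are dropped rather than risking overflow.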
Usage
Run from the command line to start an interactive VILA 1.0 chat session:
```bash
# W4A16 quantized inference with image
python tinychat/vila10_demo.py \
    --model-path /path/to/llava-v1.5-7b \
    --quant-path /path/to/llava-v1.5-7b-w4-g128-awq.pt \
    --precision W4A16 \
    --image-file image.jpg

# Unquantized FP16 (W16A16) inference
python tinychat/vila10_demo.py \
    --model-path /path/to/llava-v1.5-7b \
    --precision W16A16 \
    --image-file https://example.com/image.jpg
```
Code Reference
Source Location
- Repository: Mit_han_lab_Llm_awq
- File: tinychat/vila10_demo.py
- Lines: 1-233
Signature
def image_parser(args):
def main(args):
Import
# CLI script, run directly:
python tinychat/vila10_demo.py [OPTIONS]
I/O Contract
image_parser
| Parameter | Type | Description |
|---|---|---|
| args | Namespace | Parsed CLI args with image_file (str) and im_sep (str) attributes |
| Returns | Type | Description |
|---|---|---|
| image_paths | list[str] | List of image file paths split on separator |
CLI Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| --model_type | str | LLaMa | Model type for prompt template selection |
| --model-path | str | (required) | Path to LLaVA/VILA 1.0 model checkpoint |
| --quant-path | str | (path) | Path to AWQ quantized weight file |
| --precision | str | W4A16 | Compute precision: "W16A16" or "W4A16" |
| --image-file | str | (URL) | Comma-separated image paths or URLs |
| --im-sep | str | , | Separator for multiple image paths |
| --device | str | cuda | CUDA device |
| --max_seq_len | int | 2048 | Maximum sequence length / KV cache size |
| --single_round | flag | False | Disable multi-turn conversation memory |
| --vis-image | flag | False | Visualize input images in terminal |
| --empty-prompt | flag | False | Use empty prompt template |
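The argument table above maps naturally onto an argparse parser. The following is a hedged sketch reconstructed from the table, not the script's actual parser: build_parser is a hypothetical name, and the defaults for --quant-path and --image-file are set to None here because this page does not give their real values.

```python
import argparse


def build_parser():
    """Build a parser mirroring the CLI table (sketch, not the original)."""
    p = argparse.ArgumentParser(description="VILA 1.0 chat demo (sketch)")
    p.add_argument("--model_type", type=str, default="LLaMa")
    p.add_argument("--model-path", type=str, required=True)
    # Real default is a path that this page does not reproduce.
    p.add_argument("--quant-path", type=str, default=None)
    p.add_argument("--precision", type=str, default="W4A16")
    # Real default is a URL that this page does not reproduce.
    p.add_argument("--image-file", type=str, default=None)
    p.add_argument("--im-sep", type=str, default=",")
    p.add_argument("--device", type=str, default="cuda")
    p.add_argument("--max_seq_len", type=int, default=2048)
    p.add_argument("--single_round", action="store_true")
    p.add_argument("--vis-image", action="store_true")
    p.add_argument("--empty-prompt", action="store_true")
    return p


args = build_parser().parse_args(
    ["--model-path", "/path/to/llava-v1.5-7b",
     "--image-file", "image1.jpg,image2.jpg"]
)
print(args.precision, args.max_seq_len, args.image_file.split(args.im_sep))
```

Note that argparse exposes dashed options with underscores, so --model-path is read as args.model_path and --im-sep as args.im_sep.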
Interactive I/O
| Direction | Description |
|---|---|
| Input | User text prompts via stdin; <image> placeholders supported; empty input exits |
| Output | Streamed assistant responses to stdout with timing statistics on exit |
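The input and output behavior in this table can be illustrated with a small loop that streams chunks as they arrive and stops on empty input. This is a sketch only: chat_loop and echo_stream are hypothetical names, the echo generator stands in for the model's streaming generator, and the timing summary is a simplified stand-in for what TimeStats reports.

```python
import sys
import time


def chat_loop(read_prompt, generate_stream, write=sys.stdout.write):
    """Run turns until the user submits empty input; return turn timings."""
    timings = []
    while True:
        prompt = read_prompt()
        if prompt == "":  # empty input ends the session
            break
        start = time.perf_counter()
        for chunk in generate_stream(prompt):
            write(chunk)  # emit each chunk as soon as it is produced
        write("\n")
        timings.append(time.perf_counter() - start)
    write(f"turns: {len(timings)}, total time: {sum(timings):.3f}s\n")
    return timings


def echo_stream(prompt):
    """Stand-in generator that streams the prompt back word by word."""
    for word in prompt.split():
        yield word + " "


# Scripted inputs keep the example self-contained; the last one is empty.
inputs = iter(["Describe both images", ""])
captured = []
timings = chat_loop(lambda: next(inputs), echo_stream, captured.append)
print("".join(captured), end="")
```

For a real terminal session, the same loop could be driven with `chat_loop(lambda: input("USER: "), echo_stream)`.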
Usage Examples
```bash
# Quantized VILA 1.0 with multiple images
python tinychat/vila10_demo.py \
    --model_type LLaMa \
    --model-path /models/llava-v1.5-7b \
    --quant-path /models/llava-v1.5-7b-w4-g128-awq.pt \
    --precision W4A16 \
    --image-file "image1.jpg,image2.jpg" \
    --vis-image \
    --max_seq_len 2048

# Interactive session:
# USER: Describe both images
# ASSISTANT: The first image shows...
# USER: What is different between them?
# ASSISTANT: The key differences are...
# USER: (empty to exit)
```