Implementation:Mit han lab Llm awq VILA10 Demo

Knowledge Sources	Mit_han_lab_Llm_awq
Domains	Demo, Multimodal
Last Updated	2026-02-15 00:00 GMT

Overview

Interactive multimodal chat demo for VILA 1.0 models with W16A16 and W4A16 precision support, device warmup, image processing via LLaVA utilities, and streaming text generation.

Description

This script provides an interactive command-line chat interface for VILA 1.0 (LLaVA-based) vision-language models. It supports two inference modes: unquantized FP16 (W16A16) and AWQ-quantized (W4A16).

image_parser splits the --image-file argument string on the separator (--im-sep, default comma) to produce a list of image file paths, enabling multiple images to be specified in a single argument.
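
The splitting described above can be sketched as follows. This is a minimal illustration, assuming the function does nothing beyond a plain string split; the example_args namespace is a hypothetical stand-in for the parsed CLI arguments.

```python
from argparse import Namespace


def image_parser(args: Namespace) -> list[str]:
    """Split the --image-file string on the --im-sep separator."""
    return args.image_file.split(args.im_sep)


# Hypothetical arguments, standing in for the parsed CLI namespace.
example_args = Namespace(image_file="image1.jpg,image2.jpg", im_sep=",")
image_paths = image_parser(example_args)
print(image_paths)  # ['image1.jpg', 'image2.jpg']
```

A single path without a separator comes back as a one-element list, so downstream code can always iterate over the result.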

main handles the complete demo lifecycle:

Model Loading: The LlavaLlamaForCausalLM model is built from a configuration loaded via AutoConfig, with the parameter-initialization functions disabled to speed up construction. The tokenizer is loaded separately to convert the LLAVA_DEFAULT_IMAGE_PATCH_TOKEN into its token ID. The vision tower is loaded if not already initialized, and the image_processor is extracted from it. The model and vision tower are cast to half precision.

Precision Modes: Two modes are supported. W16A16 loads the unquantized FP16 checkpoint using load_checkpoint_and_dispatch from the Accelerate library with module-level sharding (supporting OPT, LLaMA, Bloom, MPT, and CLIP decoder layers). W4A16 loads AWQ-quantized weights via load_awq_model (4-bit, group size 128) and applies fused kernel replacements: make_quant_attn and make_quant_norm. Any other precision value raises NotImplementedError.
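
The branching between the two modes can be sketched as a small dispatcher. The loader callables here are hypothetical stubs: in the script the first branch would correspond to load_checkpoint_and_dispatch and the second to load_awq_model followed by make_quant_attn and make_quant_norm, none of which are invoked in this sketch.

```python
def load_for_precision(precision, load_fp16, load_awq):
    """Pick the loading path for the requested precision string."""
    if precision == "W16A16":
        return load_fp16()  # unquantized FP16 checkpoint
    if precision == "W4A16":
        return load_awq()   # 4-bit AWQ weights plus fused kernels
    raise NotImplementedError(f"Precision {precision} is not supported.")


# Stub loaders that only report which path was taken.
chosen = load_for_precision("W4A16", lambda: "fp16 path", lambda: "awq path")
print(chosen)  # awq path
```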

Image Preparation: Images are loaded via load_images, optionally visualized in the terminal with vis_images, and preprocessed using process_images (which applies square padding and the model's image processor). The resulting tensor is moved to the target device in float16 format. device_warmup and tune_llava_patch_embedding optimize CUDA execution for the vision tower.
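
To make the square-padding step concrete, the sketch below computes only the geometry: the side of the square canvas and the offset at which the original image would be pasted. It assumes the image is centered on the canvas, which is how square padding is commonly done; square_pad_geometry is a hypothetical helper, and the background fill and the actual pixel copy are not modeled.

```python
def square_pad_geometry(width: int, height: int) -> tuple[int, tuple[int, int]]:
    """Return (side, (x_offset, y_offset)) for centering an image on a square canvas."""
    side = max(width, height)
    x_offset = (side - width) // 2   # non-zero only for portrait images
    y_offset = (side - height) // 2  # non-zero only for landscape images
    return side, (x_offset, y_offset)


print(square_pad_geometry(640, 480))  # (640, (0, 80))
print(square_pad_geometry(300, 500))  # (500, (100, 0))
```

Padding to a square before resizing preserves the aspect ratio of the content, at the cost of some unused border area.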

Chat Loop: An interactive loop reads user prompts, handles image token insertion (keeping any <image> placeholders the user typed, or prepending image tokens when none are present), and streams responses via LlavaStreamGenerator. When --single_round is not set and the KV cache size (--max_seq_len) exceeds 512 tokens, model_prompter.update_template carries the conversation into the next turn. TimeStats collects and displays timing information. Empty input ends the session and prints the statistics.
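
The control flow of the loop can be sketched with the model replaced by a stub. Several details are assumptions rather than facts about the script: that one <image> placeholder is prepended per image, that this happens only on the first turn, and that the 512-token threshold is checked against the configured cache size. fake_stream is a hypothetical stand-in for LlavaStreamGenerator, and the memory list stands in for update_template.

```python
IMAGE_PLACEHOLDER = "<image>"


def prepare_prompt(user_text, num_images):
    """Keep user-placed placeholders, otherwise prepend one per image."""
    if num_images == 0 or IMAGE_PLACEHOLDER in user_text:
        return user_text
    return IMAGE_PLACEHOLDER * num_images + "\n" + user_text


def chat_loop(read_prompt, generate_stream, num_images,
              single_round=False, kv_cache_size=2048):
    """Run turns until an empty prompt; return the turns kept as memory."""
    memory = []
    first_turn = True
    while True:
        user_text = read_prompt()
        if user_text == "":
            break  # empty input ends the session
        # Assumption: image tokens are attached on the first turn only.
        prompt = prepare_prompt(user_text, num_images) if first_turn else user_text
        first_turn = False
        reply = "".join(generate_stream(prompt))  # consume streamed chunks
        if not single_round and kv_cache_size > 512:
            memory.append((prompt, reply))  # stands in for update_template
    return memory


def fake_stream(prompt):
    """Hypothetical generator standing in for LlavaStreamGenerator."""
    yield from ["The ", "image ", "shows..."]


prompts = iter(["Describe the image", "What color is it?", ""])
memory = chat_loop(lambda: next(prompts), fake_stream, num_images=1)
```

Passing the prompt reader and the generator as parameters keeps the loop testable without a terminal or a GPU.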

Usage

Run from the command line to start an interactive VILA 1.0 chat session:

# W4A16 quantized inference with image
python tinychat/vila10_demo.py \
    --model-path /path/to/llava-v1.5-7b \
    --quant-path /path/to/llava-v1.5-7b-w4-g128-awq.pt \
    --precision W4A16 \
    --image-file image.jpg

# Unquantized FP16 (W16A16) inference
python tinychat/vila10_demo.py \
    --model-path /path/to/llava-v1.5-7b \
    --precision W16A16 \
    --image-file https://example.com/image.jpg

Code Reference

Source Location

Repository: Mit_han_lab_Llm_awq
File: tinychat/vila10_demo.py
Lines: 1-233

Signature

def image_parser(args):

def main(args):

Import

# CLI script, run directly:
python tinychat/vila10_demo.py [OPTIONS]

I/O Contract

image_parser

Parameter	Type	Description
args	Namespace	Parsed CLI args with image_file (str) and im_sep (str) attributes

Returns	Type	Description
image_paths	list[str]	List of image file paths split on separator

CLI Arguments

Argument	Type	Default	Description
--model_type	str	LLaMa	Model type for prompt template selection
--model-path	str	(required)	Path to LLaVA/VILA 1.0 model checkpoint
--quant-path	str	(path)	Path to AWQ quantized weight file
--precision	str	W4A16	Compute precision: "W16A16" or "W4A16"
--image-file	str	(URL)	Comma-separated image paths or URLs
--im-sep	str	,	Separator for multiple image paths
--device	str	cuda	Target device for the model and image tensors
--max_seq_len	int	2048	Maximum sequence length / KV cache size
--single_round	flag	False	Disable multi-turn conversation memory
--vis-image	flag	False	Visualize input images in terminal
--empty-prompt	flag	False	Use empty prompt template
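
For orientation, the table above could be expressed with argparse roughly as follows. This is a reconstruction from the table rather than the script's actual parser: the defaults for --quant-path and --image-file are not given in full above, so None is used as a placeholder, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented CLI arguments."""
    parser = argparse.ArgumentParser(description="VILA 1.0 chat demo (sketch)")
    parser.add_argument("--model_type", type=str, default="LLaMa")
    parser.add_argument("--model-path", type=str, required=True)
    # Placeholder defaults: the real values are elided in the table.
    parser.add_argument("--quant-path", type=str, default=None)
    parser.add_argument("--image-file", type=str, default=None)
    parser.add_argument("--precision", type=str, default="W4A16")
    parser.add_argument("--im-sep", type=str, default=",")
    parser.add_argument("--device", type=str, default="cuda")
    parser.add_argument("--max_seq_len", type=int, default=2048)
    parser.add_argument("--single_round", action="store_true")
    parser.add_argument("--vis-image", action="store_true")
    parser.add_argument("--empty-prompt", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(
    ["--model-path", "/path/to/model", "--image-file", "a.jpg,b.jpg"]
)
print(args.model_path, args.precision, args.im_sep)
```

argparse converts hyphens in option names to underscores, so --model-path is read back as args.model_path and --im-sep as args.im_sep.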

Interactive I/O

Direction	Description
Input	User text prompts via stdin; `<image>` placeholders supported; empty input exits
Output	Streamed assistant responses to stdout with timing statistics on exit

Usage Examples

# Quantized VILA 1.0 with multiple images
python tinychat/vila10_demo.py \
    --model_type LLaMa \
    --model-path /models/llava-v1.5-7b \
    --quant-path /models/llava-v1.5-7b-w4-g128-awq.pt \
    --precision W4A16 \
    --image-file "image1.jpg,image2.jpg" \
    --vis-image \
    --max_seq_len 2048

# Interactive session:
# USER: Describe both images
# ASSISTANT: The first image shows...
# USER: What is different between them?
# ASSISTANT: The key differences are...
# USER: (empty to exit)

Related Pages

Principle:Mit_han_lab_Llm_awq_Interactive_Multimodal_Demo
