
Implementation:Mit han lab Llm awq VILA10 Demo

From Leeroopedia
Knowledge Sources
Domains: Demo, Multimodal
Last Updated: 2026-02-15 00:00 GMT

Overview

Interactive multimodal chat demo for VILA 1.0 models with W16A16 and W4A16 precision support, device warmup, image processing via LLaVA utilities, and streaming text generation.

Description

This script provides an interactive command-line chat interface for VILA 1.0 (LLaVA-based) vision-language models, supporting both full-precision and AWQ-quantized inference modes.

image_parser splits the --image-file argument string on the separator (--im-sep, default comma) to produce a list of image file paths, enabling multiple images to be specified in a single argument.
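The splitting behavior can be sketched as follows. This is a minimal illustration, assuming args is an argparse.Namespace carrying the image_file and im_sep attributes listed in the I/O contract; the function in the script may differ in detail, and the example values are hypothetical.

```python
from argparse import Namespace


def image_parser(args):
    # Split the single --image-file string on the separator,
    # producing one path (or URL) per image.
    return args.image_file.split(args.im_sep)


# Hypothetical arguments, for illustration only.
example_args = Namespace(image_file="image1.jpg,image2.jpg", im_sep=",")
image_paths = image_parser(example_args)
print(image_paths)
```

A single path without a separator yields a one-element list, so downstream code can always iterate over the result.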

main handles the complete demo lifecycle:

Model Loading: The LlavaLlamaForCausalLM model is instantiated from its AutoConfig with the default parameter initialization functions disabled, which speeds up construction because the weights are overwritten when the checkpoint is loaded. The tokenizer is loaded separately and is used to convert LLAVA_DEFAULT_IMAGE_PATCH_TOKEN into its token ID. The vision tower is loaded if it is not already initialized, and the image_processor is taken from it. The model and vision tower are then cast to half precision.

Precision Modes: Two modes are supported. W16A16 loads the unquantized FP16 checkpoint using load_checkpoint_and_dispatch from the Accelerate library with module-level sharding (covering the OPT, LLaMA, Bloom, MPT, and CLIP layer classes). W4A16 loads AWQ-quantized weights via load_awq_model (4-bit, group size 128) and then applies two fused kernel replacements, make_quant_attn and make_quant_norm. Any other precision value raises NotImplementedError.
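The branching described above can be sketched as a small dispatch function. This is an illustration only: loading_steps is a hypothetical helper that returns the names of the calls made on each path rather than invoking them, so it runs without a GPU or model weights.

```python
def loading_steps(precision):
    """Return the names of the loading calls a precision string selects.

    Hypothetical helper: the real script calls these functions directly.
    """
    if precision == "W16A16":
        # Unquantized FP16 path, dispatched by Accelerate.
        return ("load_checkpoint_and_dispatch",)
    if precision == "W4A16":
        # AWQ path: load 4-bit weights, then swap in the fused kernels.
        return ("load_awq_model", "make_quant_attn", "make_quant_norm")
    raise NotImplementedError(f"Precision {precision} is not supported.")


print(loading_steps("W4A16"))
```

Raising on unknown values, instead of falling back to a default, makes a mistyped --precision fail before any weights are loaded.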

Image Preparation: Images are loaded via load_images, optionally visualized in the terminal with vis_images, and preprocessed using process_images (which applies square padding and the model's image processor). The resulting tensor is moved to the target device in float16 format. device_warmup and tune_llava_patch_embedding optimize CUDA execution for the vision tower.
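The square padding step can be illustrated with plain arithmetic. The sketch below assumes the image is centered on a square canvas whose side equals its longer edge; square_pad_geometry is a hypothetical helper, not a function from the script or from LLaVA, and it computes only the geometry, not the pixel data.

```python
def square_pad_geometry(width, height):
    """Return (side, x_offset, y_offset) for centering an image on a square canvas."""
    side = max(width, height)
    # Only the shorter dimension receives padding; the longer one has offset 0.
    x_offset = (side - width) // 2
    y_offset = (side - height) // 2
    return side, x_offset, y_offset


# A 640x480 landscape image is padded vertically to 640x640.
print(square_pad_geometry(640, 480))
```

Padding to a square before resizing preserves the aspect ratio of the original content, at the cost of some filler pixels along the shorter dimension.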

Chat Loop: An interactive loop reads user prompts, handles image token insertion (detecting <image> placeholders in the prompt, or prepending image tokens when none are present), and generates streaming responses via LlavaStreamGenerator. After each turn, model_prompter.update_template records the exchange for multi-turn conversation, but only when --single_round is not set and the KV cache size (--max_seq_len) exceeds 512 tokens. TimeStats collects and displays timing information. Entering an empty prompt exits the loop and prints the collected statistics.
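Two decisions in this loop can be sketched in isolation. Both helpers below are hypothetical and make assumptions: that one placeholder followed by a newline is prepended per image, and that the 512-token threshold is compared against --max_seq_len. The script's exact token strings and conditions may differ.

```python
IMAGE_PLACEHOLDER = "<image>"


def add_image_tokens(prompt, num_images):
    """Prepend one placeholder per image unless the prompt already contains one."""
    if IMAGE_PLACEHOLDER in prompt:
        # The user positioned the images explicitly; leave the prompt unchanged.
        return prompt
    return (IMAGE_PLACEHOLDER + "\n") * num_images + prompt


def keeps_history(single_round, max_seq_len):
    """History is kept only in multi-turn mode with a large enough cache."""
    return (not single_round) and max_seq_len > 512


print(repr(add_image_tokens("Describe both images", 2)))
print(keeps_history(single_round=False, max_seq_len=2048))
```

Skipping history when the cache is small avoids filling a short context window with earlier turns, so each prompt is answered on its own.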

Usage

Run from the command line to start an interactive VILA 1.0 chat session:

# W4A16 quantized inference with image
python tinychat/vila10_demo.py \
    --model-path /path/to/llava-v1.5-7b \
    --quant-path /path/to/llava-v1.5-7b-w4-g128-awq.pt \
    --precision W4A16 \
    --image-file image.jpg

# Full precision inference
python tinychat/vila10_demo.py \
    --model-path /path/to/llava-v1.5-7b \
    --precision W16A16 \
    --image-file https://example.com/image.jpg

Code Reference

Source Location

Signature

def image_parser(args):

def main(args):

Import

# CLI script, run directly:
python tinychat/vila10_demo.py [OPTIONS]

I/O Contract

image_parser

| Parameter | Type | Description |
|---|---|---|
| args | Namespace | Parsed CLI args with image_file (str) and im_sep (str) attributes |

| Returns | Type | Description |
|---|---|---|
| image_paths | list[str] | List of image file paths split on separator |

CLI Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --model_type | str | LLaMa | Model type for prompt template selection |
| --model-path | str | (required) | Path to LLaVA/VILA 1.0 model checkpoint |
| --quant-path | str | (path) | Path to AWQ quantized weight file |
| --precision | str | W4A16 | Compute precision: "W16A16" or "W4A16" |
| --image-file | str | (URL) | Comma-separated image paths or URLs |
| --im-sep | str | "," | Separator for multiple image paths |
| --device | str | cuda | CUDA device |
| --max_seq_len | int | 2048 | Maximum sequence length / KV cache size |
| --single_round | flag | False | Disable multi-turn conversation memory |
| --vis-image | flag | False | Visualize input images in terminal |
| --empty-prompt | flag | False | Use empty prompt template |
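The argument list above maps onto argparse roughly as follows. This is a sketch reconstructed from the table rather than the script's own parser; in particular, the defaults for --quant-path and --image-file are not reproduced here, and None is used as a placeholder assumption.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="VILA 1.0 chat demo (sketch)")
    parser.add_argument("--model_type", type=str, default="LLaMa")
    parser.add_argument("--model-path", type=str, required=True)
    # Placeholder defaults: the script's real default path and URL are not shown.
    parser.add_argument("--quant-path", type=str, default=None)
    parser.add_argument("--image-file", type=str, default=None)
    parser.add_argument("--precision", type=str, default="W4A16")
    parser.add_argument("--im-sep", type=str, default=",")
    parser.add_argument("--device", type=str, default="cuda")
    parser.add_argument("--max_seq_len", type=int, default=2048)
    parser.add_argument("--single_round", action="store_true")
    parser.add_argument("--vis-image", action="store_true")
    parser.add_argument("--empty-prompt", action="store_true")
    return parser


# Hypothetical command line, for illustration only.
args = build_parser().parse_args(
    ["--model-path", "/path/to/model", "--image-file", "image1.jpg,image2.jpg"]
)
print(args.precision, args.max_seq_len, args.image_file.split(args.im_sep))
```

argparse converts hyphens in option names to underscores on the parsed namespace, so --model-path is read as args.model_path and --im-sep as args.im_sep, which is how mixed hyphen and underscore spellings coexist on the command line.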

Interactive I/O

| Direction | Description |
|---|---|
| Input | User text prompts via stdin; <image> placeholders supported; empty input exits |
| Output | Streamed assistant responses to stdout with timing statistics on exit |

Usage Examples

# Quantized VILA 1.0 with multiple images
python tinychat/vila10_demo.py \
    --model_type LLaMa \
    --model-path /models/llava-v1.5-7b \
    --quant-path /models/llava-v1.5-7b-w4-g128-awq.pt \
    --precision W4A16 \
    --image-file "image1.jpg,image2.jpg" \
    --vis-image \
    --max_seq_len 2048

# Interactive session:
# USER: Describe both images
# ASSISTANT: The first image shows...
# USER: What is different between them?
# ASSISTANT: The key differences are...
# USER: (empty to exit)
