Implementation:Mit han lab Llm awq NVILA Demo
| Knowledge Sources | |
|---|---|
| Domains | Demo, Multimodal |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Interactive multimodal chat demo for NVILA models with HuggingFace model/weight downloading, AWQ quantization, smooth quantization of the vision tower, streaming generation, and multi-turn conversation support.
Description
This script provides a complete interactive command-line chat interface for NVILA (NVILAQwen2) vision-language models, with built-in support for downloading quantized weights from HuggingFace Hub.
download_model_file is a utility function that downloads a model file from HuggingFace Hub using hf_hub_download. It takes a repo_id (default: "Efficient-Large-Model/NVILA-AWQ"), a filename, and a local_dir, and creates the local directory if it does not exist. Unless force_download is set, it checks for an existing local file and reuses it instead of downloading again. The function passes resume_download=True so that interrupted downloads can be resumed.
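The caching behavior described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the name download_model_file_sketch is hypothetical, the existence check on local_dir/filename is an assumption about how "existing files" are detected, and resume_download is left out because recent huggingface_hub releases mark that argument as deprecated.

```python
import os


def download_model_file_sketch(
    repo_id="Efficient-Large-Model/NVILA-AWQ",
    filename=None,
    local_dir="./hf_cache",
    force_download=False,
):
    """Return a local path for `filename`, downloading it only when needed."""
    if not filename:
        raise ValueError("filename is required")

    # Create the cache directory on first use.
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.abspath(os.path.join(local_dir, filename))

    # Reuse an existing copy unless the caller explicitly asks for a refresh.
    if os.path.exists(local_path) and not force_download:
        return local_path

    # Imported lazily so the cached branch above works without the library.
    from huggingface_hub import hf_hub_download

    downloaded = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        local_dir=local_dir,
        force_download=force_download,
    )
    return os.path.abspath(downloaded)
```

With this shape, a second call for the same filename returns immediately without touching the network, which is the redundancy check the description refers to.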
main handles the complete demo lifecycle:
Model Loading: The NVILAQwen2 model is instantiated from AutoConfig. When --quant_llm or --all is specified, it is created without pretrained LLM weights (NVILAQwen2(config, False)); otherwise, pretrained weights are loaded (NVILAQwen2(config, True)). The model is cast to half precision.
Quantization Pipeline: Three quantization stages are supported. --smooth_VT downloads activation scales via download_model_file and applies smooth quantization to the vision tower using smooth_lm with alpha 0.3. --quant_llm downloads the AWQ checkpoint, creates a fresh Qwen2ForCausalLM, loads quantized weights via load_awq_model (4-bit, group size 128), applies fused kernels (make_quant_attn, make_quant_norm), and resizes token embeddings. --quant_VT wraps the SigLIP encoder with QuantSiglipEncoder or applies fake_quant when --fakequant_VT is set. After quantization, device_warmup and tune_llava_patch_embedding optimize CUDA execution.
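To make the smoothing step more concrete, the sketch below shows the general SmoothQuant-style scaling in plain Python: each input channel gets a scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), activations are divided by it, and the matching weight columns are multiplied by it. It is assumed here that smooth_lm follows this general scheme; the helper names and the tiny example tensors are illustrative only and do not come from the script.

```python
def smoothing_scales(act_absmax, weight, alpha=0.3, eps=1e-5):
    """Per-input-channel scales s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).

    act_absmax: per-channel max |activation| from calibration (length n_in).
    weight:     linear weight stored as rows, shape [n_out][n_in].
    """
    n_in = len(act_absmax)
    w_absmax = [max(abs(row[j]) for row in weight) for j in range(n_in)]
    return [
        max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
        for a, w in zip(act_absmax, w_absmax)
    ]


def apply_smoothing(x, weight, scales):
    """Divide activations by s and multiply the matching weight columns by s."""
    x_s = [xj / s for xj, s in zip(x, scales)]
    w_s = [[wij * s for wij, s in zip(row, scales)] for row in weight]
    return x_s, w_s


def linear(x, weight):
    """Compute y_i = sum_j W[i][j] * x[j]."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in weight]


x = [8.0, 0.5, -2.0]  # one activation vector with an outlier in channel 0
W = [[0.1, -0.4, 0.2], [0.3, 0.2, -0.1]]
s = smoothing_scales([abs(v) for v in x], W, alpha=0.3)
x_s, W_s = apply_smoothing(x, W, s)

# The layer output is unchanged up to floating-point rounding.
y, y_s = linear(x, W), linear(x_s, W_s)
assert all(abs(a - b) < 1e-9 for a, b in zip(y, y_s))
```

A smaller alpha moves less of the quantization difficulty from activations to weights; the value 0.3 is the one the description gives for the vision tower.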
Media Preparation: The --media argument accepts multiple image (.jpg/.jpeg/.png) or video (.mp4/.mkv/.webm) files. Each is wrapped in Image or Video from llava.media, and model.prepare_media produces tensors and configuration. Optional terminal visualization is available via vis_images.
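The extension-based dispatch described above might look like the following sketch. ImageStub and VideoStub are hypothetical stand-ins for the Image and Video classes from llava.media, and the case-insensitive match and the ValueError for unknown extensions are assumptions rather than confirmed behavior of the script.

```python
import os
from dataclasses import dataclass

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
VIDEO_EXTS = {".mp4", ".mkv", ".webm"}


@dataclass
class ImageStub:  # hypothetical stand-in for llava.media.Image
    path: str


@dataclass
class VideoStub:  # hypothetical stand-in for llava.media.Video
    path: str


def wrap_media(paths):
    """Map each path to an image or video wrapper based on its extension."""
    wrapped = []
    for path in paths or []:
        ext = os.path.splitext(path)[1].lower()
        if ext in IMAGE_EXTS:
            wrapped.append(ImageStub(path))
        elif ext in VIDEO_EXTS:
            wrapped.append(VideoStub(path))
        else:
            raise ValueError(f"Unsupported media type: {path}")
    return wrapped
```

In the real script the wrapped objects would then be handed to model.prepare_media; that step is omitted here because it depends on the loaded model.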
Chat Loop: An interactive loop reads user prompts, builds the formatted input via get_prompter and model_prompter, and streams the response via NVILAStreamGenerator. On the first turn, media placeholders are prepended to the prompt. Multi-turn conversation is supported through model_prompter.update_template, and chunk_prefilling avoids recomputing the conversation history on each turn. TimeStats tracks and displays performance metrics. An empty input ends the session and prints the statistics.
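The control flow of the chat loop can be outlined as below. This is a simplified sketch with injected callables so that it runs without a model: read_prompt stands in for reading stdin, generate stands in for the streaming generator, and the "<image>" placeholder used in the usage comment is a hypothetical token. Prompt templating, chunk prefilling, and timing are left out.

```python
def chat_loop(read_prompt, generate, media_tokens, single_round=False, write=print):
    """Run turns until an empty prompt is read; return the number of turns.

    read_prompt:  callable returning the next user prompt (e.g. input).
    generate:     callable(history, prompt) yielding reply chunks.
    media_tokens: placeholder strings prepended on the first turn only.
    """
    history = []  # (prompt, reply) pairs kept across turns
    turns = 0
    while True:
        prompt = read_prompt()
        if not prompt.strip():
            break  # empty input ends the session
        if turns == 0:
            prompt = "".join(media_tokens) + prompt
        # In single-round mode the model sees no earlier turns.
        context = [] if single_round else history
        reply_chunks = []
        for chunk in generate(context, prompt):
            write(chunk, end="", flush=True)  # stream chunks as they arrive
            reply_chunks.append(chunk)
        write()
        history.append((prompt, "".join(reply_chunks)))
        turns += 1
    return turns


# Example wiring (not executed here):
# chat_loop(lambda: input("USER: "), my_generate, ["<image>"])
```

Passing the reader and generator in as arguments keeps the loop testable without a GPU; the actual script wires these pieces together directly inside main.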
Usage
Run the script from the command line to start an interactive NVILA chat session:
# Image chat with full quantization (auto-downloads weights)
python tinychat/nvila_demo.py \
    --model-path /path/to/nvila \
    --quant_path nvila-8b-w4-g128-awq.pt \
    --act_scale_path nvila-8b-act-scales.pt \
    --media image.jpg \
    --all \
    --chunk_prefilling
# Video chat without quantization
python tinychat/nvila_demo.py \
    --model-path /path/to/nvila \
    --media video.mp4
Code Reference
Source Location
- Repository: Mit_han_lab_Llm_awq
- File: tinychat/nvila_demo.py
- Lines: 1-272
Signature
def download_model_file(
    repo_id: str = "Efficient-Large-Model/NVILA-AWQ",
    filename: str = None,
    local_dir: str = "./hf_cache",
    force_download: bool = False,
) -> str:
    ...

def main(args):
    ...
Import
# CLI script, run directly:
python tinychat/nvila_demo.py [OPTIONS]
I/O Contract
download_model_file
| Parameter | Type | Default | Description |
|---|---|---|---|
| repo_id | str | "Efficient-Large-Model/NVILA-AWQ" | HuggingFace Hub repository ID |
| filename | str | None | Filename to download from the repo |
| local_dir | str | "./hf_cache" | Local directory for cached downloads |
| force_download | bool | False | Force re-download even if file exists |
| Returns | Type | Description |
|---|---|---|
| local_path | str | Absolute path to the downloaded file |
CLI Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| --model_type | str | LLaMa | Model type for prompt template selection |
| --model-path | str | (required) | Path to NVILA model checkpoint |
| --quant_path | str | (path) | AWQ weight file path or HF Hub filename |
| --act_scale_path | str | /PATH/TO/SCALE | Activation scale file path or HF Hub filename |
| --media | str (nargs=+) | None | Image or video file paths |
| --device | str | cuda:0 | CUDA device |
| --max_seq_len | int | 2048 | Maximum sequence length / KV cache size |
| --single_round | flag | False | Disable multi-turn conversation memory |
| --vis-image | flag | False | Visualize input images in terminal |
| --empty-prompt | flag | False | Use empty prompt template |
| --flash_attn | flag | False | Enable flash attention |
| --chunk_prefilling | flag | False | Enable chunk prefilling for multi-turn speedup |
| --quant_llm | flag | False | Load AWQ-quantized LLM weights |
| --quant_VT | flag | False | Quantize vision tower (SigLIP encoder) |
| --smooth_VT | flag | False | Apply smooth quantization to vision tower |
| --all | flag | False | Enable all quantization options |
| --fakequant_VT | flag | False | Use fake quantization for vision tower |
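The argument table above maps onto argparse roughly as in the following sketch. It is reconstructed from the table only, so several points are assumptions: the default for --quant_path is elided in the table and is set to None as a placeholder, --all is assumed to switch on exactly the three quantization stages (--quant_llm, --quant_VT, --smooth_VT), and load_pretrained_llm is a hypothetical attribute name used to show the model-loading decision.

```python
import argparse


def build_parser():
    p = argparse.ArgumentParser(description="NVILA chat demo (sketch)")
    p.add_argument("--model_type", type=str, default="LLaMa")
    p.add_argument("--model-path", type=str, required=True)
    # The table elides the real default for --quant_path; None is a placeholder.
    p.add_argument("--quant_path", type=str, default=None)
    p.add_argument("--act_scale_path", type=str, default="/PATH/TO/SCALE")
    p.add_argument("--media", type=str, nargs="+", default=None)
    p.add_argument("--device", type=str, default="cuda:0")
    p.add_argument("--max_seq_len", type=int, default=2048)
    # All remaining options are boolean switches that default to False.
    for flag in (
        "--single_round", "--vis-image", "--empty-prompt", "--flash_attn",
        "--chunk_prefilling", "--quant_llm", "--quant_VT", "--smooth_VT",
        "--all", "--fakequant_VT",
    ):
        p.add_argument(flag, action="store_true")
    return p


def resolve_quant_flags(args):
    """Assumption: --all simply switches on the three quantization stages."""
    if args.all:
        args.quant_llm = args.quant_VT = args.smooth_VT = True
    # Pretrained LLM weights are skipped when AWQ weights will replace them.
    args.load_pretrained_llm = not args.quant_llm
    return args
```

Note that argparse converts dashes to underscores in attribute names, so --model-path, --vis-image, and --empty-prompt are read as args.model_path, args.vis_image, and args.empty_prompt.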
Interactive I/O
| Direction | Description |
|---|---|
| Input | User text prompts via stdin; empty input exits |
| Output | Streamed assistant responses to stdout with timing statistics on exit |
Usage Examples
# Full quantization pipeline with HuggingFace weight download
python tinychat/nvila_demo.py \
    --model_type LLaMa \
    --model-path Efficient-Large-Model/nvila-8b \
    --quant_path nvila-8b-w4-g128-awq.pt \
    --act_scale_path nvila-8b-act-scales.pt \
    --media /data/photo.jpg \
    --all \
    --chunk_prefilling \
    --max_seq_len 2048
# Multi-media session
python tinychat/nvila_demo.py \
    --model-path /models/nvila-8b \
    --media image1.jpg image2.jpg \
    --quant_llm \
    --vis-image