Implementation:Mit han lab Llm awq NVILA Demo

Knowledge Sources	Mit_han_lab_Llm_awq
Domains	Demo, Multimodal
Last Updated	2026-02-15 00:00 GMT

Overview

Interactive multimodal chat demo for NVILA models with HuggingFace model/weight downloading, AWQ quantization, smooth quantization of the vision tower, streaming generation, and multi-turn conversation support.

Description

This script provides a complete interactive command-line chat interface for NVILA (NVILAQwen2) vision-language models, with built-in support for downloading quantized weights from HuggingFace Hub.

download_model_file is a utility function that downloads a model file from HuggingFace Hub using hf_hub_download. It takes a repo_id (default: "Efficient-Large-Model/NVILA-AWQ"), a filename, and a local_dir (default: "./hf_cache"), creating the local directory if needed. If the file already exists locally, the download is skipped unless force_download is set. The function passes resume_download=True so that interrupted downloads can resume.
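The check-then-download behaviour described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name fetch_model_file and the injectable downloader parameter are hypothetical, added so the logic can be exercised without network access. resume_download is omitted to keep the sketch independent of the installed huggingface_hub version.

```python
import os
from typing import Callable, Optional


def fetch_model_file(
    filename: str,
    repo_id: str = "Efficient-Large-Model/NVILA-AWQ",
    local_dir: str = "./hf_cache",
    force_download: bool = False,
    downloader: Optional[Callable[..., str]] = None,  # hypothetical hook for testing
) -> str:
    """Return an absolute local path for `filename`, downloading only if needed."""
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, filename)

    # Reuse an existing file unless the caller explicitly forces a re-download.
    if os.path.exists(local_path) and not force_download:
        return os.path.abspath(local_path)

    if downloader is None:
        # Imported lazily so the sketch can be read and tested without the package.
        from huggingface_hub import hf_hub_download
        downloader = hf_hub_download

    downloaded = downloader(
        repo_id=repo_id,
        filename=filename,
        local_dir=local_dir,
        force_download=force_download,
    )
    return os.path.abspath(downloaded)
```

Calling fetch_model_file("nvila-8b-w4-g128-awq.pt") twice would contact the Hub only on the first call, assuming the file lands in local_dir under the same name.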

main handles the complete demo lifecycle:

Model Loading: The NVILAQwen2 model is instantiated from AutoConfig. When --quant_llm or --all is specified, it is created without pretrained LLM weights (NVILAQwen2(config, False)); otherwise, pretrained weights are loaded (NVILAQwen2(config, True)). The model is cast to half precision.
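The flag logic behind this choice can be sketched as below. The helper names are hypothetical, and the expansion of --all into the three individual quantization flags is an assumption based on its description ("Enable all quantization options"), not something this page confirms.

```python
from argparse import Namespace


def expand_all_flag(args: Namespace) -> Namespace:
    """ASSUMPTION: --all simply switches on the three quantization flags."""
    if args.all:
        args.quant_llm = True
        args.quant_VT = True
        args.smooth_VT = True
    return args


def load_pretrained_llm(args: Namespace) -> bool:
    """Pretrained LLM weights are skipped when quantized weights will replace them."""
    return not (args.quant_llm or args.all)


# The boolean is what would be passed as the second argument to NVILAQwen2(config, ...).
args = expand_all_flag(
    Namespace(all=True, quant_llm=False, quant_VT=False, smooth_VT=False)
)
use_pretrained = load_pretrained_llm(args)
```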

Quantization Pipeline: Three quantization stages are supported. --smooth_VT downloads activation scales via download_model_file and applies smooth quantization to the vision tower using smooth_lm with alpha 0.3. --quant_llm downloads the AWQ checkpoint, creates a fresh Qwen2ForCausalLM, loads quantized weights via load_awq_model (4-bit, group size 128), applies fused kernels (make_quant_attn, make_quant_norm), and resizes token embeddings. --quant_VT wraps the SigLIP encoder with QuantSiglipEncoder or applies fake_quant when --fakequant_VT is set. After quantization, device_warmup and tune_llava_patch_embedding optimize CUDA execution.
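To illustrate what the alpha value controls, here is a small pure-Python sketch of the standard SmoothQuant scaling rule, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). It assumes that smooth_lm follows this formulation, which this page does not confirm; the helper names are hypothetical, and the real code operates on model layers and precomputed activation scales rather than on nested lists.

```python
def smoothing_scales(act_absmax, weight_absmax, alpha=0.3, eps=1e-5):
    """Per-input-channel scales that shift quantization difficulty from X to W."""
    return [
        max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
        for a, w in zip(act_absmax, weight_absmax)
    ]


def apply_smoothing(x, w, scales):
    """Divide activation columns by s and multiply matching weight rows by s."""
    x_s = [[v / s for v, s in zip(row, scales)] for row in x]
    w_s = [[v * s for v in row] for row, s in zip(w, scales)]
    return x_s, w_s


def matmul(a, b):
    """Plain matrix product for nested lists: a is [m][k], b is [k][n]."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Toy data: activation channel 1 has a large outlier range.
x = [[1.0, 100.0], [2.0, -50.0]]   # [tokens][in_channels]
w = [[0.5, -1.0], [0.02, 0.01]]    # [in_channels][out_channels]

act_absmax = [max(abs(row[j]) for row in x) for j in range(2)]
weight_absmax = [max(abs(v) for v in row) for row in w]

scales = smoothing_scales(act_absmax, weight_absmax, alpha=0.3)
x_s, w_s = apply_smoothing(x, w, scales)
```

The product of x_s and w_s matches the product of x and w up to floating-point error, while the outlier activation channel is scaled down, which is what makes the activations easier to quantize.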

Media Preparation: The --media argument accepts multiple image (.jpg/.jpeg/.png) or video (.mp4/.mkv/.webm) files. Each is wrapped in Image or Video from llava.media, and model.prepare_media produces tensors and configuration. Optional terminal visualization is available via vis_images.
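The extension-based dispatch described above can be sketched like this. The function name classify_media is hypothetical, and lowercasing the suffix is an assumption; the actual script wraps each path in Image or Video from llava.media rather than returning a string.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
VIDEO_EXTS = {".mp4", ".mkv", ".webm"}


def classify_media(path: str) -> str:
    """Return "image" or "video" based on the file extension."""
    suffix = Path(path).suffix.lower()  # ASSUMPTION: matching is case-insensitive
    if suffix in IMAGE_EXTS:
        return "image"
    if suffix in VIDEO_EXTS:
        return "video"
    raise ValueError(f"Unsupported media type: {path}")


kinds = [classify_media(p) for p in ["image1.jpg", "photo.png", "clip.mp4"]]
```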

Chat Loop: An interactive loop reads user prompts, builds formatted input via get_prompter and model_prompter, and streams responses through NVILAStreamGenerator. On the first turn, media placeholders are prepended to the prompt. Multi-turn conversation is maintained through model_prompter.update_template, and --chunk_prefilling avoids recomputing the conversation history on later turns. TimeStats tracks and displays performance metrics. Empty input ends the session and prints the statistics.
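The control flow of the loop can be sketched as follows. Everything here is illustrative: run_chat, the generate callback, and the "<image>" placeholder string are hypothetical stand-ins for the prompter and NVILAStreamGenerator, and prompts are passed in as a list instead of being read from stdin.

```python
from typing import Callable, Iterable, List, Tuple

MEDIA_TOKEN = "<image>"  # hypothetical placeholder; the real token may differ

History = List[Tuple[str, str]]


def run_chat(
    prompts: Iterable[str],
    generate: Callable[[History, str], Iterable[str]],
    num_media: int,
    single_round: bool = False,
) -> List[str]:
    """Drive a multi-turn session; `generate` yields response chunks (streaming)."""
    history: History = []
    replies: List[str] = []
    for turn, text in enumerate(prompts):
        if text == "":
            break  # empty input ends the session
        if turn == 0 and num_media:
            # Media placeholders are only prepended on the first turn.
            text = (MEDIA_TOKEN + "\n") * num_media + text
        context = [] if single_round else history
        reply = "".join(generate(context, text))  # consume the stream chunk by chunk
        history.append((text, reply))
        replies.append(reply)
    return replies
```

Passing single_round=True drops the accumulated history on every turn, mirroring the description of --single_round in the CLI argument table.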

Usage

Run from the command line to start an interactive NVILA chat session:

# Image chat with full quantization (auto-downloads weights)
python tinychat/nvila_demo.py \
    --model-path /path/to/nvila \
    --quant_path nvila-8b-w4-g128-awq.pt \
    --act_scale_path nvila-8b-act-scales.pt \
    --media image.jpg \
    --all \
    --chunk_prefilling

# Video chat without quantization
python tinychat/nvila_demo.py \
    --model-path /path/to/nvila \
    --media video.mp4

Code Reference

Source Location

Repository: Mit_han_lab_Llm_awq
File: tinychat/nvila_demo.py
Lines: 1-272

Signature

def download_model_file(
    repo_id: str = "Efficient-Large-Model/NVILA-AWQ",
    filename: str = None,
    local_dir: str = "./hf_cache",
    force_download: bool = False,
) -> str:

def main(args):

Import

# CLI script, run directly:
python tinychat/nvila_demo.py [OPTIONS]

I/O Contract

download_model_file

Parameter	Type	Default	Description
repo_id	str	"Efficient-Large-Model/NVILA-AWQ"	HuggingFace Hub repository ID
filename	str	None	Filename to download from the repo
local_dir	str	"./hf_cache"	Local directory for cached downloads
force_download	bool	False	Force re-download even if file exists

Returns	Type	Description
local_path	str	Absolute path to the downloaded file

CLI Arguments

Argument	Type	Default	Description
--model_type	str	LLaMa	Model type for prompt template selection
--model-path	str	(required)	Path to NVILA model checkpoint
--quant_path	str	(path)	AWQ weight file path or HF Hub filename
--act_scale_path	str	/PATH/TO/SCALE	Activation scale file path or HF Hub filename
--media	str (nargs=+)	None	Image or video file paths
--device	str	cuda:0	CUDA device
--max_seq_len	int	2048	Maximum sequence length / KV cache size
--single_round	flag	False	Disable multi-turn conversation memory
--vis-image	flag	False	Visualize input images in terminal
--empty-prompt	flag	False	Use empty prompt template
--flash_attn	flag	False	Enable flash attention
--chunk_prefilling	flag	False	Enable chunk prefilling for multi-turn speedup
--quant_llm	flag	False	Load AWQ-quantized LLM weights
--quant_VT	flag	False	Quantize vision tower (SigLIP encoder)
--smooth_VT	flag	False	Apply smooth quantization to vision tower
--all	flag	False	Enable all quantization options
--fakequant_VT	flag	False	Use fake quantization for vision tower
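
For reference, the table above could be expressed with argparse roughly as follows. This is a reconstruction from the table, not a copy of the script: help strings are omitted, --model-path is marked required because the table says so, and the default of --quant_path is left unset because the table elides it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser mirroring the CLI Arguments table."""
    p = argparse.ArgumentParser(description="NVILA interactive chat demo (sketch)")
    p.add_argument("--model_type", type=str, default="LLaMa")
    p.add_argument("--model-path", type=str, required=True)
    p.add_argument("--quant_path", type=str)  # default elided in the table; left unset
    p.add_argument("--act_scale_path", type=str, default="/PATH/TO/SCALE")
    p.add_argument("--media", type=str, nargs="+", default=None)
    p.add_argument("--device", type=str, default="cuda:0")
    p.add_argument("--max_seq_len", type=int, default=2048)
    # All remaining options are boolean switches that default to False.
    for flag in (
        "--single_round", "--vis-image", "--empty-prompt", "--flash_attn",
        "--chunk_prefilling", "--quant_llm", "--quant_VT", "--smooth_VT",
        "--all", "--fakequant_VT",
    ):
        p.add_argument(flag, action="store_true")
    return p


args = build_parser().parse_args(
    ["--model-path", "/models/nvila-8b", "--media", "image1.jpg", "image2.jpg",
     "--all", "--vis-image"]
)
```

Note that argparse converts dashes in option names to underscores, so --model-path and --vis-image are read back as args.model_path and args.vis_image.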

Interactive I/O

Direction	Description
Input	User text prompts via stdin; empty input exits
Output	Streamed assistant responses to stdout with timing statistics on exit

Usage Examples

# Full quantization pipeline with HuggingFace weight download
python tinychat/nvila_demo.py \
    --model_type LLaMa \
    --model-path Efficient-Large-Model/nvila-8b \
    --quant_path nvila-8b-w4-g128-awq.pt \
    --act_scale_path nvila-8b-act-scales.pt \
    --media /data/photo.jpg \
    --all \
    --chunk_prefilling \
    --max_seq_len 2048

# Multi-media session
python tinychat/nvila_demo.py \
    --model-path /models/nvila-8b \
    --media image1.jpg image2.jpg \
    --quant_llm \
    --vis-image

Related Pages

Principle:Mit_han_lab_Llm_awq_Interactive_Multimodal_Demo
