
Implementation:Mit han lab Llm awq NVILA Demo

From Leeroopedia
Knowledge Sources
Domains: Demo, Multimodal
Last Updated: 2026-02-15 00:00 GMT

Overview

Interactive multimodal chat demo for NVILA models with HuggingFace model/weight downloading, AWQ quantization, smooth quantization of the vision tower, streaming generation, and multi-turn conversation support.

Description

This script provides a complete interactive command-line chat interface for NVILA (NVILAQwen2) vision-language models, with built-in support for downloading quantized weights from HuggingFace Hub.

download_model_file is a utility function that downloads a model file from HuggingFace Hub with hf_hub_download. It takes a repo_id (default: "Efficient-Large-Model/NVILA-AWQ"), a filename, and a local_dir, which it creates if needed. If the file already exists locally it is reused rather than downloaded again, unless force_download is set. The function passes resume_download=True so that interrupted downloads can resume.
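
The caching behavior described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual body: the `fetch` parameter is a hypothetical hook added here so the logic can be exercised without network access, and the exact existence check in the original may differ. The original also passes resume_download=True, which is omitted from this sketch.

```python
import os
from typing import Callable, Optional


def download_model_file(
    repo_id: str = "Efficient-Large-Model/NVILA-AWQ",
    filename: Optional[str] = None,
    local_dir: str = "./hf_cache",
    force_download: bool = False,
    fetch: Optional[Callable[..., str]] = None,  # hypothetical hook, not in the original signature
) -> str:
    """Return an absolute local path for `filename`, downloading only if needed."""
    if filename is None:
        raise ValueError("filename is required")

    # Create the cache directory on first use.
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.abspath(os.path.join(local_dir, filename))

    # Reuse a cached copy unless the caller forces a fresh download.
    if os.path.exists(local_path) and not force_download:
        return local_path

    if fetch is None:
        # Imported lazily so the cache-hit path works without the library.
        from huggingface_hub import hf_hub_download
        fetch = hf_hub_download

    downloaded = fetch(
        repo_id=repo_id,
        filename=filename,
        local_dir=local_dir,
        force_download=force_download,
    )
    return os.path.abspath(downloaded)
```

With the default `fetch`, a call such as `download_model_file(filename="nvila-8b-w4-g128-awq.pt")` would contact the Hub only when the file is not already under `./hf_cache`.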

main handles the complete demo lifecycle:

Model Loading: The NVILAQwen2 model is instantiated from AutoConfig. When --quant_llm or --all is specified, it is created without pretrained LLM weights (NVILAQwen2(config, False)); otherwise, pretrained weights are loaded (NVILAQwen2(config, True)). The model is cast to half precision.

Quantization Pipeline: Three quantization stages are supported. --smooth_VT downloads activation scales via download_model_file and applies smooth quantization to the vision tower using smooth_lm with alpha 0.3. --quant_llm downloads the AWQ checkpoint, creates a fresh Qwen2ForCausalLM, loads quantized weights via load_awq_model (4-bit, group size 128), applies fused kernels (make_quant_attn, make_quant_norm), and resizes token embeddings. --quant_VT wraps the SigLIP encoder with QuantSiglipEncoder or applies fake_quant when --fakequant_VT is set. After quantization, device_warmup and tune_llava_patch_embedding optimize CUDA execution.
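
To make the smoothing step concrete, the sketch below computes per-channel smoothing scales with alpha 0.3. It assumes that smooth_lm follows the usual SmoothQuant rule, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), and that the downloaded activation scales are per-channel absolute maxima. Neither assumption is confirmed by this page. The helper names are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List, Tuple


def smoothing_scales(
    act_absmax: List[float],
    weight: List[List[float]],  # layout [out_features][in_features]
    alpha: float = 0.3,
    eps: float = 1e-5,
) -> List[float]:
    """Per-input-channel scales s_j = act_absmax_j**alpha / w_absmax_j**(1 - alpha)."""
    scales = []
    for j in range(len(act_absmax)):
        # Largest weight magnitude feeding from input channel j.
        w_absmax = max(abs(row[j]) for row in weight)
        s = (max(act_absmax[j], eps) ** alpha) / (max(w_absmax, eps) ** (1.0 - alpha))
        scales.append(max(s, eps))
    return scales


def apply_smoothing(
    x: List[float], weight: List[List[float]], scales: List[float]
) -> Tuple[List[float], List[List[float]]]:
    """Divide activations and multiply weight columns by the same scales."""
    x_s = [xj / s for xj, s in zip(x, scales)]
    w_s = [[wij * s for wij, s in zip(row, scales)] for row in weight]
    return x_s, w_s


def linear(x: List[float], weight: List[List[float]]) -> List[float]:
    """Reference y = W x, used to check that smoothing preserves the output."""
    return [sum(xj * wij for xj, wij in zip(x, row)) for row in weight]
```

Because the activations are divided and the weights multiplied by the same factors, the layer output is unchanged in exact arithmetic. The benefit is that activation outliers are moved into the weights, which makes the activations easier to quantize.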

Media Preparation: The --media argument accepts multiple image (.jpg/.jpeg/.png) or video (.mp4/.mkv/.webm) files. Each is wrapped in Image or Video from llava.media, and model.prepare_media produces tensors and configuration. Optional terminal visualization is available via vis_images.
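
The extension-based dispatch can be illustrated with a small helper. This is a sketch only: `classify_media` is a hypothetical name, the case-insensitive match is an assumption, and the real script wraps each path in Image or Video from llava.media rather than returning a string.

```python
import os

# Extensions listed in the description above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
VIDEO_EXTS = {".mp4", ".mkv", ".webm"}


def classify_media(path: str) -> str:
    """Return "image" or "video" based on the file extension."""
    ext = os.path.splitext(path)[1].lower()  # lowercasing is an assumption
    if ext in IMAGE_EXTS:
        return "image"
    if ext in VIDEO_EXTS:
        return "video"
    raise ValueError(f"Unsupported media type: {path}")
```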

Chat Loop: An interactive loop reads user prompts, builds formatted input via get_prompter and model_prompter, and generates streaming responses via NVILAStreamGenerator. On the first turn, media placeholders are prepended. Multi-turn conversation is supported through model_prompter.update_template, with chunk_prefilling avoiding history recomputation. TimeStats tracks and displays performance metrics. Empty input triggers exit with statistics.
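
The control flow of the loop can be sketched as below. This is a simplified stand-in under stated assumptions: `chat_loop`, `read_prompt`, `generate`, and `write` are hypothetical names, the `"<image>"` placeholder text is assumed, and prompt templating, chunk prefilling, and the TimeStats class are reduced to a plain list of per-turn timings.

```python
import time
from typing import Callable, Iterable, List


def chat_loop(
    read_prompt: Callable[[], str],
    generate: Callable[[str], Iterable[str]],
    num_media: int,
    media_token: str = "<image>",  # assumed placeholder text
    write: Callable[[str], None] = lambda s: print(s, end="", flush=True),
) -> List[float]:
    """Run turns until an empty prompt; return per-turn wall-clock seconds."""
    turn_times: List[float] = []
    first_turn = True
    while True:
        prompt = read_prompt().strip()
        if not prompt:
            break  # empty input ends the session
        if first_turn:
            # Media placeholders are prepended on the first turn only.
            prompt = (media_token + "\n") * num_media + prompt
            first_turn = False
        start = time.perf_counter()
        for chunk in generate(prompt):
            write(chunk)  # stream each piece as soon as it is produced
        write("\n")
        turn_times.append(time.perf_counter() - start)
    return turn_times
```

In an interactive session, `read_prompt` could be `lambda: input("USER: ")`, and `generate` would wrap the model's streaming generator.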

Usage

Run from the command line to start an interactive NVILA chat session:

# Image chat with full quantization (auto-downloads weights)
python tinychat/nvila_demo.py \
    --model-path /path/to/nvila \
    --quant_path nvila-8b-w4-g128-awq.pt \
    --act_scale_path nvila-8b-act-scales.pt \
    --media image.jpg \
    --all \
    --chunk_prefilling

# Video chat without quantization
python tinychat/nvila_demo.py \
    --model-path /path/to/nvila \
    --media video.mp4

Code Reference

Source Location

Signature

def download_model_file(
    repo_id: str = "Efficient-Large-Model/NVILA-AWQ",
    filename: str = None,
    local_dir: str = "./hf_cache",
    force_download: bool = False,
) -> str:

def main(args):

Import

# CLI script, run directly:
python tinychat/nvila_demo.py [OPTIONS]

I/O Contract

download_model_file

Parameter       Type  Default                            Description
repo_id         str   "Efficient-Large-Model/NVILA-AWQ"  HuggingFace Hub repository ID
filename        str   None                               Filename to download from the repo
local_dir       str   "./hf_cache"                       Local directory for cached downloads
force_download  bool  False                              Force re-download even if file exists

Returns     Type  Description
local_path  str   Absolute path to the downloaded file

CLI Arguments

Argument            Type           Default         Description
--model_type        str            LLaMa           Model type for prompt template selection
--model-path        str            (required)      Path to NVILA model checkpoint
--quant_path        str            (path)          AWQ weight file path or HF Hub filename
--act_scale_path    str            /PATH/TO/SCALE  Activation scale file path or HF Hub filename
--media             str (nargs=+)  None            Image or video file paths
--device            str            cuda:0          CUDA device
--max_seq_len       int            2048            Maximum sequence length / KV cache size
--single_round      flag           False           Disable multi-turn conversation memory
--vis-image         flag           False           Visualize input images in terminal
--empty-prompt      flag           False           Use empty prompt template
--flash_attn        flag           False           Enable flash attention
--chunk_prefilling  flag           False           Enable chunk prefilling for multi-turn speedup
--quant_llm         flag           False           Load AWQ-quantized LLM weights
--quant_VT          flag           False           Quantize vision tower (SigLIP encoder)
--smooth_VT         flag           False           Apply smooth quantization to vision tower
--all               flag           False           Enable all quantization options
--fakequant_VT      flag           False           Use fake quantization for vision tower
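
For orientation, the arguments above could be declared with argparse roughly as follows. This is a sketch reconstructed from the table, not the script's actual parser. The --quant_path default is elided in the table, so `None` is used purely as a placeholder. The `effective_quant_flags` helper is hypothetical: it assumes --all turns on --quant_llm, --quant_VT, and --smooth_VT, of which only the --quant_llm case is stated in the description.

```python
import argparse
from typing import Dict


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="NVILA chat demo (sketch)")
    p.add_argument("--model_type", type=str, default="LLaMa")
    p.add_argument("--model-path", type=str, required=True)
    p.add_argument("--quant_path", type=str, default=None)  # placeholder; real default not shown
    p.add_argument("--act_scale_path", type=str, default="/PATH/TO/SCALE")
    p.add_argument("--media", type=str, nargs="+", default=None)
    p.add_argument("--device", type=str, default="cuda:0")
    p.add_argument("--max_seq_len", type=int, default=2048)
    # Boolean switches, all False unless passed on the command line.
    for flag in (
        "--single_round", "--vis-image", "--empty-prompt", "--flash_attn",
        "--chunk_prefilling", "--quant_llm", "--quant_VT", "--smooth_VT",
        "--all", "--fakequant_VT",
    ):
        p.add_argument(flag, action="store_true")
    return p


def effective_quant_flags(args: argparse.Namespace) -> Dict[str, bool]:
    """Hypothetical helper: expand --all into the individual stages (assumed mapping)."""
    return {
        "quant_llm": args.quant_llm or args.all,
        "quant_VT": args.quant_VT or args.all,
        "smooth_VT": args.smooth_VT or args.all,
    }
```

Note that argparse converts dashes to underscores, so `--model-path` and `--vis-image` are read back as `args.model_path` and `args.vis_image`.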

Interactive I/O

Direction  Description
Input      User text prompts via stdin; empty input exits
Output     Streamed assistant responses to stdout with timing statistics on exit

Usage Examples

# Full quantization pipeline with HuggingFace weight download
python tinychat/nvila_demo.py \
    --model_type LLaMa \
    --model-path Efficient-Large-Model/nvila-8b \
    --quant_path nvila-8b-w4-g128-awq.pt \
    --act_scale_path nvila-8b-act-scales.pt \
    --media /data/photo.jpg \
    --all \
    --chunk_prefilling \
    --max_seq_len 2048

# Multi-media session
python tinychat/nvila_demo.py \
    --model-path /models/nvila-8b \
    --media image1.jpg image2.jpg \
    --quant_llm \
    --vis-image
