Implementation:Mit han lab Llm awq NVILA Benchmark

Knowledge Sources	Mit_han_lab_Llm_awq
Domains	Benchmarking, Multimodal
Last Updated	2026-02-15 00:00 GMT

Overview

A CLI benchmarking script for NVILA (NVILAQwen2) vision-language models. It runs four multimodal tasks (video captioning, video QA, image captioning, image QA) with optional LLM and vision tower quantization, and prints the timing and throughput metrics reported by model.benchmark().

Description

This script provides a command-line interface for benchmarking NVILA models on four standard multimodal tasks, structured identically to the InternVL benchmark but targeting the NVILAQwen2 model architecture.

The main function begins by disabling PyTorch parameter initialization functions and HuggingFace _init_weights to accelerate model loading. It loads the model as NVILAQwen2 from tinychat.models.nvila_qwen2, instantiated from an AutoConfig with resume_path set to the model directory. The model is cast to half precision.

When --quant_llm or --all is specified, the LLM backbone (accessed via model.llm) is quantized to W4A16 by calling real_quantize_model_weight with 4-bit weights, group size 128, zero-point enabled, and init_only=True. Three fused kernel replacements are then applied: make_quant_attn, make_quant_norm, and make_fused_mlp.

When --quant_VT or --all is specified, the SigLIP-based vision tower encoder at the nested path model.vision_tower.vision_tower.vision_model.encoder is wrapped with QuantSiglipEncoder from tinychat.modules.
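
To make the quantization settings concrete (4-bit weights, group size 128, zero-point enabled), the arithmetic can be sketched in plain Python. This is a textbook illustration of group-wise asymmetric quantization, not the code of real_quantize_model_weight: the function names quantize_group, dequantize_group, and quantize_row are hypothetical, and the packed storage and fused kernels used by the library are not reproduced here.

```python
import math


def quantize_group(values, n_bits=4):
    """Asymmetric (zero-point) quantization of one group of floats.

    Hypothetical helper for illustration only.
    """
    q_max = 2 ** n_bits - 1                      # 15 for 4-bit codes
    lo, hi = min(values), max(values)
    scale = max(hi - lo, 1e-5) / q_max           # step size shared by the group
    zero = min(max(-round(lo / scale), 0), q_max)  # integer offset that maps lo near code 0
    codes = [min(max(round(v / scale) + zero, 0), q_max) for v in values]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float values."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row, group_size=128, n_bits=4):
    """Quantize one weight row; each group gets its own scale and zero point."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], n_bits)
        for i in range(0, len(row), group_size)
    ]


# Example: a 256-element row yields two groups of 128 values each.
row = [math.sin(i) for i in range(256)]
groups = quantize_row(row)
```

Each group stores 128 four-bit codes plus one scale and one zero point, which is where the memory saving comes from. In a W4A16 scheme the activations stay in 16-bit floating point, so only the weights go through this mapping.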

Each benchmark task constructs a prompt combining Image or Video media objects with task-specific text queries, resets the conversation template via clib.conv_templates, and calls model.benchmark(prompt, quant_llm) under torch.no_grad(). The four tasks are:

video_caption: "Elaborate on the visual and narrative elements of the video in detail."
video_QA: Multiple-choice question about observed actions in video.
image_caption: "Describe the image in detail."
image_QA: Multiple-choice question about text content in image.
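
The way the flags combine can be sketched as a small resolution step. This is an assumption-laden illustration based only on the flag descriptions on this page (--all enables all quantization and all tasks, --all_task enables all four tasks); the Flags dataclass and resolve_plan function are hypothetical names and do not appear in the script.

```python
from dataclasses import dataclass

# Task names as listed on this page.
TASKS = ("video_caption", "video_QA", "image_caption", "image_QA")


@dataclass
class Flags:
    """Hypothetical stand-in for the parsed CLI flags."""
    quant_llm: bool = False
    quant_VT: bool = False
    video_caption: bool = False
    video_QA: bool = False
    image_caption: bool = False
    image_QA: bool = False
    all: bool = False
    all_task: bool = False


def resolve_plan(flags: Flags) -> dict:
    """Expand --all and --all_task into the effective quantization and task plan."""
    run_every_task = flags.all or flags.all_task
    return {
        "quant_llm": flags.quant_llm or flags.all,
        "quant_VT": flags.quant_VT or flags.all,
        "tasks": [t for t in TASKS if run_every_task or getattr(flags, t)],
    }


# Example: LLM quantization only, all four tasks.
plan = resolve_plan(Flags(quant_llm=True, all_task=True))
```

Under these assumptions, passing no task flag and neither --all nor --all_task results in an empty task list.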

Usage

Run from the command line to benchmark NVILA with various quantization configurations:

# Benchmark all tasks with full quantization
python tinychat/nvila_benchmark.py \
    --model-path /path/to/nvila \
    --quant_path /path/to/quant.pt \
    --all

# Benchmark video tasks only
python tinychat/nvila_benchmark.py \
    --model-path /path/to/nvila \
    --video_caption --video_QA

# Benchmark with LLM quantization only, all tasks
python tinychat/nvila_benchmark.py \
    --model-path /path/to/nvila \
    --quant_llm --all_task

Code Reference

Source Location

Repository: Mit_han_lab_Llm_awq
File: tinychat/nvila_benchmark.py
Lines: 1-163

Signature

def main() -> None:

Import

# CLI script, run directly:
python tinychat/nvila_benchmark.py [OPTIONS]

I/O Contract

CLI Arguments

Argument	Type	Default	Description
--model-path, -m	str	(required)	Path to NVILA model checkpoint
--quant_path	str	/PATH/TO/QUANT	Path to quantized weight file
--conv-mode, -c	str	auto	Conversation template mode
--device	str	cuda:0	CUDA device
--act_scale_path	str	/PATH/TO/SCALE	Path to activation scales
--quant_llm	flag	False	Quantize the LLM backbone (W4A16)
--quant_VT	flag	False	Quantize the vision tower (SigLIP)
--video_caption	flag	False	Run video captioning benchmark
--video_QA	flag	False	Run video QA benchmark
--image_caption	flag	False	Run image captioning benchmark
--image_QA	flag	False	Run image QA benchmark
--all	flag	False	Enable all quantization and all tasks
--all_task	flag	False	Run all four benchmark tasks
--fakequant_VT	flag	False	Use fake quantization for vision tower
--video_path	str	../figures/nvila_demo_video.mp4	Path to benchmark video
--image_path	str	../figures/vila-logo.jpg	Path to benchmark image
--max_seq_len	int	8192	Maximum sequence length
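
For readers who want to script against the same interface, the table above can be approximated with argparse. This is a reconstruction from the documented names, types, and defaults, not a copy of the parser in tinychat/nvila_benchmark.py; the function name build_parser is hypothetical, and flags are assumed to be store_true switches.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented CLI surface (sketch, not the script's own parser)."""
    p = argparse.ArgumentParser(description="NVILA benchmark CLI (sketch)")
    p.add_argument("--model-path", "-m", type=str, required=True)
    p.add_argument("--quant_path", type=str, default="/PATH/TO/QUANT")
    p.add_argument("--conv-mode", "-c", type=str, default="auto")
    p.add_argument("--device", type=str, default="cuda:0")
    p.add_argument("--act_scale_path", type=str, default="/PATH/TO/SCALE")
    # Boolean switches: quantization options, task selectors, and shortcuts.
    for flag in (
        "quant_llm", "quant_VT",
        "video_caption", "video_QA", "image_caption", "image_QA",
        "all", "all_task", "fakequant_VT",
    ):
        p.add_argument(f"--{flag}", action="store_true")
    p.add_argument("--video_path", type=str,
                   default="../figures/nvila_demo_video.mp4")
    p.add_argument("--image_path", type=str,
                   default="../figures/vila-logo.jpg")
    p.add_argument("--max_seq_len", type=int, default=8192)
    return p


# Example: the "quick image-only test" invocation, parsed without running anything.
args = build_parser().parse_args(
    ["--model-path", "/models/nvila-8b", "--image_caption", "--image_QA", "--quant_llm"]
)
```

Note that argparse converts the dashes in --model-path and --conv-mode to underscores, so the parsed values are read as args.model_path and args.conv_mode.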

Output

Output	Description
stdout	Benchmark results printed per task with separator lines; includes model.benchmark() output (timing/throughput metrics)

Usage Examples

# Full NVILA benchmark with all quantization
python tinychat/nvila_benchmark.py \
    --model-path /models/nvila-8b \
    --quant_path /models/nvila-8b-w4-g128-awq.pt \
    --all \
    --video_path /data/test_video.mp4 \
    --image_path /data/test_image.jpg \
    --max_seq_len 8192

# Quick image-only test
python tinychat/nvila_benchmark.py \
    --model-path /models/nvila-8b \
    --image_caption --image_QA \
    --quant_llm

Related Pages

Principle:Mit_han_lab_Llm_awq_VLM_Benchmarking
