Implementation:Mit han lab Llm awq NVILA Benchmark
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Multimodal |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
CLI benchmarking script for NVILA (NVILAQwen2) vision-language models, evaluating performance across four multimodal tasks (video captioning, video QA, image captioning, image QA) with optional LLM and vision tower quantization.
Description
The script is a command-line benchmark for NVILA models. It follows the same structure as the InternVL benchmark script but targets the NVILAQwen2 model architecture.
The main function first disables the PyTorch parameter initialization functions and the HuggingFace _init_weights step to accelerate model loading. It then loads NVILAQwen2 from tinychat.models.nvila_qwen2, instantiating it from an AutoConfig whose resume_path is set to the model directory, and casts the model to half precision.
When --quant_llm or --all is specified, the LLM backbone (accessed via model.llm) undergoes W4A16 quantization using real_quantize_model_weight (4-bit weights, group size 128, zero-point enabled) with init_only=True, followed by fused kernel replacements: make_quant_attn, make_quant_norm, and make_fused_mlp. When --quant_VT or --all is specified, the SigLIP-based vision tower encoder is wrapped with QuantSiglipEncoder from tinychat.modules, targeting the nested path model.vision_tower.vision_tower.vision_model.encoder.
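The arithmetic behind 4-bit, group-size-128, zero-point quantization can be illustrated in plain Python. This is a sketch of the general asymmetric quantization scheme under stated assumptions, not the body of real_quantize_model_weight: the helper names (`quantize_group`, `quantize_row`, `dequantize_group`) are hypothetical, and the real function operates on tensors and packs the codes for the fused kernels.

```python
def quantize_group(weights, n_bits=4):
    """Asymmetric (zero-point) quantization of one group of weights."""
    q_max = 2 ** n_bits - 1                      # 15 for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = max(w_max - w_min, 1e-5) / q_max     # guard against a flat group
    zero = min(max(round(-w_min / scale), 0), q_max)
    codes = [min(max(round(w / scale) + zero, 0), q_max) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero) * scale for c in codes]


def quantize_row(weights, group_size=128, n_bits=4):
    """Quantize a row in independent groups, each with its own scale and zero."""
    if len(weights) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(weights[start:start + group_size], n_bits)
        for start in range(0, len(weights), group_size)
    ]


# Illustrative data: 256 weights in [-1, 1], i.e. two groups of 128.
row = [((i * 37) % 101 - 50) / 50.0 for i in range(256)]
groups = quantize_row(row)
codes, scale, zero = groups[0]
recon = dequantize_group(codes, scale, zero)
max_err = max(abs(a - b) for a, b in zip(recon, row[:128]))
```

Each group stores 128 four-bit codes plus one scale and one zero point, which is where the memory saving over 16-bit weights comes from. Activations stay in 16-bit, hence the name W4A16.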
Each benchmark task constructs a prompt combining Image or Video media objects with task-specific text queries, resets the conversation template via clib.conv_templates, and calls model.benchmark(prompt, quant_llm) under torch.no_grad(). The four tasks are:
- video_caption: "Elaborate on the visual and narrative elements of the video in detail."
- video_QA: Multiple-choice question about actions observed in the video.
- image_caption: "Describe the image in detail."
- image_QA: Multiple-choice question about text content in the image.
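The four tasks above can be viewed as a lookup from task name to a media kind and a text query. The sketch below is illustrative only: `MediaRef`, `TASKS`, and `build_prompt` are hypothetical names, the media-then-text ordering of the prompt is an assumption, and the multiple-choice question texts are not reproduced on this page, so placeholders stand in for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MediaRef:
    """Hypothetical stand-in for the Image/Video media objects."""
    kind: str  # "image" or "video"
    path: str


# Task name -> (media kind, text query).
# The QA question texts are not given in this document; placeholders are used.
TASKS = {
    "video_caption": (
        "video",
        "Elaborate on the visual and narrative elements of the video in detail.",
    ),
    "video_QA": ("video", "<multiple-choice question about the video>"),
    "image_caption": ("image", "Describe the image in detail."),
    "image_QA": ("image", "<multiple-choice question about the image>"),
}


def build_prompt(task, video_path, image_path):
    """Return a [media, text] prompt list for the named task (assumed order)."""
    kind, query = TASKS[task]
    path = video_path if kind == "video" else image_path
    return [MediaRef(kind, path), query]


prompt = build_prompt("image_caption", "demo.mp4", "logo.jpg")
```

In the actual script, a prompt of this shape would be passed to model.benchmark(prompt, quant_llm) inside a torch.no_grad() block after the conversation template has been reset.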
Usage
Run from the command line to benchmark NVILA with various quantization configurations:
# Benchmark all tasks with full quantization
python tinychat/nvila_benchmark.py \
  --model-path /path/to/nvila \
  --quant_path /path/to/quant.pt \
  --all

# Benchmark video tasks only
python tinychat/nvila_benchmark.py \
  --model-path /path/to/nvila \
  --video_caption --video_QA

# Benchmark with LLM quantization only, all tasks
python tinychat/nvila_benchmark.py \
  --model-path /path/to/nvila \
  --quant_llm --all_task
Code Reference
Source Location
- Repository: Mit_han_lab_Llm_awq
- File: tinychat/nvila_benchmark.py
- Lines: 1-163
Signature
def main() -> None:
Import
# CLI script, run directly:
python tinychat/nvila_benchmark.py [OPTIONS]
I/O Contract
CLI Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| --model-path, -m | str | (required) | Path to NVILA model checkpoint |
| --quant_path | str | /PATH/TO/QUANT | Path to quantized weight file |
| --conv-mode, -c | str | auto | Conversation template mode |
| --device | str | cuda:0 | CUDA device |
| --act_scale_path | str | /PATH/TO/SCALE | Path to activation scales |
| --quant_llm | flag | False | Quantize the LLM backbone (W4A16) |
| --quant_VT | flag | False | Quantize the vision tower (SigLIP) |
| --video_caption | flag | False | Run video captioning benchmark |
| --video_QA | flag | False | Run video QA benchmark |
| --image_caption | flag | False | Run image captioning benchmark |
| --image_QA | flag | False | Run image QA benchmark |
| --all | flag | False | Enable all quantization and all tasks |
| --all_task | flag | False | Run all four benchmark tasks |
| --fakequant_VT | flag | False | Use fake quantization for vision tower |
| --video_path | str | ../figures/nvila_demo_video.mp4 | Path to benchmark video |
| --image_path | str | ../figures/vila-logo.jpg | Path to benchmark image |
| --max_seq_len | int | 8192 | Maximum sequence length |
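How the umbrella flags might expand into concrete settings can be sketched with argparse. This covers only a subset of the arguments in the table, and `resolve_flags` is a hypothetical helper; the actual script may organize the same logic differently.

```python
import argparse

TASK_FLAGS = ("video_caption", "video_QA", "image_caption", "image_QA")


def build_parser():
    """Declare a subset of the CLI arguments listed in the table above."""
    parser = argparse.ArgumentParser(description="NVILA benchmark (sketch)")
    parser.add_argument("--model-path", "-m", type=str, required=True)
    parser.add_argument("--quant_llm", action="store_true")
    parser.add_argument("--quant_VT", action="store_true")
    parser.add_argument("--all", action="store_true")
    parser.add_argument("--all_task", action="store_true")
    for flag in TASK_FLAGS:
        parser.add_argument(f"--{flag}", action="store_true")
    return parser


def resolve_flags(args):
    """Expand --all and --all_task into quantization switches and a task list."""
    quant_llm = args.quant_llm or args.all
    quant_vt = args.quant_VT or args.all
    run_all_tasks = args.all or args.all_task
    tasks = [t for t in TASK_FLAGS if run_all_tasks or getattr(args, t)]
    return quant_llm, quant_vt, tasks


parser = build_parser()
args = parser.parse_args(["-m", "/path/to/nvila", "--quant_llm", "--all_task"])
quant_llm, quant_vt, tasks = resolve_flags(args)
```

Under this reading, `--all` implies both quantization switches and every task, while `--all_task` selects every task without changing the quantization settings.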
Output
| Output | Description |
|---|---|
| stdout | Benchmark results printed per task with separator lines; includes model.benchmark() output (timing/throughput metrics) |
Usage Examples
# Full NVILA benchmark with all quantization
python tinychat/nvila_benchmark.py \
  --model-path /models/nvila-8b \
  --quant_path /models/nvila-8b-w4-g128-awq.pt \
  --all \
  --video_path /data/test_video.mp4 \
  --image_path /data/test_image.jpg \
  --max_seq_len 8192

# Quick image-only test
python tinychat/nvila_benchmark.py \
  --model-path /models/nvila-8b \
  --image_caption --image_QA \
  --quant_llm