Implementation:Mit han lab Llm awq NVILA Benchmark

From Leeroopedia
Knowledge Sources
Domains: Benchmarking, Multimodal
Last Updated: 2026-02-15 00:00 GMT

Overview

A CLI benchmarking script for NVILA (NVILAQwen2) vision-language models. It measures performance on four multimodal tasks (video captioning, video QA, image captioning, image QA), with optional quantization of the LLM and the vision tower.

Description

This script provides a command-line interface for benchmarking NVILA models on these four multimodal tasks. It is structured identically to the InternVL benchmark but targets the NVILAQwen2 model architecture.

The main function begins by disabling PyTorch's parameter initialization functions and HuggingFace's _init_weights, so that model loading does not spend time on random initialization. It then loads the model as NVILAQwen2 from tinychat.models.nvila_qwen2, instantiated from an AutoConfig with resume_path set to the model directory, and casts it to half precision.

When --quant_llm or --all is specified, the LLM backbone (accessed via model.llm) is prepared for W4A16 inference (4-bit weights, 16-bit activations). real_quantize_model_weight is called with 4-bit weights, group size 128, zero-point enabled, and init_only=True, and fused kernels are then swapped in through make_quant_attn, make_quant_norm, and make_fused_mlp. When --quant_VT or --all is specified, the SigLIP-based vision tower encoder at the nested path model.vision_tower.vision_tower.vision_model.encoder is wrapped with QuantSiglipEncoder from tinychat.modules.
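To make the "4-bit weights, group size 128, zero-point enabled" setting concrete, here is a minimal pure-Python sketch of group-wise asymmetric quantization. It is not the code of real_quantize_model_weight, which operates on tensors and packs the integers for fused kernels; the function names and the toy weights below are assumptions for illustration only.

```python
import math


def quantize_group(weights, n_bits=4):
    """Asymmetric (zero-point) quantization of one group of floats."""
    max_int = 2 ** n_bits - 1                     # 15 for 4-bit weights
    w_min, w_max = min(weights), max(weights)
    scale = max(w_max - w_min, 1e-5) / max_int    # guard against a flat group
    zero = min(max(round(-w_min / scale), 0), max_int)
    q = [min(max(round(w / scale) + zero, 0), max_int) for w in weights]
    return q, scale, zero


def quantize_weights(weights, group_size=128, n_bits=4):
    """Split a flat weight list into groups and quantize each independently."""
    assert len(weights) % group_size == 0, "length must be a multiple of group_size"
    groups = []
    for start in range(0, len(weights), group_size):
        groups.append(quantize_group(weights[start:start + group_size], n_bits))
    return groups


def dequantize(groups):
    """Recover approximate floats: w is roughly (q - zero) * scale."""
    out = []
    for q, scale, zero in groups:
        out.extend((qi - zero) * scale for qi in q)
    return out


weights = [math.sin(i) for i in range(256)]   # deterministic toy "weights"
groups = quantize_weights(weights)            # two groups of 128 values
restored = dequantize(groups)
```

Each group stores its own scale and zero point, so a smaller group size tracks the local weight range more closely at the cost of more metadata per weight.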

Each benchmark task constructs a prompt combining Image or Video media objects with task-specific text queries, resets the conversation template via clib.conv_templates, and calls model.benchmark(prompt, quant_llm) under torch.no_grad(). The four tasks are:

  • video_caption: "Elaborate on the visual and narrative elements of the video in detail."
  • video_QA: Multiple-choice question about observed actions in video.
  • image_caption: "Describe the image in detail."
  • image_QA: Multiple-choice question about text content in image.
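The task selection and prompt construction described above can be sketched as follows. The Media dataclass is a hypothetical stand-in for the Image and Video media objects, and the QA strings are placeholders because the exact question text is not reproduced on this page. The model call itself is omitted; in the script it would be model.benchmark(prompt, quant_llm) under torch.no_grad().

```python
from dataclasses import dataclass
from types import SimpleNamespace

TASKS = ("video_caption", "video_QA", "image_caption", "image_QA")


@dataclass
class Media:
    """Hypothetical stand-in for the Image / Video media objects."""
    kind: str   # "video" or "image"
    path: str


# Caption prompts are quoted from the task list; the QA entries are
# placeholders, not the questions used by the real script.
TASK_TEXT = {
    "video_caption": "Elaborate on the visual and narrative elements of the video in detail.",
    "video_QA": "<multiple-choice question about the actions in the video>",
    "image_caption": "Describe the image in detail.",
    "image_QA": "<multiple-choice question about the text in the image>",
}


def selected_tasks(args):
    """A task runs if its own flag, --all_task, or --all is set."""
    run_all = args.all or args.all_task
    return [t for t in TASKS if run_all or getattr(args, t)]


def build_prompt(task, video_path, image_path):
    """Pair the task's media object with its text query."""
    if task.startswith("video"):
        media = Media("video", video_path)
    else:
        media = Media("image", image_path)
    return [media, TASK_TEXT[task]]


args = SimpleNamespace(all=False, all_task=False, video_caption=True,
                       video_QA=False, image_caption=False, image_QA=True)
for task in selected_tasks(args):
    prompt = build_prompt(task, "demo.mp4", "demo.jpg")
    print("-" * 40)          # separator line between tasks
    print(task, "->", prompt)
```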

Usage

Run from the command line to benchmark NVILA with various quantization configurations:

# Benchmark all tasks with full quantization
python tinychat/nvila_benchmark.py \
    --model-path /path/to/nvila \
    --quant_path /path/to/quant.pt \
    --all

# Benchmark video tasks only
python tinychat/nvila_benchmark.py \
    --model-path /path/to/nvila \
    --video_caption --video_QA

# Benchmark with LLM quantization only, all tasks
python tinychat/nvila_benchmark.py \
    --model-path /path/to/nvila \
    --quant_llm --all_task

Code Reference

Source Location

Signature

def main() -> None:

Import

# CLI script, run directly:
python tinychat/nvila_benchmark.py [OPTIONS]

I/O Contract

CLI Arguments

Argument            Type  Default                          Description
--model-path, -m    str   (required)                       Path to NVILA model checkpoint
--quant_path        str   /PATH/TO/QUANT                   Path to quantized weight file
--conv-mode, -c     str   auto                             Conversation template mode
--device            str   cuda:0                           CUDA device
--act_scale_path    str   /PATH/TO/SCALE                   Path to activation scales
--quant_llm         flag  False                            Quantize the LLM backbone (W4A16)
--quant_VT          flag  False                            Quantize the vision tower (SigLIP)
--video_caption     flag  False                            Run video captioning benchmark
--video_QA          flag  False                            Run video QA benchmark
--image_caption     flag  False                            Run image captioning benchmark
--image_QA          flag  False                            Run image QA benchmark
--all               flag  False                            Enable all quantization and all tasks
--all_task          flag  False                            Run all four benchmark tasks
--fakequant_VT      flag  False                            Use fake quantization for vision tower
--video_path        str   ../figures/nvila_demo_video.mp4  Path to benchmark video
--image_path        str   ../figures/vila-logo.jpg         Path to benchmark image
--max_seq_len       int   8192                             Maximum sequence length
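The argument table maps onto a standard argparse parser. The sketch below reconstructs one from the table alone, so help strings and argument order are assumptions rather than a copy of the script; the derived names quantize_llm and quantize_vt are also illustrative.

```python
import argparse


def build_parser():
    """Parser reconstructed from the argument table (a sketch, not the script's code)."""
    p = argparse.ArgumentParser(description="NVILA benchmark (sketch)")
    p.add_argument("--model-path", "-m", type=str, required=True)
    p.add_argument("--quant_path", type=str, default="/PATH/TO/QUANT")
    p.add_argument("--conv-mode", "-c", type=str, default="auto")
    p.add_argument("--device", type=str, default="cuda:0")
    p.add_argument("--act_scale_path", type=str, default="/PATH/TO/SCALE")
    p.add_argument("--video_path", type=str, default="../figures/nvila_demo_video.mp4")
    p.add_argument("--image_path", type=str, default="../figures/vila-logo.jpg")
    p.add_argument("--max_seq_len", type=int, default=8192)
    # Boolean switches all default to False and become True when present.
    for flag in ("quant_llm", "quant_VT", "fakequant_VT",
                 "video_caption", "video_QA", "image_caption", "image_QA",
                 "all", "all_task"):
        p.add_argument(f"--{flag}", action="store_true")
    return p


args = build_parser().parse_args(["-m", "/path/to/nvila", "--quant_llm", "--all_task"])

# --all implies both quantization switches, as described for the script.
quantize_llm = args.quant_llm or args.all
quantize_vt = args.quant_VT or args.all
```

Note that argparse converts the hyphens in --model-path and --conv-mode to underscores, so the values are read as args.model_path and args.conv_mode.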

Output

Output  Description
stdout  Benchmark results printed per task with separator lines; includes model.benchmark() output (timing/throughput metrics)

Usage Examples

# Full NVILA benchmark with all quantization
python tinychat/nvila_benchmark.py \
    --model-path /models/nvila-8b \
    --quant_path /models/nvila-8b-w4-g128-awq.pt \
    --all \
    --video_path /data/test_video.mp4 \
    --image_path /data/test_image.jpg \
    --max_seq_len 8192

# Quick image-only test
python tinychat/nvila_benchmark.py \
    --model-path /models/nvila-8b \
    --image_caption --image_QA \
    --quant_llm
