Implementation:Mit han lab Llm awq InternVL Demo

Knowledge Sources	Mit_han_lab_Llm_awq
Domains	Demo, Multimodal
Last Updated	2026-02-15 00:00 GMT

Overview

Interactive multimodal chat demo for InternVL3 models with support for smooth quantization, image/video input, streaming generation, multi-turn conversation, and chunk prefilling optimization.

Description

This script provides a full-featured interactive command-line chat interface for InternVL3 vision-language models, combining model loading, quantization, media processing, and streaming text generation.

tune_intern_patch_embedding is a warmup function that runs 100 forward passes of a random 336x336 image through the vision model's patch embedding layer on the target device, so that one-time CUDA setup costs are incurred before real inference rather than during it.
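The warmup pattern can be illustrated without a GPU. The sketch below is framework-free and is not the script's code: `warm_up`, `forward`, and `make_input` are hypothetical names, and a plain nested list stands in for the random 336x336 image tensor.

```python
import random
import time
from typing import Callable, List


def warm_up(forward: Callable[[object], object],
            make_input: Callable[[], object],
            iterations: int = 100) -> List[float]:
    """Call `forward` repeatedly on one fixed input and return per-call latencies."""
    sample = make_input()  # built once, reused for every pass
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        forward(sample)
        latencies.append(time.perf_counter() - start)
    return latencies


def random_image(size: int = 336) -> List[List[float]]:
    """Stand-in for a random size x size image (assumption: single channel)."""
    return [[random.random() for _ in range(size)] for _ in range(size)]


# A trivial reduction plays the role of the patch embedding forward pass.
timings = warm_up(lambda img: sum(map(sum, img)), random_image, iterations=100)
print(f"ran {len(timings)} warmup passes")
```

In a real GPU setting the first few latencies are typically the largest, which is the cost the warmup is meant to absorb.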

main handles the complete lifecycle of the demo:

Model Loading: The model configuration is read with AutoConfig, and construction is sped up by disabling the default parameter and weight initialization routines. When --quant_llm or --all is specified, InternVL3.from_pretrained is used; otherwise, the LLM component (Qwen2ForCausalLM) is loaded separately, resized to match the tokenizer vocabulary, and passed to InternVL3. The model is then cast to half precision.

Quantization Pipeline: Three quantization options are available. --smooth_VT applies smooth quantization to the vision tower using pre-computed activation scales. --quant_llm loads AWQ-quantized weights via load_awq_model (4-bit, group size 128) and applies fused kernel replacements (make_quant_attn, make_quant_norm). --quant_VT wraps the vision encoder with QuantInternVisionEncoder. The --all flag enables all three simultaneously.
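The relationship between --all and the three individual flags can be sketched with argparse. This is a minimal illustration under the assumption that --all simply acts as a logical OR on each option; `build_quant_parser` and `resolve_quant_flags` are hypothetical helper names, not functions from the script.

```python
import argparse
from typing import Dict


def build_quant_parser() -> argparse.ArgumentParser:
    """Parser holding only the four quantization flags from the CLI table."""
    parser = argparse.ArgumentParser()
    for flag in ("--smooth_VT", "--quant_llm", "--quant_VT", "--all"):
        parser.add_argument(flag, action="store_true")
    return parser


def resolve_quant_flags(args: argparse.Namespace) -> Dict[str, bool]:
    """Return which quantization steps are active; --all switches on all three."""
    return {
        "smooth_VT": args.smooth_VT or args.all,
        "quant_llm": args.quant_llm or args.all,
        "quant_VT": args.quant_VT or args.all,
    }


parser = build_quant_parser()
print(resolve_quant_flags(parser.parse_args(["--all"])))
print(resolve_quant_flags(parser.parse_args(["--quant_llm"])))
```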

Media Preparation: The script accepts --media as one or more image (.jpg/.jpeg/.png) or video (.mp4/.mkv/.webm) files. Each is wrapped in Image or Video objects from llava.media, and model.prepare_media processes them into tensors and configurations. Terminal visualization is optionally available via vis_images.
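The extension-based dispatch described above might look like the following sketch. `classify_media` is a hypothetical helper; the case-insensitive match and the error on unknown extensions are assumptions, and the real script wraps the paths in Image or Video objects from llava.media rather than returning tuples.

```python
from pathlib import Path
from typing import List, Tuple

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
VIDEO_EXTS = {".mp4", ".mkv", ".webm"}


def classify_media(paths: List[str]) -> List[Tuple[str, str]]:
    """Label each path as 'image' or 'video' based on its file extension."""
    labelled = []
    for path in paths:
        ext = Path(path).suffix.lower()  # assumption: match case-insensitively
        if ext in IMAGE_EXTS:
            labelled.append(("image", path))
        elif ext in VIDEO_EXTS:
            labelled.append(("video", path))
        else:
            # assumption: unsupported files are rejected rather than skipped
            raise ValueError(f"Unsupported media type: {path}")
    return labelled


print(classify_media(["photo.jpg", "clip.mp4"]))
```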

Chat Loop: An interactive while-True loop reads user input, builds the prompt with get_prompter (which selects the template for the given model type), and generates a streaming response via InternVLStreamGenerator. On the first turn, media placeholders (<image> or video frame prefixes) are prepended to the user text. stream_output prints tokens to the console as they arrive and collects timing statistics. Multi-turn conversation is handled by model_prompter.update_template; when chunk_prefilling is enabled, history tokens are not recomputed on later turns. Empty input exits the loop and displays the timing statistics.
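The idea behind chunk prefilling can be shown with a simplified bookkeeping sketch: if the tokens from earlier turns already have cached key/value entries, only the newly appended tokens need to be processed. `PrefillTracker` is a hypothetical class for illustration, token IDs are plain integers, and the real generator manages an actual KV cache rather than a counter.

```python
from typing import List


class PrefillTracker:
    """Tracks how many conversation tokens are assumed to be cached already."""

    def __init__(self, chunk_prefilling: bool) -> None:
        self.chunk_prefilling = chunk_prefilling
        self.cached_len = 0

    def tokens_to_prefill(self, conversation: List[int]) -> List[int]:
        """Return the slice of the conversation that still has to be processed."""
        start = self.cached_len if self.chunk_prefilling else 0
        self.cached_len = len(conversation)
        return conversation[start:]


turn_1 = [11, 12, 13]          # first prompt
turn_2 = turn_1 + [21, 22]     # same history plus the new user message

with_chunks = PrefillTracker(chunk_prefilling=True)
with_chunks.tokens_to_prefill(turn_1)
print(len(with_chunks.tokens_to_prefill(turn_2)), "tokens prefilled with chunk prefilling")

without_chunks = PrefillTracker(chunk_prefilling=False)
without_chunks.tokens_to_prefill(turn_1)
print(len(without_chunks.tokens_to_prefill(turn_2)), "tokens prefilled without it")
```

The saving grows with conversation length, since the history is the part that keeps getting longer from turn to turn.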

Usage

Run from the command line to start an interactive InternVL3 chat session:

# Basic image chat with quantization
python tinychat/internvl_demo.py \
    --model-path /path/to/internvl3 \
    --quant_path /path/to/quant.pt \
    --media image.jpg \
    --quant_llm --quant_VT \
    --chunk_prefilling

# Video chat without quantization
python tinychat/internvl_demo.py \
    --model-path /path/to/internvl3 \
    --media video.mp4

Code Reference

Source Location

Repository: Mit_han_lab_Llm_awq
File: tinychat/internvl_demo.py
Lines: 1-270

Signature

def tune_intern_patch_embedding(vision_model, device):

def main(args):

Import

# CLI script, run directly:
python tinychat/internvl_demo.py [OPTIONS]

I/O Contract

CLI Arguments

Argument	Type	Default	Description
--model_type	str	LLaMa	Model type identifier for prompt template selection
--model-path	str	(required)	Path to InternVL3 model checkpoint
--quant_path	str	(path)	Path to AWQ quantized weight file
--act_scale_path	str	/PATH/TO/SCALE	Path to activation scales for smooth quant
--media	str (nargs=+)	None	Image or video file paths for multimodal input
--device	str	cuda	CUDA device
--max_seq_len	int	4098	Maximum sequence length / KV cache size
--single_round	flag	False	Disable multi-turn conversation memory
--vis-image	flag	False	Visualize input images in terminal
--empty-prompt	flag	False	Use empty prompt template
--flash_attn	flag	False	Enable flash attention
--chunk_prefilling	flag	False	Enable chunk prefilling for multi-turn speedup
--quant_llm	flag	False	Load AWQ-quantized LLM weights
--quant_VT	flag	False	Quantize vision tower encoder
--smooth_VT	flag	False	Apply smooth quantization to vision tower
--all	flag	False	Enable all quantization options
--fakequant_VT	flag	False	Use fake quantization for vision tower

Interactive I/O

Direction	Description
Input	User text prompts via stdin; empty input exits
Output	Streamed assistant responses to stdout with timing statistics on exit
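
The input/output contract above can be sketched as a small loop with the reader and generator injected, so it runs without a model. `chat_loop`, `read_line`, and `generate` are hypothetical names; the newline after the `<image>` placeholder is an assumption, and the echoing generator stands in for InternVLStreamGenerator.

```python
from typing import Callable, Iterable, List


def chat_loop(read_line: Callable[[], str],
              generate: Callable[[str], Iterable[str]],
              media_prefix: str = "<image>\n") -> List[str]:
    """Read prompts until an empty line; stream each reply and collect it."""
    replies = []
    first_turn = True
    while True:
        user_text = read_line()
        if not user_text:          # empty input ends the session
            break
        if first_turn:             # media placeholder only on the first turn
            user_text = media_prefix + user_text
            first_turn = False
        chunks = []
        for chunk in generate(user_text):
            print(chunk, end="", flush=True)  # stream to stdout as chunks arrive
            chunks.append(chunk)
        print()
        replies.append("".join(chunks))
    return replies


# Scripted stand-ins for stdin and the model.
scripted = iter(["What do you see in this image?", "Can you describe the colors?", ""])
transcript = chat_loop(lambda: next(scripted), lambda prompt: ["echo: ", prompt])
```

In an actual session `read_line` would wrap `input`, and timing statistics would be printed after the loop exits.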

Usage Examples

# Full quantization with image input and chunk prefilling
python tinychat/internvl_demo.py \
    --model_type LLaMa \
    --model-path /models/internvl3-8b \
    --quant_path /models/internvl3-8b-w4-g128-awq.pt \
    --act_scale_path /models/act_scales.pt \
    --media /data/photo.jpg \
    --all \
    --chunk_prefilling \
    --max_seq_len 4098

# Interactive session flow:
# USER: What do you see in this image?
# ASSISTANT: The image shows...
# USER: Can you describe the colors?
# ASSISTANT: The dominant colors are...
# USER: (empty input to exit)
# EXIT... (timing stats displayed)

Related Pages

Principle:Mit_han_lab_Llm_awq_Interactive_Multimodal_Demo
