Implementation:Mit han lab Llm awq InternVL Demo
| Knowledge Sources | |
|---|---|
| Domains | Demo, Multimodal |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Interactive multimodal chat demo for InternVL3 models with support for smooth quantization, image/video input, streaming generation, multi-turn conversation, and chunk prefilling optimization.
Description
This script provides a full-featured interactive command-line chat interface for InternVL3 vision-language models, combining model loading, quantization, media processing, and streaming text generation.
tune_intern_patch_embedding is a warmup function that runs 100 forward passes of a random 336x336 image through the vision model's patch embedding layer on the target device, allowing CUDA to optimize kernel execution for subsequent inference.
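The warmup pattern can be sketched without a GPU dependency. The helper below is a hypothetical stand-in, not the script's function: `warmup`, `fake_patch_embedding`, and `random_image` are illustrative names, and nested Python lists replace the tensor input. In the real script the callable would be the vision model's patch embedding layer and the input a random 336x336 image tensor placed on the target device.

```python
import random
import time

calls = []  # records every input shape the stand-in layer has seen


def fake_patch_embedding(image):
    # Stand-in for the real layer: only notes the height and width it received.
    shape = (len(image), len(image[0]))
    calls.append(shape)
    return shape


def random_image(size=336):
    # One channel of random values; the real input is a tensor, not a list.
    return [[random.random() for _ in range(size)] for _ in range(size)]


def warmup(fn, sample, iters=100):
    """Call fn(sample) `iters` times, discard the outputs, return elapsed seconds.

    The outputs are thrown away because the only goal is to pay one-time
    setup costs before timed inference begins.
    """
    start = time.perf_counter()
    for _ in range(iters):
        fn(sample)
    return time.perf_counter() - start
```

Calling `warmup(fake_patch_embedding, random_image())` runs the stand-in 100 times on the same sample, mirroring the pass count described above.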
main handles the complete lifecycle of the demo:
Model Loading: The InternVL3 configuration is read with AutoConfig, and model construction is sped up by disabling the parameter init functions and weight initialization. When --quant_llm or --all is specified, InternVL3.from_pretrained is used; otherwise, the LLM component (Qwen2ForCausalLM) is loaded separately, resized to match the tokenizer vocabulary, and passed to InternVL3. The model is then cast to half precision.
Quantization Pipeline: Three quantization options are available. --smooth_VT applies smooth quantization to the vision tower using pre-computed activation scales. --quant_llm loads AWQ-quantized weights via load_awq_model (4-bit, group size 128) and applies fused kernel replacements (make_quant_attn, make_quant_norm). --quant_VT wraps the vision encoder with QuantInternVisionEncoder. The --all flag enables all three simultaneously.
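How the --all shortcut relates to the three individual switches can be sketched as a small pure function. `QuantPlan` and `resolve_quant_plan` are hypothetical names introduced for illustration; the script itself may resolve the flags inline or in a different order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantPlan:
    smooth_vt: bool  # smooth quantization on the vision tower (--smooth_VT)
    quant_llm: bool  # AWQ-quantized LLM weights (--quant_llm)
    quant_vt: bool   # quantized vision encoder wrapper (--quant_VT)


def resolve_quant_plan(smooth_vt=False, quant_llm=False, quant_vt=False, all_=False):
    """Fold the --all shortcut into the three individual switches."""
    return QuantPlan(
        smooth_vt=smooth_vt or all_,
        quant_llm=quant_llm or all_,
        quant_vt=quant_vt or all_,
    )
```

With this model, `resolve_quant_plan(all_=True)` turns on every stage, while `resolve_quant_plan(quant_llm=True)` leaves the vision tower untouched.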
Media Preparation: The script accepts --media as one or more image (.jpg/.jpeg/.png) or video (.mp4/.mkv/.webm) files. Each is wrapped in Image or Video objects from llava.media, and model.prepare_media processes them into tensors and configurations. Terminal visualization is optionally available via vis_images.
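The extension-based dispatch described above might look roughly like the following. `ImageStub` and `VideoStub` are hypothetical placeholders for the Image and Video classes from llava.media, and both the case-insensitive suffix match and the ValueError for unknown extensions are assumptions rather than confirmed behavior of the script.

```python
from dataclasses import dataclass
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}
VIDEO_SUFFIXES = {".mp4", ".mkv", ".webm"}


@dataclass
class ImageStub:  # placeholder for llava.media.Image
    path: str


@dataclass
class VideoStub:  # placeholder for llava.media.Video
    path: str


def wrap_media(paths):
    """Wrap each path in ImageStub or VideoStub based on its file extension."""
    wrapped = []
    for p in paths:
        suffix = Path(p).suffix.lower()  # assumption: match case-insensitively
        if suffix in IMAGE_SUFFIXES:
            wrapped.append(ImageStub(p))
        elif suffix in VIDEO_SUFFIXES:
            wrapped.append(VideoStub(p))
        else:
            # assumption: unsupported extensions are rejected outright
            raise ValueError(f"Unsupported media type: {p}")
    return wrapped
```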
Chat Loop: An interactive while-True loop reads user input, builds the prompt with get_prompter (which selects the template matching the model type), and streams the response through InternVLStreamGenerator. On the first turn, media placeholders (<image> or video frame prefixes) are prepended to the prompt. stream_output handles real-time console display and collects timing statistics. Multi-turn conversation is maintained through model_prompter.update_template, and with --chunk_prefilling enabled the history tokens are not recomputed on later turns. Entering an empty line exits the loop and prints the timing statistics.
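The benefit of chunk prefilling across turns can be illustrated with a simple counting model. This is a sketch of the idea only, not the script's implementation: `tokens_to_prefill` is a hypothetical function, and it assumes that without chunk prefilling the whole conversation is re-encoded on every turn.

```python
def tokens_to_prefill(turn_lengths, chunk_prefilling):
    """Return, per turn, how many prompt tokens must be prefilled.

    turn_lengths[i] is the number of tokens newly added to the conversation
    at turn i. With chunk prefilling, earlier tokens are assumed to remain
    in the KV cache, so only the new chunk is processed.
    """
    history = 0
    costs = []
    for new_tokens in turn_lengths:
        if chunk_prefilling:
            costs.append(new_tokens)            # history stays cached
        else:
            costs.append(history + new_tokens)  # history is re-encoded
        history += new_tokens
    return costs
```

For a first turn of 300 tokens (mostly media) followed by two short follow-ups of 20 and 25 tokens, the model gives `[300, 20, 25]` with chunk prefilling and `[300, 320, 345]` without it.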
Usage
Run from the command line to start an interactive InternVL3 chat session:
# Basic image chat with quantization
python tinychat/internvl_demo.py \
    --model-path /path/to/internvl3 \
    --quant_path /path/to/quant.pt \
    --media image.jpg \
    --quant_llm --quant_VT \
    --chunk_prefilling

# Video chat without quantization
python tinychat/internvl_demo.py \
    --model-path /path/to/internvl3 \
    --media video.mp4
Code Reference
Source Location
- Repository: Mit_han_lab_Llm_awq
- File: tinychat/internvl_demo.py
- Lines: 1-270
Signature
def tune_intern_patch_embedding(vision_model, device):
def main(args):
Import
# CLI script, run directly:
python tinychat/internvl_demo.py [OPTIONS]
I/O Contract
CLI Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| --model_type | str | LLaMa | Model type identifier for prompt template selection |
| --model-path | str | (required) | Path to InternVL3 model checkpoint |
| --quant_path | str | (path) | Path to AWQ quantized weight file |
| --act_scale_path | str | /PATH/TO/SCALE | Path to activation scales for smooth quant |
| --media | str (nargs=+) | None | Image or video file paths for multimodal input |
| --device | str | cuda | Device on which to run the model |
| --max_seq_len | int | 4098 | Maximum sequence length / KV cache size |
| --single_round | flag | False | Disable multi-turn conversation memory |
| --vis-image | flag | False | Visualize input images in terminal |
| --empty-prompt | flag | False | Use empty prompt template |
| --flash_attn | flag | False | Enable flash attention |
| --chunk_prefilling | flag | False | Enable chunk prefilling for multi-turn speedup |
| --quant_llm | flag | False | Load AWQ-quantized LLM weights |
| --quant_VT | flag | False | Quantize vision tower encoder |
| --smooth_VT | flag | False | Apply smooth quantization to vision tower |
| --all | flag | False | Enable all quantization options |
| --fakequant_VT | flag | False | Use fake quantization for vision tower |
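The table mixes hyphenated and underscored option names, which affects how the values are read back in Python. The sketch below rebuilds a subset of the arguments with argparse to show the resulting attribute names. It assumes the script uses argparse with store_true flags, and it leaves out options whose defaults are not fully given in the table; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser for a subset of the documented options (illustrative)."""
    parser = argparse.ArgumentParser(description="InternVL3 chat demo (sketch)")
    parser.add_argument("--model_type", type=str, default="LLaMa")
    parser.add_argument("--model-path", type=str, required=True)
    parser.add_argument("--media", type=str, nargs="+", default=None)
    parser.add_argument("--device", type=str, default="cuda")
    parser.add_argument("--max_seq_len", type=int, default=4098)
    parser.add_argument("--vis-image", action="store_true")
    parser.add_argument("--chunk_prefilling", action="store_true")
    parser.add_argument("--all", action="store_true")
    return parser
```

argparse converts hyphens in option names to underscores, so `--model-path` is read as `args.model_path` and `--vis-image` as `args.vis_image`, while underscored options such as `--max_seq_len` keep their names.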
Interactive I/O
| Direction | Description |
|---|---|
| Input | User text prompts via stdin; empty input exits |
| Output | Streamed assistant responses to stdout with timing statistics on exit |
Usage Examples
# Full quantization with image input and chunk prefilling
python tinychat/internvl_demo.py \
    --model_type LLaMa \
    --model-path /models/internvl3-8b \
    --quant_path /models/internvl3-8b-w4-g128-awq.pt \
    --act_scale_path /models/act_scales.pt \
    --media /data/photo.jpg \
    --all \
    --chunk_prefilling \
    --max_seq_len 4098

# Interactive session flow:
# USER: What do you see in this image?
# ASSISTANT: The image shows...
# USER: Can you describe the colors?
# ASSISTANT: The dominant colors are...
# USER: (empty input to exit)
# EXIT... (timing stats displayed)