Implementation:Mit han lab Llm awq InternVL Demo

From Leeroopedia
Knowledge Sources
Domains: Demo, Multimodal
Last Updated: 2026-02-15 00:00 GMT

Overview

Interactive multimodal chat demo for InternVL3 models with support for smooth quantization, image/video input, streaming generation, multi-turn conversation, and chunk prefilling optimization.

Description

This script provides a full-featured interactive command-line chat interface for InternVL3 vision-language models, combining model loading, quantization, media processing, and streaming text generation.

tune_intern_patch_embedding is a warmup function: it runs 100 forward passes of a random 336x336 image through the vision model's patch embedding layer on the target device, so that one-time CUDA startup costs (for example kernel selection) are incurred before real inference begins.
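A warmup of this kind can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name warm_up_patch_embedding, the image_size and iters parameters, and the small Conv2d used as a stand-in for the patch embedding layer are all assumptions made for the example.

```python
import torch
import torch.nn as nn


def warm_up_patch_embedding(patch_embedding, device, image_size=336, iters=100):
    """Run repeated forward passes on a random image (hypothetical helper)."""
    # Match the dtype of the layer so the dummy input is accepted as-is.
    dtype = next(patch_embedding.parameters()).dtype
    dummy = torch.randn(1, 3, image_size, image_size, device=device, dtype=dtype)
    with torch.no_grad():
        for _ in range(iters):
            out = patch_embedding(dummy)
    # On CUDA, wait for queued kernels so the warmup has really finished.
    if torch.device(device).type == "cuda":
        torch.cuda.synchronize()
    return tuple(out.shape)


# Stand-in patch embedding (assumption): 14x14 patches, 8 output channels.
embed = nn.Conv2d(3, 8, kernel_size=14, stride=14)
shape = warm_up_patch_embedding(embed, "cpu", iters=3)
```

The example runs on the CPU with three iterations so it stays cheap; on a GPU one would pass the CUDA device and the real patch embedding module instead.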

main handles the complete lifecycle of the demo:

Model Loading: The model configuration is read via AutoConfig, and loading is accelerated by disabling parameter init functions and weight initialization, since the checkpoint overwrites those values anyway. When --quant_llm or --all is specified, InternVL3.from_pretrained is used; otherwise, the LLM component (Qwen2ForCausalLM) is loaded separately, resized to match the tokenizer vocabulary, and passed to InternVL3. The model is then cast to half precision.
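One common way to skip parameter initialization in PyTorch is to replace the torch.nn.init functions with no-ops while the model is constructed. The sketch below illustrates that idea only; the context manager skip_param_init and the list of patched functions are assumptions, and the script may disable initialization differently.

```python
from contextlib import contextmanager

import torch.nn as nn


@contextmanager
def skip_param_init(names=("kaiming_uniform_", "uniform_", "normal_")):
    """Temporarily replace torch.nn.init functions with no-ops (hypothetical helper)."""

    def _noop(tensor, *args, **kwargs):
        # Leave the tensor untouched; checkpoint weights will overwrite it later.
        return tensor

    saved = {name: getattr(nn.init, name) for name in names}
    try:
        for name in names:
            setattr(nn.init, name, _noop)
        yield
    finally:
        # Always restore the original functions, even if construction fails.
        for name, fn in saved.items():
            setattr(nn.init, name, fn)


original_uniform = nn.init.uniform_
with skip_param_init():
    patched = nn.init.uniform_ is not original_uniform
    layer = nn.Linear(64, 64)  # built without running the usual random init
restored = nn.init.uniform_ is original_uniform
```

Restoring the originals in a finally block keeps the patch from leaking into code that does rely on random initialization.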

Quantization Pipeline: Three quantization options are available. --smooth_VT applies smooth quantization to the vision tower using pre-computed activation scales. --quant_llm loads AWQ-quantized weights via load_awq_model (4-bit, group size 128) and applies fused kernel replacements (make_quant_attn, make_quant_norm). --quant_VT wraps the vision encoder with QuantInternVisionEncoder. The --all flag enables all three simultaneously.
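The relationship between --all and the three individual flags can be sketched with argparse. The flag names come from the argument table on this page; the helper names build_parser and resolve_quant_flags are hypothetical, and the real script may expand --all in a different place.

```python
import argparse


def build_parser():
    """Parser with only the quantization flags described above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--quant_llm", action="store_true")
    parser.add_argument("--quant_VT", action="store_true")
    parser.add_argument("--smooth_VT", action="store_true")
    parser.add_argument("--all", action="store_true")
    return parser


def resolve_quant_flags(args):
    """Expand --all into the three individual options (hypothetical helper)."""
    if args.all:
        args.quant_llm = args.quant_VT = args.smooth_VT = True
    return args


all_args = resolve_quant_flags(build_parser().parse_args(["--all"]))
llm_only = resolve_quant_flags(build_parser().parse_args(["--quant_llm"]))
```

Expanding --all once, right after parsing, means later code only has to check the individual flags.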

Media Preparation: The script accepts one or more files via --media, either images (.jpg/.jpeg/.png) or videos (.mp4/.mkv/.webm). Each file is wrapped in an Image or Video object from llava.media, and model.prepare_media converts them into tensors and configurations. With --vis-image, the input images can also be rendered in the terminal via vis_images.
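The extension-based split between images and videos might look like the sketch below. The extension lists are taken from the text above; the helper classify_media, the case-insensitive matching, and the error for unknown extensions are assumptions. The wrapping in llava.media objects is left out so the example stays self-contained.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
VIDEO_EXTS = {".mp4", ".mkv", ".webm"}


def classify_media(path):
    """Return "image" or "video" based on the file extension (hypothetical helper)."""
    ext = Path(path).suffix.lower()  # case-insensitive match is an assumption
    if ext in IMAGE_EXTS:
        return "image"
    if ext in VIDEO_EXTS:
        return "video"
    raise ValueError(f"Unsupported media type: {path}")


kinds = [classify_media(p) for p in ["photo.JPG", "clip.mp4", "scan.png"]]
```

In the real script, each path classified this way would then be wrapped in the matching Image or Video object before being passed to model.prepare_media.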

Chat Loop: An interactive while True loop reads user input, builds the prompt with get_prompter (which selects the template based on model type), and generates a streaming response via InternVLStreamGenerator. On the first turn, media placeholders (<image> or video frame prefixes) are prepended to the prompt. stream_output handles real-time console display and timing statistics. Multi-turn conversation is supported through model_prompter.update_template, and with --chunk_prefilling the history tokens from earlier turns are not recomputed. Empty input ends the session and prints the timing statistics.
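The effect of chunk prefilling on a multi-turn session can be illustrated with a toy simulation that only counts tokens. Everything here is an assumption made for illustration: the function simulate_chat, the whitespace "tokenizer", the USER/ASSISTANT template, and the fixed one-token reply. No model, prompter, or KV cache from the actual script is involved.

```python
def simulate_chat(user_turns, chunk_prefilling, media_prefix="<image>\n"):
    """Return how many tokens would be prefilled on each turn (toy model)."""
    history_tokens = []  # every token of the conversation so far
    cached = 0           # leading tokens already held in the (imaginary) KV cache
    prefilled_per_turn = []
    for turn_index, user_text in enumerate(user_turns):
        if not user_text:
            break  # empty input ends the session
        if turn_index == 0:
            user_text = media_prefix + user_text  # media placeholder on first turn
        history_tokens += f"USER: {user_text} ASSISTANT:".split()
        # With chunk prefilling, only tokens past the cached prefix are processed.
        start = cached if chunk_prefilling else 0
        prefilled_per_turn.append(len(history_tokens) - start)
        history_tokens += ["ok"]  # placeholder for the streamed reply
        cached = len(history_tokens)
    return prefilled_per_turn


turns = ["What is this?", "And the colors?", ""]
with_chunks = simulate_chat(turns, chunk_prefilling=True)
without_chunks = simulate_chat(turns, chunk_prefilling=False)
```

In this toy run the second turn prefills 5 tokens with chunk prefilling and 12 without it, because the whole history is reprocessed in the latter case. The gap grows with every additional turn.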

Usage

Run from the command line to start an interactive InternVL3 chat session:

# Basic image chat with quantization
python tinychat/internvl_demo.py \
    --model-path /path/to/internvl3 \
    --quant_path /path/to/quant.pt \
    --media image.jpg \
    --quant_llm --quant_VT \
    --chunk_prefilling

# Video chat without quantization
python tinychat/internvl_demo.py \
    --model-path /path/to/internvl3 \
    --media video.mp4

Code Reference

Source Location

Signature

def tune_intern_patch_embedding(vision_model, device):

def main(args):

Import

# CLI script, run directly:
python tinychat/internvl_demo.py [OPTIONS]

I/O Contract

CLI Arguments

Argument | Type | Default | Description
--model_type | str | LLaMa | Model type identifier for prompt template selection
--model-path | str | (required) | Path to InternVL3 model checkpoint
--quant_path | str | (path) | Path to AWQ quantized weight file
--act_scale_path | str | /PATH/TO/SCALE | Path to activation scales for smooth quant
--media | str (nargs=+) | None | Image or video file paths for multimodal input
--device | str | cuda | CUDA device
--max_seq_len | int | 4098 | Maximum sequence length / KV cache size
--single_round | flag | False | Disable multi-turn conversation memory
--vis-image | flag | False | Visualize input images in terminal
--empty-prompt | flag | False | Use empty prompt template
--flash_attn | flag | False | Enable flash attention
--chunk_prefilling | flag | False | Enable chunk prefilling for multi-turn speedup
--quant_llm | flag | False | Load AWQ-quantized LLM weights
--quant_VT | flag | False | Quantize vision tower encoder
--smooth_VT | flag | False | Apply smooth quantization to vision tower
--all | flag | False | Enable all quantization options
--fakequant_VT | flag | False | Use fake quantization for vision tower

Interactive I/O

Direction | Description
Input | User text prompts via stdin; empty input exits
Output | Streamed assistant responses to stdout, with timing statistics on exit

Usage Examples

# Full quantization with image input and chunk prefilling
python tinychat/internvl_demo.py \
    --model_type LLaMa \
    --model-path /models/internvl3-8b \
    --quant_path /models/internvl3-8b-w4-g128-awq.pt \
    --act_scale_path /models/act_scales.pt \
    --media /data/photo.jpg \
    --all \
    --chunk_prefilling \
    --max_seq_len 4098

# Interactive session flow:
# USER: What do you see in this image?
# ASSISTANT: The image shows...
# USER: Can you describe the colors?
# ASSISTANT: The dominant colors are...
# USER: (empty input to exit)
# EXIT... (timing stats displayed)
