Workflow:Mit han lab Llm awq TinyChat LLM Deployment

From Leeroopedia
Knowledge Sources
Domains: LLMs, Inference, Edge_Deployment
Last Updated: 2025-04-01 00:00 GMT

Overview

End-to-end process for deploying AWQ-quantized language models as interactive chatbots using TinyChat, with optimized CUDA kernels for fast inference on both cloud and edge GPUs.

Description

This workflow takes a pre-quantized AWQ checkpoint and deploys it as an interactive text-generation chatbot through the TinyChat inference engine. TinyChat replaces standard HuggingFace model layers with fused CUDA kernels for attention, MLP, and normalization, achieving a 2-3x speedup over FP16 baselines. It also provides pre-allocated KV caches, chunk prefilling for efficient multi-round dialogue, and FlashAttention integration. Supported model architectures are LLaMA, Qwen2, Falcon, and MPT, all in W4A16 precision (4-bit weights, 16-bit activations).

Usage

Use this workflow when you have an AWQ-quantized checkpoint (.pt file) and want to run interactive text generation on a GPU. It suits low-latency chatbot deployment on consumer GPUs (RTX 3090/4090), server GPUs (A100), and edge devices (NVIDIA Jetson Orin).

Execution Steps

Step 1: Prepare Quantized Checkpoint

Obtain or generate the AWQ-quantized model checkpoint. Either run the AWQ Model Quantization workflow to produce a checkpoint, or download pre-quantized weights from the AWQ Model Zoo or HuggingFace Hub. Ensure the checkpoint is in v2 format (filename ending in v2.pt). For older v1 checkpoints, use the offline weight repacker to convert them.

Key considerations:

  • The original HuggingFace model directory is still needed for tokenizer and config loading
  • For edge devices with shared host/device memory, split the checkpoint into per-layer shards using split_ckpt.py to reduce peak memory during loading
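The two checks above can be sketched in plain Python. This is not split_ckpt.py or the repacker; it is a minimal illustration that assumes LLaMA-style key names of the form model.layers.<index>.… and uses placeholder values instead of tensors. The helper names is_v2_checkpoint and group_by_layer are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: per-layer keys look like "model.layers.<index>.<rest>".
# Everything else (embeddings, final norm, lm_head) goes into a "misc" shard.
_LAYER_KEY = re.compile(r"^model\.layers\.(\d+)\.")


def is_v2_checkpoint(path):
    """Return True if the filename follows the 'ends in v2.pt' convention."""
    return path.endswith("v2.pt")


def group_by_layer(state_dict):
    """Group checkpoint entries into one shard per transformer layer."""
    shards = defaultdict(dict)
    for key, value in state_dict.items():
        match = _LAYER_KEY.match(key)
        shard_name = f"layer_{int(match.group(1))}" if match else "misc"
        shards[shard_name][key] = value
    return dict(shards)
```

Loading one shard at a time means only a single layer's weights need to be resident in host memory before they are copied to the device, which is what lowers the peak on shared-memory devices.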

Step 2: Initialize Model Architecture

Load the model configuration from the original HuggingFace model path, then instantiate the TinyChat-specific model architecture (e.g., LlamaForCausalLM from tinychat.models). The TinyChat model implementations include pre-allocated KV caches and use custom forward passes optimized for sequential token generation.

Key considerations:

  • TinyChat has its own model implementations separate from HuggingFace (tinychat/models/)
  • Maximum sequence length and batch size must be set before model instantiation as they determine KV cache allocation
  • Neural network weight initialization is skipped to speed up model creation, since the checkpoint overwrites the weights anyway
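To see why sequence length and batch size must be fixed up front, it helps to estimate how much memory a pre-allocated KV cache takes. The function below is a back-of-the-envelope sketch, not TinyChat's allocator; it assumes one key tensor and one value tensor per layer stored in FP16. The example dimensions (32 layers, 32 heads, head dimension 128) match the published LLaMA-2-7B configuration, while the 2048-token limit is just an illustrative choice.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   max_seq_len, batch_size, bytes_per_elem=2):
    """Estimate the size of a pre-allocated KV cache in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes FP16 cache entries.
    """
    return (2 * num_layers * batch_size * max_seq_len
            * num_kv_heads * head_dim * bytes_per_elem)


size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      max_seq_len=2048, batch_size=1)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

The estimate scales linearly with both max_seq_len and batch_size, so doubling either doubles the allocation.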

Step 3: Load Quantized Weights

Load the INT4-packed weights from the AWQ checkpoint into the TinyChat model. Linear layers are replaced with WQLinear modules that dispatch to efficient CUDA kernels for W4A16 matrix multiplication. For LLaMA and Qwen2 models, a fast loading path directly replaces nn.Linear with WQLinear modules before loading weights.

Key considerations:

  • Memory-efficient loading mode is available for edge devices, loading weights layer by layer
  • The loader automatically handles both .pt and .safetensors formats
  • Weight version compatibility is checked automatically
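As a rough picture of what "INT4-packed" means, the sketch below packs eight 4-bit values into one 32-bit integer and dequantizes them with a scale and zero point. It is an illustration only: the simple sequential nibble order used here is an assumption, and the actual AWQ kernels use their own packing layout, which is part of what distinguishes checkpoint versions. The function names are hypothetical.

```python
def pack_int4(values):
    """Pack eight unsigned 4-bit values (0..15) into one 32-bit integer.

    Assumption: value i occupies bits [4*i, 4*i + 3]; real kernels may
    interleave the nibbles differently.
    """
    if len(values) != 8 or any(not 0 <= v <= 15 for v in values):
        raise ValueError("expected eight values in the range 0..15")
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (4 * i)
    return packed


def unpack_int4(packed):
    """Recover the eight 4-bit values from a packed 32-bit integer."""
    return [(packed >> (4 * i)) & 0xF for i in range(8)]


def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to real values: (q - zero) * scale."""
    return [(q - zero_point) * scale for q in q_values]
```

In W4A16 inference the dequantization happens inside the matrix-multiply kernel, so the weights stay packed in memory and only 16-bit activations flow between layers.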

Step 4: Apply Kernel Optimizations

Replace the model's attention and normalization layers with fused CUDA implementations. The fused attention module combines query/key/value projection, rotary position embedding, and multi-head attention into a single kernel call. Normalization layers are replaced with CUDA-accelerated RMSNorm/LayerNorm implementations. A device warmup pass is run to prepare the GPU.

Key considerations:

  • FlashAttention can be enabled for faster prefilling of long contexts
  • Fused kernels are currently supported for LLaMA and Qwen2 architectures
  • A dummy forward pass is run after optimization to ensure all CUDA kernels are compiled
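When swapping in a fused normalization kernel, a slow reference implementation is handy for checking its output. The following is a plain-Python RMSNorm over a single vector, written from the standard definition rather than taken from TinyChat; the eps default is an assumption and should match the model's config in practice.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps) * weight, elementwise."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

Comparing a fused kernel against a reference like this on a few random inputs, within an FP16-appropriate tolerance, is a cheap way to catch a mismatched eps or weight layout.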

Step 5: Configure Prompt Template

Set up the appropriate conversation prompt template for the target model. Different model families use different chat templates (e.g., LLaMA-2 uses [INST] tags, Vicuna uses "A chat between a curious user and an AI assistant"). The prompter manages conversation history for multi-round dialogue and applies the correct formatting.

Key considerations:

  • Stop tokens are model-specific and must be configured correctly to detect generation completion
  • Short prompt mode is activated for models with max sequence length under 1024 tokens
  • The prompt template preserves conversation history for multi-turn interactions when chunk prefilling is enabled
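A prompter of this kind can be sketched as a small class that stores past turns and wraps each user message in the model's tags. The example below uses the LLaMA-2 [INST] tags mentioned above in a simplified form (no system prompt, and the exact whitespace and special tokens are assumptions). The class name Llama2StylePrompter and its methods are hypothetical, not TinyChat's API.

```python
class Llama2StylePrompter:
    """Simplified [INST]-style prompt builder with conversation history."""

    def __init__(self, stop_tokens=("</s>",)):
        self.turns = []                 # list of (user, assistant) pairs
        self.stop_tokens = stop_tokens  # model-specific end markers

    def build_prompt(self, user_message):
        """Format all earlier turns plus the new user message."""
        parts = [f"[INST] {user} [/INST] {assistant} "
                 for user, assistant in self.turns]
        parts.append(f"[INST] {user_message} [/INST]")
        return "".join(parts)

    def record_turn(self, user_message, reply):
        """Store a completed exchange so later prompts include it."""
        self.turns.append((user_message, reply))

    def trim_at_stop(self, text):
        """Cut generated text at the earliest stop token, if any appears."""
        cut = len(text)
        for token in self.stop_tokens:
            index = text.find(token)
            if index != -1:
                cut = min(cut, index)
        return text[:cut]
```

Because each new prompt begins with the exact text of the previous one, the earlier turns form a stable prefix, which is the property that KV-cache reuse across turns relies on.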

Step 6: Run Interactive Generation Loop

Enter the interactive chat loop where user input is formatted through the prompt template, tokenized, and fed to the model. The StreamGenerator produces tokens one at a time using the model's forward pass with KV cache, applying sampling strategies (temperature, top-k, top-p, repetition penalty). Output is streamed to the terminal with real-time display and timing statistics.
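The sampling step can be illustrated without any model. The function below applies a repetition penalty, temperature scaling, top-k filtering, and top-p (nucleus) filtering to a list of logits and then draws one token id. It is a minimal pure-Python sketch of these standard techniques, not StreamGenerator's code; the default values and the divide-or-multiply form of the repetition penalty are assumptions.

```python
import math
import random


def sample_next_token(logits, prev_tokens, temperature=0.7, top_k=40,
                      top_p=0.95, repetition_penalty=1.1, rng=random):
    """Pick the next token id from raw logits."""
    logits = list(logits)
    # Repetition penalty: make already-generated tokens less likely.
    for t in set(prev_tokens):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    # Temperature 0 (or below) is treated as greedy decoding.
    if temperature <= 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [value / temperature for value in logits]
    # Top-k: keep only the k highest-scoring candidates.
    order = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Softmax over the remaining candidates (shifted for stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    # Top-p: keep the smallest prefix whose cumulative probability >= top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for i, w in zip(order, weights):
        prob = w / total
        kept_ids.append(i)
        kept_probs.append(prob)
        cumulative += prob
        if cumulative >= top_p:
            break
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]
```

Passing a seeded random.Random instance as rng makes runs reproducible, which helps when comparing outputs before and after a kernel change.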

Key considerations:

  • Chunk prefilling reuses KV cache from previous turns, avoiding redundant computation (up to 11x speedup in multi-round dialogue)
  • Generation stops on model-specific stop tokens or maximum token limit
  • Timing statistics report TTFT (time to first token) and per-token generation speed
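The saving from chunk prefilling comes from processing only the tokens that are not already in the KV cache. One way to picture the bookkeeping is as a common-prefix comparison between the cached token ids and the new prompt's token ids. This is an illustrative sketch with a hypothetical function name; TinyChat may track the reusable portion differently, for example with a running start position.

```python
def split_for_chunk_prefill(cached_ids, prompt_ids):
    """Return (number of reusable cached tokens, tokens still to prefill)."""
    reuse = 0
    for cached, new in zip(cached_ids, prompt_ids):
        if cached != new:
            break
        reuse += 1
    return reuse, prompt_ids[reuse:]
```

For example, if a conversation already has 1000 tokens cached and the next turn appends 50 tokens, only those 50 go through the prefill pass instead of all 1050.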
