Workflow:Mit han lab Llm awq AWQ Model Quantization

Knowledge Sources	llm-awq, AWQ: Activation-aware Weight Quantization, AWQ Model Zoo
Domains	LLMs, Model_Compression, Quantization
Last Updated	2025-04-01 00:00 GMT

Overview

End-to-end process for compressing large language model weights from FP16 to INT4 using Activation-aware Weight Quantization (AWQ), producing quantized checkpoints suitable for efficient inference.

Description

This workflow implements the AWQ quantization pipeline, which cuts the memory needed for LLM weights by roughly 4x (16-bit to 4-bit, slightly less once per-group scales and zero-points are counted) while preserving model accuracy. The method identifies salient weight channels (those that see large activation magnitudes) and protects them during quantization through per-channel scaling. It also applies weight clipping to further reduce quantization error. The process covers calibration data preparation, activation-aware scale search, weight clipping optimization, and final weight packing into INT4 format. The resulting quantized checkpoint can be used with TinyChat for fast inference or converted to HuggingFace format for broader deployment.

Supported models: LLaMA 1/2/3, Qwen 2/2.5, OPT, CodeLlama, StarCoder, Vicuna, Falcon, MPT, DeepSeek-R1, LLaVA, VILA, NVILA, InternVL3.

Usage

Execute this workflow when you have a pretrained LLM (or VLM) in HuggingFace format and need to reduce its memory footprint for deployment on resource-constrained hardware (e.g., consumer GPUs with less than 24GB VRAM, or edge devices like NVIDIA Jetson Orin). The output is a quantized checkpoint file (.pt) containing INT4-packed weights with associated scales and zero points.

Execution Steps

Step 1: Environment Setup

Install the AWQ package and its dependencies, including the custom CUDA kernels for quantized operations. This involves cloning the repository, creating a conda environment, installing the Python package in editable mode, and building the CUDA extension for W4A16 inference kernels.

Key considerations:

Python 3.10 is recommended
CUDA toolkit must be available for kernel compilation
For edge devices (Jetson Orin), modify pyproject.toml to remove the torch dependency and install PyTorch from NVIDIA prebuilt binaries
FlashAttention is optional but recommended for inference speed
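
Before building the kernels, it can help to confirm that the local setup matches the considerations above. The following is a minimal pre-flight sketch using only the Python standard library. The function name preflight is hypothetical and not part of the AWQ package, and finding nvcc on PATH is only a rough proxy for "CUDA toolkit available".

```python
import shutil
import sys


def preflight(recommended=(3, 10)):
    """Report whether the interpreter and CUDA compiler look usable.

    Hypothetical helper for illustration; not part of llm-awq.
    """
    return {
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
        # True only when the running interpreter is the recommended release.
        "python_matches_recommended": tuple(sys.version_info[:2]) == tuple(recommended),
        # Path to nvcc if it is on PATH, otherwise None.
        "nvcc_path": shutil.which("nvcc"),
    }


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

A None value for nvcc_path suggests the CUDA extension build is likely to fail until the toolkit is installed or added to PATH.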

Step 2: Calibration Data Preparation

Load a small calibration dataset to measure activation magnitudes across the model. The default calibration set is the Pile validation split (pileval). Samples are tokenized, concatenated, and split into fixed-length blocks for uniform processing through the model layers.

Key considerations:

By default, 128 samples are drawn from the Pile validation set and the block length is 512 tokens
Only textual calibration data is needed, even for multimodal models
The calibration data drives the activation-aware scaling search but does not train the model
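
The tokenize, concatenate, and split procedure described in this step can be sketched as follows. This is an illustration under stated assumptions rather than the repository's loader: build_calibration_blocks and toy_tokenize are hypothetical names, and the toy tokenizer stands in for a real HuggingFace tokenizer so that the example runs without downloads.

```python
from typing import Callable, Iterable, List


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (hypothetical): one integer id per whitespace word."""
    return [sum(map(ord, word)) % 50_000 for word in text.split()]


def build_calibration_blocks(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    n_samples: int = 128,
    block_size: int = 512,
) -> List[List[int]]:
    """Tokenize up to n_samples texts, concatenate, and cut fixed-length blocks."""
    stream: List[int] = []
    used = 0
    for text in texts:
        if used == n_samples:
            break
        ids = tokenize(text.strip())
        if not ids:
            continue  # skip empty samples
        stream.extend(ids)
        used += 1
    n_blocks = len(stream) // block_size  # the ragged tail is dropped
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Tiny demonstration: 5 + 3 tokens split into blocks of 4.
blocks = build_calibration_blocks(
    ["a b c d e", "f g h"], toy_tokenize, n_samples=2, block_size=4
)
```

Every block has the same length, so the blocks can be stacked into one batch tensor and pushed through the model layer by layer.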

Step 3: AWQ Scale and Clip Search

Run the core AWQ search algorithm layer by layer through the model. For each transformer block, capture input activations to all linear layers, then search for optimal per-channel scaling factors that minimize quantization error on salient channels. Additionally, search for optimal weight clipping ranges to further reduce quantization loss. The search results (scales and clips) are saved to a cache file for reuse.

What happens per layer:

Hook into all linear layers to capture input activations
Run the activation-aware scaling search (grid search over scaling ratios)
Run the MSE-based weight clipping optimization
Record optimal scales and clips for each layer
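
The scaling search in the list above can be sketched with NumPy for a single linear layer. This is a simplified illustration, not the repository's code: the function names are hypothetical, the per-channel magnitude is taken as the mean absolute activation, and the grid covers exponents in [0, 1). An exponent of 0 gives all-ones scales, so the search can never do worse than plain round-to-nearest on the calibration data.

```python
import numpy as np


def pseudo_quantize(w, n_bits=4, group_size=128):
    """Asymmetric group-wise round-to-nearest; returns dequantized weights."""
    q_max = 2 ** n_bits - 1
    g = w.reshape(-1, group_size)
    w_max = g.max(axis=1, keepdims=True)
    w_min = g.min(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-5) / q_max
    zero = np.clip(np.round(-w_min / scale), 0, q_max)
    q = np.clip(np.round(g / scale) + zero, 0, q_max)
    return ((q - zero) * scale).reshape(w.shape)


def search_scale(w, x, n_grid=20, n_bits=4, group_size=128):
    """Grid-search the exponent `ratio` in s = mean|x| ** ratio.

    w: (out_features, in_features) weights.
    x: (tokens, in_features) calibration activations.
    """
    act_mag = np.abs(x).mean(axis=0)          # per-input-channel magnitude
    ref_out = x @ w.T                         # full-precision reference output
    best_loss, best_ratio, best_scales = np.inf, None, None
    for i in range(n_grid):
        ratio = i / n_grid
        s = np.maximum(act_mag ** ratio, 1e-4)
        s = s / np.sqrt(s.max() * s.min())    # keep the scales centred around 1
        # Scale weights up, quantize, then fold the scale back out.
        w_q = pseudo_quantize(w * s, n_bits, group_size) / s
        loss = float(np.mean((ref_out - x @ w_q.T) ** 2))
        if loss < best_loss:
            best_loss, best_ratio, best_scales = loss, ratio, s
    return best_ratio, best_scales, best_loss
```

Channels with large activations receive scales above 1, which shrinks their relative rounding error after the scale is folded back out.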

Key considerations:

This step is compute-intensive but only runs once per model
Pre-computed search results are available for many popular models in the AWQ Model Zoo
Results are saved as a .pt file containing scale and clip tensors
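
The MSE-based clipping search from this step can be sketched in the same style. The version below is an assumption-laden illustration: search_clip and quantize_groups are hypothetical names, and the shrink grid (20 steps, at most 50% shrinkage) is an illustrative choice. For every output row and weight group it tries progressively smaller symmetric clipping thresholds and keeps the one with the lowest output error on the calibration activations.

```python
import numpy as np


def quantize_groups(g, n_bits=4):
    """Asymmetric round-to-nearest over the last axis (one group per row)."""
    q_max = 2 ** n_bits - 1
    w_max = g.max(axis=-1, keepdims=True)
    w_min = g.min(axis=-1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-5) / q_max
    zero = np.clip(np.round(-w_min / scale), 0, q_max)
    q = np.clip(np.round(g / scale) + zero, 0, q_max)
    return (q - zero) * scale


def search_clip(w, x, group_size=128, n_grid=20, max_shrink=0.5, n_bits=4):
    """Return per-group clipping thresholds of shape (out, in // group_size).

    w: (out_features, in_features); x: (tokens, in_features).
    """
    out_f, in_f = w.shape
    n_groups = in_f // group_size
    wg = w.reshape(out_f, n_groups, group_size)
    xg = x.reshape(-1, n_groups, group_size)
    # Reference partial outputs per group: (tokens, out, n_groups).
    ref = np.einsum("tgk,ogk->tog", xg, wg)
    org_max = np.abs(wg).max(axis=2)
    best_err = np.full(org_max.shape, np.inf)
    best_max = org_max.copy()
    for i in range(int(max_shrink * n_grid)):
        cur_max = org_max * (1 - i / n_grid)          # i = 0 means no clipping
        clipped = np.clip(wg, -cur_max[..., None], cur_max[..., None])
        out = np.einsum("tgk,ogk->tog", xg, quantize_groups(clipped, n_bits))
        err = ((out - ref) ** 2).mean(axis=0)         # (out, n_groups)
        better = err < best_err
        best_err[better] = err[better]
        best_max[better] = cur_max[better]
    return best_max, best_err
```

Clipping trades a larger error on a few outliers for a finer quantization grid on the remaining weights in the group, which is why a threshold below the group maximum can win.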

Step 4: Apply AWQ Transforms

Load the previously saved AWQ search results and apply the optimal scaling factors and weight clips to the full-precision model. This transforms the weight matrices so that salient channels are better preserved during the subsequent quantization step.

Key considerations:

This step modifies the FP16 weights in-place before quantization
Scales are applied in pairs: the layer's weights are multiplied by the scales and the preceding operation's output is divided by them, so the full-precision output is mathematically unchanged
Clips truncate weight outliers to the optimized range
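
The equivalence behind the scale transform and the effect of clipping can be checked numerically. The sketch below assumes the simplest case, two stacked linear layers with no nonlinearity between them. The names apply_scale and apply_clip are chosen for this illustration and work on plain arrays, not on model modules.

```python
import numpy as np


def apply_scale(prev_weight, weight, scales):
    """Fold per-channel scales into a linear layer and the layer feeding it.

    prev_weight: (hidden, in_prev), produces the activations.
    weight:      (out, hidden), consumes them.
    scales:      (hidden,), one positive scale per shared channel.
    """
    prev_scaled = prev_weight / scales[:, None]   # activations shrink by s
    weight_scaled = weight * scales[None, :]      # input channels grow by s
    return prev_scaled, weight_scaled


def apply_clip(weight, max_val, group_size=128):
    """Clamp each weight group to [-max_val, max_val].

    max_val: (out, in // group_size) thresholds from the clip search.
    """
    out_f, in_f = weight.shape
    g = weight.reshape(out_f, in_f // group_size, group_size)
    g = np.clip(g, -max_val[..., None], max_val[..., None])
    return g.reshape(out_f, in_f)


rng = np.random.default_rng(0)
prev = rng.normal(size=(256, 32))
w = rng.normal(size=(16, 256))
s = rng.uniform(0.5, 2.0, size=256)
x = rng.normal(size=(4, 32))

prev_s, w_s = apply_scale(prev, w, s)
# Outputs agree before and after scaling, up to floating-point error.
same_output = np.allclose(x @ prev.T @ w.T, x @ prev_s.T @ w_s.T)
```

Scaling alone changes nothing in full precision. The benefit appears only after quantization, when the scaled-up salient channels lose less relative precision.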

Step 5: Weight Quantization and Packing

Quantize the transformed model weights from FP16 to INT4 with group-wise quantization. Each group of weights (default 128 per group) shares a scale and zero-point. The quantized weights are packed into a compact format suitable for efficient CUDA kernel dispatch during inference.

Two modes are available:

Pseudo quantization (fake): Simulates quantization by rounding weights but keeping them in FP16 format. Used for evaluation only.
Real quantization: Packs weights into actual INT4 representation with WQLinear modules. Produces the final checkpoint for deployment.
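
The difference between the two modes can be illustrated with NumPy. The sketch below quantizes a weight matrix group-wise to 4-bit integer codes, packs eight codes into each 32-bit word, and dequantizes again. The packing order (lowest nibble first) is an assumption made for readability and is not the interleaved layout used by the WQLinear CUDA kernels. All function names are hypothetical.

```python
import numpy as np


def quantize_int4(w, group_size=128):
    """Group-wise asymmetric quantization: integer codes, scales, zero-points."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_max = g.max(axis=2, keepdims=True)
    w_min = g.min(axis=2, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-5) / 15.0
    zero = np.clip(np.round(-w_min / scale), 0, 15)
    codes = np.clip(np.round(g / scale) + zero, 0, 15).astype(np.uint32)
    return codes.reshape(out_f, in_f), scale[..., 0], zero[..., 0]


def dequantize(codes, scale, zero, group_size=128):
    """Map integer codes back to floats; this is what pseudo quantization keeps."""
    out_f, in_f = codes.shape
    g = codes.reshape(out_f, in_f // group_size, group_size).astype(np.float64)
    return ((g - zero[..., None]) * scale[..., None]).reshape(out_f, in_f)


def pack_int4(codes):
    """Pack eight 4-bit codes into each uint32, lowest nibble first."""
    out_f, in_f = codes.shape
    nibbles = codes.reshape(out_f, in_f // 8, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4
    return np.bitwise_or.reduce(nibbles << shifts, axis=2)


def unpack_int4(packed):
    """Inverse of pack_int4."""
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = (packed[..., None] >> shifts) & np.uint32(0xF)
    return nibbles.reshape(packed.shape[0], -1)


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 256))
codes, scale, zero = quantize_int4(w)
packed = pack_int4(codes)                                # "real" storage
w_fake = dequantize(codes, scale, zero)                  # "pseudo" weights
w_real = dequantize(unpack_int4(packed), scale, zero)    # decoded from packed
```

Both paths decode to the same values. Only the storage differs: the packed array holds the codes in one quarter of the bytes that the same weights would need in FP16.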

Key considerations:

Default configuration is 4-bit with group size 128 and zero-point enabled
The output file is automatically named with a v2 suffix for the current weight format
3-bit quantization (INT3) is also supported for some model families
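
The storage cost of the default configuration can be worked out with a few lines of arithmetic. The sketch assumes that each group stores one 16-bit scale and one 4-bit zero-point alongside its 4-bit weights. These bit widths are assumptions for the calculation, and layers left in original precision (embeddings, output head) are not counted.

```python
def effective_bits_per_weight(w_bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Bits stored per weight once per-group metadata is amortized."""
    return w_bits + (scale_bits + zero_bits) / group_size


def compression_ratio(fp_bits=16, **kwargs):
    """Size of the FP16 weights divided by the size of the quantized weights."""
    return fp_bits / effective_bits_per_weight(**kwargs)


bits_g128 = effective_bits_per_weight()               # 4.15625 bits per weight
ratio_g128 = compression_ratio()                      # about 3.85x
bits_g64 = effective_bits_per_weight(group_size=64)   # 4.3125 bits per weight
```

Smaller groups track the weight distribution more closely but raise the metadata overhead, which is the trade-off behind the group size setting.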

Step 6: Save Quantized Checkpoint

Save the quantized model state dictionary as a PyTorch .pt file. This checkpoint contains the packed INT4 weights, scales, and zero-points for all quantized linear layers, while embedding and head layers remain in their original precision.

Key considerations:

The checkpoint can be loaded by TinyChat for inference
For HuggingFace integration, use the separate conversion script (convert_to_hf.py)
For edge devices with shared memory, the checkpoint can be further split into per-layer shards using split_ckpt.py
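
The idea of per-layer sharding can be sketched without reference to the actual script. The example below is hypothetical and does not reproduce split_ckpt.py: it groups the entries of a state dictionary by the transformer block index found in each key, assuming keys follow a "...layers.N..." naming pattern, and places everything else in a shared shard.

```python
import re
from collections import defaultdict

# Assumed key pattern, e.g. "model.layers.3.self_attn.q_proj.qweight".
LAYER_KEY = re.compile(r"\.layers\.(\d+)\.")


def group_by_layer(state_dict):
    """Group state-dict entries into one shard per transformer block."""
    shards = defaultdict(dict)
    for name, tensor in state_dict.items():
        match = LAYER_KEY.search(name)
        shard = f"layer_{int(match.group(1))}" if match else "shared"
        shards[shard][name] = tensor
    return dict(shards)


# Placeholder values stand in for tensors in this demonstration.
demo = {
    "model.embed_tokens.weight": 0,
    "model.layers.0.self_attn.q_proj.qweight": 1,
    "model.layers.0.self_attn.q_proj.scales": 2,
    "model.layers.1.mlp.up_proj.qweight": 3,
    "lm_head.weight": 4,
}
shards = group_by_layer(demo)
```

Each shard could then be saved to its own file so that a memory-constrained device loads one block at a time instead of the whole checkpoint.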
