Workflow: mit-han-lab/llm-awq AWQ Model Quantization
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Compression, Quantization |
| Last Updated | 2025-04-01 00:00 GMT |
Overview
End-to-end process for compressing large language model weights from FP16 to INT4 using Activation-aware Weight Quantization (AWQ), producing quantized checkpoints suitable for efficient inference.
Description
This workflow implements the AWQ quantization pipeline, which reduces the memory footprint of LLM weights by approximately 4x while preserving model accuracy. The method identifies salient weight channels (those corresponding to large activation magnitudes) and protects them during quantization through per-channel scaling. It also applies weight clipping to further reduce quantization error. The process covers calibration data preparation, activation-aware scale search, weight clipping optimization, and final weight packing into INT4 format. The resulting quantized checkpoint can be used with TinyChat for fast inference or converted to HuggingFace format for broader deployment.
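The "approximately 4x" figure can be checked with a little arithmetic. The sketch below assumes the default layout described in Step 5 (4-bit weights in groups of 128) and additionally assumes one FP16 scale and one 4-bit zero point per group. It ignores layers kept in their original precision, so the whole-checkpoint ratio will be somewhat lower.

```python
# Back-of-the-envelope storage cost per quantized weight.
# Assumed layout: 4-bit weights, groups of 128, and per group one FP16
# scale plus one 4-bit zero point (the zero-point width is an assumption).
FP16_BITS = 16
WEIGHT_BITS = 4
GROUP_SIZE = 128
SCALE_BITS = 16
ZERO_BITS = 4


def effective_bits_per_weight(w_bits, group_size, scale_bits, zero_bits):
    """Weight bits plus the per-group metadata amortized over the group."""
    return w_bits + (scale_bits + zero_bits) / group_size


bits = effective_bits_per_weight(WEIGHT_BITS, GROUP_SIZE, SCALE_BITS, ZERO_BITS)
ratio = FP16_BITS / bits
print(f"{bits:.3f} bits per weight, {ratio:.2f}x smaller than FP16")
```

Under these assumptions the per-group metadata costs well under one extra bit per weight, which is why the ratio stays close to 4x.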
Supported models: LLaMA 1/2/3, Qwen 2/2.5, OPT, CodeLlama, StarCoder, Vicuna, Falcon, MPT, DeepSeek-R1, LLaVA, VILA, NVILA, InternVL3.
Usage
Execute this workflow when you have a pretrained LLM (or VLM) in HuggingFace format and need to reduce its memory footprint for deployment on resource-constrained hardware (e.g., consumer GPUs with less than 24GB VRAM, or edge devices like NVIDIA Jetson Orin). The output is a quantized checkpoint file (.pt) containing INT4-packed weights with associated scales and zero points.
Execution Steps
Step 1: Environment Setup
Install the AWQ package and its dependencies, including the custom CUDA kernels for quantized operations. This involves cloning the repository, creating a conda environment, installing the Python package in editable mode, and building the CUDA extension for W4A16 inference kernels.
Key considerations:
- Python 3.10 is recommended
- CUDA toolkit must be available for kernel compilation
- For edge devices (Jetson Orin), modify pyproject.toml to remove the torch dependency and install PyTorch from NVIDIA prebuilt binaries
- FlashAttention is optional but recommended for inference speed
Step 2: Calibration Data Preparation
Load a small calibration dataset to measure activation magnitudes across the model. The default calibration set is the Pile validation split (pileval). Samples are tokenized, concatenated, and split into fixed-length blocks for uniform processing through the model layers.
Key considerations:
- By default, 128 samples are drawn from the Pile validation set and split into 512-token blocks
- Only textual calibration data is needed, even for multimodal models
- The calibration data drives the activation-aware scaling search but does not train the model
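The concatenate-and-split step can be sketched in a few lines. This is an illustration only: the function name `make_calibration_blocks` is hypothetical, toy integer lists stand in for real tokenizer output, and dropping the trailing partial block is an assumption about how the remainder is handled.

```python
from typing import List


def make_calibration_blocks(tokenized_samples: List[List[int]],
                            block_size: int) -> List[List[int]]:
    """Concatenate tokenized samples and cut them into equal-length blocks.

    The trailing remainder that does not fill a whole block is dropped,
    so every returned block has exactly `block_size` tokens.
    """
    stream = [tok for sample in tokenized_samples for tok in sample]
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Toy token ids stand in for real tokenizer output.
samples = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
blocks = make_calibration_blocks(samples, block_size=4)
print(blocks)
```

With the default settings, `block_size` would be 512 and the input would be the tokenized calibration samples.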
Step 3: AWQ Scale and Clip Search
Run the core AWQ search algorithm layer by layer through the model. For each transformer block, capture input activations to all linear layers, then search for optimal per-channel scaling factors that minimize quantization error on salient channels. Additionally, search for optimal weight clipping ranges to further reduce quantization loss. The search results (scales and clips) are saved to a cache file for reuse.
What happens per layer:
- Hook into all linear layers to capture input activations
- Run the activation-aware scaling search (grid search over scaling ratios)
- Run the MSE-based weight clipping optimization
- Record optimal scales and clips for each layer
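The scaling search can be illustrated on a toy linear layer. The following is a minimal pure-Python sketch, not the repository's implementation: it assumes per-channel scales of the form `mean(|x|) ** ratio`, a 20-point grid over `ratio`, one quantization group per weight row, and output MSE as the objective. All function names are hypothetical.

```python
import random


def fake_quant_row(row, n_bits=4):
    """Asymmetric round-to-nearest quantization of one row, returned dequantized."""
    lo, hi = min(row), max(row)
    qmax = 2 ** n_bits - 1
    scale = max(hi - lo, 1e-8) / qmax
    zero = round(-lo / scale)
    return [(min(max(round(w / scale) + zero, 0), qmax) - zero) * scale for w in row]


def linear(x, w):
    """Compute x @ w.T for x of shape [n, in] and w of shape [out, in]."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]


def output_mse(x, w_ref, w_test):
    """Mean squared difference between the outputs of two weight matrices."""
    ref, test = linear(x, w_ref), linear(x, w_test)
    n = len(ref) * len(ref[0])
    return sum((r - t) ** 2 for rr, tr in zip(ref, test) for r, t in zip(rr, tr)) / n


def search_scales(x, w, n_grid=20):
    """Grid-search the exponent that turns activation magnitudes into scales."""
    in_features = len(w[0])
    act_mag = [sum(abs(row[j]) for row in x) / len(x) for j in range(in_features)]
    best_err, best_ratio, best_scales = float("inf"), None, None
    for step in range(n_grid):
        ratio = step / n_grid  # ratio 0 means all scales are 1 (plain rounding)
        s = [max(m, 1e-4) ** ratio for m in act_mag]
        norm = (max(s) * min(s)) ** 0.5
        s = [v / norm for v in s]
        # Scale up input channels, quantize, then undo the scale so the
        # result can be compared against the original layer on the same x.
        w_scaled = [[wij * sj for wij, sj in zip(row, s)] for row in w]
        w_q = [[q / sj for q, sj in zip(fake_quant_row(row), s)] for row in w_scaled]
        err = output_mse(x, w, w_q)
        if err < best_err:
            best_err, best_ratio, best_scales = err, ratio, s
    return best_err, best_ratio, best_scales


random.seed(0)
x = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
for sample in x:
    sample[0] *= 10.0  # make input channel 0 salient
w = [[random.gauss(0, 0.1) for _ in range(8)] for _ in range(4)]

best_err, best_ratio, best_scales = search_scales(x, w)
baseline_err = output_mse(x, w, [fake_quant_row(r) for r in w])
print(f"plain rounding MSE={baseline_err:.6f}, searched MSE={best_err:.6f}, ratio={best_ratio}")
```

Because the grid includes `ratio = 0`, which reproduces plain round-to-nearest, the searched error can never be worse than the unscaled baseline.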
Key considerations:
- This step is compute-intensive but only runs once per model
- Pre-computed search results are available for many popular models in the AWQ Model Zoo
- Results are saved as a .pt file containing scale and clip tensors
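The MSE-based clipping search from this step can be sketched in the same spirit. The example below is an assumption-laden illustration for a single weight row: it shrinks a symmetric clipping threshold over a 20-point grid down to roughly half of the original maximum and keeps the threshold with the lowest output error. The grid size, shrink limit, and function names are hypothetical choices.

```python
import random


def fake_quant_row(row, n_bits=4):
    """Asymmetric round-to-nearest quantization of one row, returned dequantized."""
    lo, hi = min(row), max(row)
    qmax = 2 ** n_bits - 1
    scale = max(hi - lo, 1e-8) / qmax
    zero = round(-lo / scale)
    return [(min(max(round(w / scale) + zero, 0), qmax) - zero) * scale for w in row]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def search_clip(x, row, n_grid=20, max_shrink=0.5):
    """Return (best_max, best_err): the clipping threshold with lowest output MSE."""
    org_max = max(abs(w) for w in row)
    ref = [dot(xr, row) for xr in x]  # full-precision outputs to match
    best_err, best_max = float("inf"), org_max
    for step in range(int(max_shrink * n_grid)):
        max_val = org_max * (1 - step / n_grid)  # step 0 means no clipping
        clipped = [min(max(w, -max_val), max_val) for w in row]
        q = fake_quant_row(clipped)
        err = sum((dot(xr, q) - r) ** 2 for xr, r in zip(x, ref)) / len(x)
        if err < best_err:
            best_err, best_max = err, max_val
    return best_max, best_err


random.seed(1)
x = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]
row = [random.gauss(0, 0.05) for _ in range(15)] + [1.0]  # one large outlier
org_max = max(abs(w) for w in row)

best_max, best_err = search_clip(x, row)
ref = [dot(xr, row) for xr in x]
q_plain = fake_quant_row(row)
baseline_err = sum((dot(xr, q_plain) - r) ** 2 for xr, r in zip(x, ref)) / len(x)
print(f"original max={org_max:.3f}, chosen clip={best_max:.3f}")
print(f"unclipped MSE={baseline_err:.6f}, clipped MSE={best_err:.6f}")
```

Clipping trades a larger error on the few outlier weights for a finer quantization grid on all the others, and the search keeps a threshold only if the net effect on the output is positive.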
Step 4: Apply AWQ Transforms
Load the previously saved AWQ search results and apply the optimal scaling factors and weight clips to the full-precision model. This transforms the weight matrices so that salient channels are better preserved during the subsequent quantization step.
Key considerations:
- This step modifies the FP16 weights in-place before quantization
- Each scale multiplies the corresponding input channel of the weights, and its inverse is folded into the preceding layer's output, so the layer's result is mathematically unchanged before quantization
- Clips truncate weight outliers to the optimized range
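A tiny numeric check makes the equivalence concrete. This is a sketch with made-up values: the scales are powers of two so the floating-point arithmetic is exact, and the clipping threshold of 1.5 is arbitrary rather than a searched value.

```python
def linear(x, w):
    """Compute w @ x for a single input vector x and weight matrix w [out, in]."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]


def clip_weights(w, max_val):
    """Clamp every weight into the range [-max_val, max_val]."""
    return [[min(max(wij, -max_val), max_val) for wij in row] for row in w]


x = [0.5, -2.0, 8.0]                              # input to the linear layer
w = [[0.1, 0.2, -0.3], [0.4, -0.5, 0.6]]          # weights, shape [out, in]
s = [1.0, 2.0, 4.0]                               # per-input-channel scales

# Fold 1/s into the input (the preceding layer's output) and s into the weights.
x_folded = [xi / si for xi, si in zip(x, s)]
w_scaled = [[wij * sj for wij, sj in zip(row, s)] for row in w]

y_ref = linear(x, w)
y_new = linear(x_folded, w_scaled)
print(y_ref, y_new)

# Clipping then truncates outliers in the scaled weights.
w_clipped = clip_weights(w_scaled, max_val=1.5)
print(w_clipped)
```

Scaling alone leaves the output untouched. Only clipping and the later rounding step change it, which is why both are tuned against output error in Step 3.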
Step 5: Weight Quantization and Packing
Quantize the transformed model weights from FP16 to INT4 with group-wise quantization. Each group of weights (default 128 per group) shares a scale and zero-point. The quantized weights are packed into a compact format suitable for efficient CUDA kernel dispatch during inference.
Two modes available:
- Pseudo quantization (fake): Simulates quantization by rounding weights to the INT4 grid and storing the dequantized values in FP16. Used for evaluation only.
- Real quantization: Packs weights into actual INT4 representation with WQLinear modules. Produces the final checkpoint for deployment.
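The difference between the two modes is easiest to see on a toy example. The sketch below assumes asymmetric round-to-nearest quantization with one scale and one zero point per group. The function name is hypothetical, and a group size of 4 is used only to keep the example readable. It returns both the integer codes (what real quantization would store) and the dequantized floats (what pseudo quantization keeps).

```python
from typing import List, Tuple


def quantize_groupwise(weights: List[float], n_bits: int = 4,
                       group_size: int = 128) -> Tuple[List[int], List[float]]:
    """Return (integer codes, dequantized floats) using one scale/zero per group."""
    assert len(weights) % group_size == 0, "length must be a multiple of group_size"
    qmax = 2 ** n_bits - 1
    codes, dequantized = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = max(hi - lo, 1e-5) / qmax
        zero = min(max(round(-lo / scale), 0), qmax)
        for w in group:
            q = min(max(round(w / scale) + zero, 0), qmax)
            codes.append(q)                        # real mode: store the integer
            dequantized.append((q - zero) * scale)  # pseudo mode: store the float
    return codes, dequantized


weights = [-0.8, -0.1, 0.3, 0.7, -0.02, 0.01, 0.03, -0.04]
codes, deq = quantize_groupwise(weights, n_bits=4, group_size=4)
print(codes)
print([round(v, 4) for v in deq])
```

The second group holds much smaller values than the first, yet it still uses the full range of 4-bit codes because it gets its own scale. That is the motivation for group-wise rather than per-tensor quantization.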
Key considerations:
- Default configuration is 4-bit with group size 128 and zero-point enabled
- The output file is automatically named with a v2 suffix for the current weight format
- 3-bit quantization (INT3) is also supported for some model families
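Packing can be illustrated with plain integer arithmetic. The sketch below stores eight 4-bit codes in one 32-bit word, lowest nibble first. This ordering is an assumption made for readability. The layout expected by the actual CUDA kernels may interleave values differently, so the helper names and bit order here are hypothetical.

```python
from typing import List


def pack_int4(values: List[int]) -> List[int]:
    """Pack 4-bit codes into 32-bit words, eight codes per word, low nibble first."""
    if len(values) % 8 != 0:
        raise ValueError("number of values must be a multiple of 8")
    packed = []
    for start in range(0, len(values), 8):
        word = 0
        for i, v in enumerate(values[start:start + 8]):
            if not 0 <= v <= 15:
                raise ValueError(f"value {v} does not fit in 4 bits")
            word |= v << (4 * i)
        packed.append(word)
    return packed


def unpack_int4(words: List[int]) -> List[int]:
    """Inverse of pack_int4: recover eight 4-bit codes from each 32-bit word."""
    return [(word >> (4 * i)) & 0xF for word in words for i in range(8)]


codes = [1, 2, 3, 4, 5, 6, 7, 8, 15, 0, 15, 0, 15, 0, 15, 0]
words = pack_int4(codes)
print([hex(w) for w in words])
restored = unpack_int4(words)
```

Eight codes per word is what makes the stored weights four times smaller than FP16, where the same 32 bits would hold only two values.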
Step 6: Save Quantized Checkpoint
Save the quantized model state dictionary as a PyTorch .pt file. This checkpoint contains the packed INT4 weights, scales, and zero-points for all quantized linear layers, while embedding and head layers remain in their original precision.
Key considerations:
- The checkpoint can be loaded by TinyChat for inference
- For HuggingFace integration, use the separate conversion script (convert_to_hf.py)
- For edge devices with shared memory, the checkpoint can be further split into per-layer shards using split_ckpt.py