Principle: InternLM LMDeploy AWQ Weight Quantization
| Knowledge Sources | |
|---|---|
| Domains | Model_Compression, Quantization |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
An activation-aware weight quantization algorithm that compresses model weights to 4-bit integers while preserving quality by protecting salient weight channels identified through activation analysis.
Description
AWQ (Activation-aware Weight Quantization) reduces LLM memory footprint by approximately 4x through 4-bit integer quantization of model weights (W4A16: 4-bit weights, 16-bit activations). The key insight is that not all weight channels are equally important: channels corresponding to large activation magnitudes have a disproportionate impact on output quality.
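A back-of-the-envelope calculation makes the "approximately 4x" figure concrete. This sketch uses illustrative numbers for a hypothetical 7B-parameter model (not measured values) and assumes each group of 128 weights stores one FP16 scale and one FP16 zero point:

```python
# Rough weight-memory estimate for a 7B-parameter model (illustrative).
# FP16 stores 2 bytes per weight; W4A16 stores 4 bits (0.5 bytes) per
# weight plus per-group scale/zero-point overhead.
params = 7e9
fp16_gb = params * 2 / 1024**3

group_size = 128  # weights per quantization group
# assume each group keeps one FP16 scale and one FP16 zero point (4 bytes)
int4_gb = (params * 0.5 + (params / group_size) * 4) / 1024**3

print(f"FP16: {fp16_gb:.1f} GiB, W4A16: {int4_gb:.1f} GiB, "
      f"ratio: {fp16_gb / int4_gb:.2f}x")
```

The per-group metadata is why the ratio lands slightly below the ideal 4x.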
The AWQ algorithm:
- Collects activation statistics from a calibration dataset
- Identifies salient weight channels based on activation magnitudes
- Applies per-group asymmetric quantization to the weights
- Searches for per-channel scaling factors that minimize the layer's quantization error
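The steps above can be sketched end to end on toy data. This is not LMDeploy's implementation: it uses synthetic "calibration" activations, a coarse per-output-channel quantizer, and a grid search over the scaling exponent alpha, mirroring the scale search described in the AWQ paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy calibration batch: a few input channels carry much larger activations
X = rng.normal(size=(256, 64))
X[:, :4] *= 10.0                      # "salient" channels
W = rng.normal(size=(64, 64))         # (in_channels, out_channels)

def quantize(w, bits=4):
    # per-output-channel symmetric round-to-nearest (sketch granularity only)
    scale = np.abs(w).max(axis=0, keepdims=True) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

# step 1: activation statistics from the calibration data
act_mag = np.abs(X).mean(axis=0)

# steps 2-4: s_c = |x_c|^alpha scales salient weight channels up (and the
# matching activations down) before quantization; alpha is grid-searched
best_err, best_alpha = np.inf, 0.0
for alpha in np.linspace(0.0, 1.0, 21):
    s = act_mag ** alpha
    s = s / np.sqrt(s.max() * s.min())        # normalization from the paper
    err = np.linalg.norm(X @ W - (X / s) @ quantize(W * s[:, None]))
    if err < best_err:
        best_err, best_alpha = err, alpha

plain_err = np.linalg.norm(X @ W - X @ quantize(W))
print(f"alpha={best_alpha:.2f}: err {best_err:.1f} vs plain {plain_err:.1f}")
```

Because alpha = 0 reduces to plain round-to-nearest, the searched result can never be worse than quantizing without activation-aware scaling.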
AWQ-quantized models are served using the TurboMind backend with optimized INT4 GEMM kernels.
Usage
Use AWQ when you need to reduce model memory by ~4x for deployment on limited GPU memory. Preferred over GPTQ for most use cases due to better accuracy preservation and faster quantization. Requires a calibration dataset (default: WikiText-2, 128 samples).
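A typical invocation looks like the following. The model path is an example, and the flag names reflect the lmdeploy CLI as documented; verify them against your installed version:

```shell
# Quantize with AWQ (calibration defaults shown explicitly)
lmdeploy lite auto_awq internlm/internlm2-chat-7b \
    --calib-dataset wikitext2 \
    --calib-samples 128 \
    --w-bits 4 \
    --w-group-size 128 \
    --work-dir ./internlm2-chat-7b-4bit

# Serve the quantized model with the TurboMind backend
lmdeploy serve api_server ./internlm2-chat-7b-4bit --model-format awq
```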
Theoretical Basis
AWQ identifies salient channels using activation magnitudes and protects them during quantization by scaling before rounding:

$s_c = \mathbb{E}\big[\,|x_c|\,\big]^{\alpha}, \qquad y = Q(w_c \cdot s_c)\,\frac{x_c}{s_c}$

where $x_c$ is the activation for channel $c$ and $w_c$ is the weight for channel $c$. Scaling high-saliency channels up before quantization gives them a finer effective quantization granularity.
The quantization formula per group:

$W_q = \mathrm{round}\!\left(\dfrac{W - \mathrm{zero\_point}}{\mathrm{scale}}\right)$
The group size is typically 128: each group of 128 consecutive weights shares one scale/zero-point pair.
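A minimal numpy sketch of this per-group asymmetric scheme (illustrative only; the zero point here is taken as the group minimum so the formula matches the one above):

```python
import numpy as np

def quantize_group(w, bits=4):
    """Asymmetric round-to-nearest for one group of weights."""
    qmax = 2 ** bits - 1                      # 15 for 4-bit
    scale = (w.max() - w.min()) / qmax
    zero_point = w.min()                      # zero point in the weight domain
    q = np.clip(np.round((w - zero_point) / scale), 0, qmax)
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    return q * scale + zero_point

rng = np.random.default_rng(0)
w = rng.normal(size=1024)                     # one weight row
group_size = 128
recon = np.empty_like(w)
for i in range(0, w.size, group_size):
    g = w[i:i + group_size]                   # each group gets its own pair
    q, s, z = quantize_group(g)
    recon[i:i + group_size] = dequantize_group(q, s, z)

# worst-case error is half a quantization step of the widest group
max_err = np.abs(w - recon).max()
print(f"max reconstruction error: {max_err:.4f}")
```

Smaller groups shrink the quantization step inside each group at the cost of storing more scale/zero-point pairs.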