Principle: MIT Han Lab LLM AWQ (Activation-Aware Weight Quantization)
Overview
Activation-Aware Weight Quantization (AWQ) is an algorithm that protects salient weights during low-bit quantization by finding per-channel scaling factors based on activation magnitudes.
Description
The central insight of AWQ is that a small fraction of weights (0.1% to 1%) in a large language model are disproportionately important for model quality. These "salient" weights correspond to input channels with large activation magnitudes. When these weights are quantized naively to low bit-widths (e.g., INT4), the resulting quantization error on these critical channels causes significant degradation in model output.
A straightforward solution would be to keep salient weights in higher precision (mixed-precision quantization). However, this approach requires specialized hardware support for mixed-precision matrix operations, which limits practical deployment.
AWQ takes a different approach: rather than using mixed precision, it applies mathematically equivalent transformations that reduce quantization error on salient channels. Specifically, AWQ finds a per-input-channel scaling vector for each group of linear layers. Scaling up the salient weight channels before quantization (and scaling down the corresponding activations to compensate) reduces the relative quantization error on the important weights. This trades a slight increase in quantization error on less important channels for a significant reduction in error on salient ones.
The transformation is mathematically equivalent -- it does not change the floating-point output of the layer. It only changes how quantization error is distributed across channels, concentrating precision where it matters most.
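This equivalence is easy to verify numerically. The sketch below (NumPy, with arbitrary positive scales standing in for the activation-derived ones) confirms that scaling weight columns up while scaling the matching input channels down leaves the full-precision output unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))         # (out_features, in_features)
x = rng.normal(size=(16,))           # one input activation vector
s = rng.uniform(0.5, 2.0, size=16)   # hypothetical per-channel scales

y_ref = W @ x
y_scaled = (W * s) @ (x / s)         # scale weight columns up, activations down

assert np.allclose(y_ref, y_scaled)  # identical in full precision
```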
The algorithm proceeds as follows:
- For each transformer block, identify groups of linear layers that share a common input (e.g., Q/K/V projections sharing the same LayerNorm output).
- For each group, compute activation statistics from calibration data to identify salient channels.
- Perform a grid search over scaling ratios to find the per-channel scaling that minimizes output MSE after quantization.
- Apply the scaling by modifying the preceding operation (LayerNorm or linear layer) and the target weights.
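The steps above can be condensed into a minimal NumPy sketch. The quantizer, function names, default group size, and the scale normalization are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def pseudo_quantize(w, n_bits=4, group_size=128):
    """Round-to-nearest asymmetric quantize-dequantize per group of input channels."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    w_max = w.max(axis=1, keepdims=True)
    w_min = w.min(axis=1, keepdims=True)
    scale = (w_max - w_min).clip(min=1e-5) / (2**n_bits - 1)
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, 2**n_bits - 1)
    return ((q - zero) * scale).reshape(orig_shape)

def awq_search_scale(w, x, n_bits=4, group_size=128, n_grid=20):
    """Grid-search alpha minimizing output MSE of the quantized layer.
    w: (out_features, in_features); x: (tokens, in_features) calibration data."""
    x_max = np.abs(x).max(axis=0).clip(min=1e-4)  # per-channel activation peak
    y_fp = x @ w.T                                # full-precision reference output
    best_err, best_s = np.inf, np.ones_like(x_max)
    for i in range(n_grid):
        alpha = i / n_grid
        s = x_max ** alpha
        # Centering the scales around 1 stabilizes the search (an assumption
        # borrowed in spirit from common AWQ implementations).
        s = s / np.sqrt(s.max() * s.min())
        # Quantize the scaled weights; dividing by s afterwards models folding
        # diag(s)^{-1} into the activations for error measurement.
        w_q = pseudo_quantize(w * s, n_bits, group_size) / s
        err = ((x @ w_q.T - y_fp) ** 2).mean()
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```

Note that alpha = 0 reduces to plain round-to-nearest (s = 1 everywhere), so the search can never do worse than RTN on the calibration data.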
Usage
AWQ is used when compressing large language models from FP16 to INT4 (or other low-bit formats) while preserving model quality. It is particularly effective for:
- Deploying LLMs on edge devices with limited memory
- Reducing GPU memory requirements for inference serving
- Achieving near-lossless INT4 quantization without retraining
AWQ serves as an alternative to other post-training quantization methods:
- Round-to-Nearest (RTN): Simple rounding without any calibration-based optimization. AWQ consistently outperforms RTN.
- GPTQ: Uses second-order (Hessian) information to optimize weight rounding decisions. AWQ achieves comparable or better quality with lower computational cost.
- SmoothQuant: Focuses on activation quantization by migrating quantization difficulty from activations to weights. AWQ focuses on weight-only quantization.
Theoretical Basis
Given a weight matrix W and input activation matrix X, standard quantization applies:
Q(W) * X
where Q is the quantization function. The quantization error is:
err = ||W * X - Q(W) * X||
AWQ introduces a diagonal scaling matrix s and reformulates:
Q(W * diag(s)) * diag(s)^{-1} * X
This is mathematically equivalent in full precision (since diag(s) * diag(s)^{-1} = I), but changes how quantization error is distributed. The scaling vector s is defined as:
s = x_max^alpha
where x_max is the per-channel maximum activation magnitude computed from calibration data, alpha is a scaling ratio in the range [0, 1], and the power is applied element-wise.
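As a concrete illustration of this formula, with hypothetical per-channel activation maxima:

```python
import numpy as np

x_max = np.array([0.1, 0.5, 8.0])  # hypothetical per-channel activation peaks
for alpha in (0.0, 0.5, 1.0):
    print(alpha, x_max ** alpha)
# alpha = 0   -> s = [1, 1, 1]            (no scaling, plain RTN)
# alpha = 0.5 -> s = [~0.32, ~0.71, ~2.83] (salient channel scaled up ~2.8x)
# alpha = 1   -> s = [0.1, 0.5, 8.0]      (scaling fully tracks activations)
```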
The optimal alpha is found by grid search over 20 evenly-spaced points:
alpha in {0, 1/20, 2/20, ..., 19/20}
For each candidate alpha, the algorithm:
- Computes the scaling vector s = x_max^alpha
- Applies the scaling to weights and the preceding operation
- Quantizes the scaled weights
- Measures the output MSE: MSE = ||block(x) - block_quant(x)||^2
- Selects the alpha that yields the minimum MSE
This per-channel scaling search is performed independently for each group of linked linear layers within each transformer block.
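The final application step, folding diag(s)^{-1} into a preceding LayerNorm so that no extra runtime operation is needed, can be verified with a minimal sketch (NumPy; the helper and variable names are illustrative):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm with per-channel affine parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))
gamma, beta = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(8, d))          # the following linear layer
s = rng.uniform(0.5, 2.0, size=d)    # scales produced by the AWQ search

y_ref = layernorm(x, gamma, beta) @ W.T

# Fold diag(s)^{-1} into the LayerNorm affine parameters and diag(s) into W;
# W * s is what subsequently gets quantized.
y_folded = layernorm(x, gamma / s, beta / s) @ (W * s).T

assert np.allclose(y_ref, y_folded)  # outputs match in full precision
```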
Related Pages
Knowledge Sources
- Paper|AWQ|https://arxiv.org/abs/2306.00978
Domains
- NLP
- Quantization
- Deep_Learning