Principle: AWQ Transform Application (mit-han-lab/llm-awq)
Overview
Process of applying precomputed activation-aware scaling and clipping transforms to model weights prior to quantization.
Description
After the AWQ search phase produces optimal per-channel scales and per-group clipping values, these transforms must be applied to the model weights. Each scale is absorbed into the preceding operation (a LayerNorm, Linear, or activation function) and the linear layers that consume its output, so the network's function is unchanged. Clipping directly constrains weight values to the searched ranges. Separating search from application makes it possible to save and load AWQ results without re-running the expensive search.
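The scale-folding step can be sketched as follows. This is an illustrative numpy sketch, not the repo's actual apply_scale/apply_clip implementation; the function names, signatures, and the LayerNorm-into-linear case shown are assumptions:

```python
import numpy as np

def apply_scale(ln_gamma, ln_beta, linear_weights, scales):
    """Fold per-channel scales into a preceding LayerNorm and the linear
    layers that consume its output (names and signature are assumptions).

    The LayerNorm's affine parameters are divided by the scales, so its
    output shrinks by 1/s per channel; each following linear layer's
    input channels are multiplied by s, leaving the product unchanged.
    """
    ln_gamma = ln_gamma / scales
    ln_beta = ln_beta / scales
    linear_weights = [w * scales[None, :] for w in linear_weights]
    return ln_gamma, ln_beta, linear_weights

def apply_clip(weight, clip_max):
    """Clamp each output channel to the bounds found during the search."""
    return np.clip(weight, -clip_max[:, None], clip_max[:, None])
```

Because the division and multiplication cancel channel by channel, the folded model computes exactly the same outputs as the original before quantization is applied.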
Usage
Applied after loading AWQ search results (from a --load_awq checkpoint) and before quantization or evaluation.
Theoretical Basis
Two operations are involved:
- apply_scale - redistributes weight magnitude through mathematically equivalent transforms on adjacent layers
- apply_clip - constrains weights to the optimal per-group ranges found during the search
Scaling preserves the network's output exactly; clipping trades a small perturbation of outlier weights for a lower overall quantization error.
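As a toy illustration of why clipping reduces quantization error (a numpy sketch; the 4-bit round-to-nearest quantizer, the synthetic outlier, and the 4.0 clip threshold are all assumptions standing in for searched values):

```python
import numpy as np

def quantize_rtn(w, n_bits=4):
    # Symmetric per-tensor round-to-nearest quantization (illustrative).
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
w[0] = 12.0  # a single outlier stretches the quantization range

err_plain = np.mean((w - quantize_rtn(w)) ** 2)
w_clipped = np.clip(w, -4.0, 4.0)  # 4.0 stands in for a searched threshold
err_clipped = np.mean((w - quantize_rtn(w_clipped)) ** 2)
# Clipping the outlier shrinks the step size for every other weight,
# so the total error drops even though the outlier itself is distorted.
```

The searched clip values play the role of the hard-coded 4.0 here: they are chosen per group to minimize exactly this kind of reconstruction error.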
Related Pages
Knowledge Sources
- Paper: AWQ (https://arxiv.org/abs/2306.00978)
Domains
- Quantization
- NLP