Principle: MIT Han Lab LLM AWQ Pseudo Quantization
Overview
Technique that simulates the numerical effects of low-bit quantization on FP16 weights without actually packing them, enabling accuracy evaluation.
Description
Pseudo quantization applies the quantize-then-dequantize operation to weight tensors: weights are rounded to n-bit grid values and then scaled back to FP16. This introduces the same quantization noise as real INT4 deployment but keeps weights in FP16 format, allowing standard PyTorch inference without custom CUDA kernels. Used for evaluating quantization quality (perplexity, benchmarks) before committing to real quantization.
Usage
When evaluating AWQ quantization quality without deploying with custom kernels (--q_backend fake mode).
Theoretical Basis
w_fake = dequant(quant(w)) = (clamp(round(w/s) + z, 0, 2^n - 1) - z) * s
where:
- s = (max - min) / (2^n - 1)
- z = round(-min / s)
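The formula above can be sketched in PyTorch as a group-wise fake-quantization function. This is a minimal illustration, not AWQ's exact implementation; the function name and the group size of 128 are illustrative choices:

```python
import torch

def pseudo_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Quantize-then-dequantize: inject n-bit quantization noise while
    keeping the tensor in its original floating-point format."""
    orig_shape = w.shape
    # Group-wise quantization: each run of `group_size` weights shares one
    # scale s and zero-point z (asymmetric, per the formulas above).
    w = w.reshape(-1, group_size)
    max_val = w.amax(dim=1, keepdim=True)
    min_val = w.amin(dim=1, keepdim=True)
    max_int = 2 ** n_bits - 1
    # s = (max - min) / (2^n - 1); clamp avoids division by zero in flat groups
    scales = (max_val - min_val).clamp(min=1e-5) / max_int
    # z = round(-min / s), kept inside the representable integer range
    zeros = (-torch.round(min_val / scales)).clamp(0, max_int)
    # w_fake = (clamp(round(w/s) + z, 0, 2^n - 1) - z) * s
    w_fake = (torch.clamp(torch.round(w / scales) + zeros, 0, max_int) - zeros) * scales
    return w_fake.reshape(orig_shape)
```

Because the result stays in floating point, it can be dropped into any existing model for perplexity or benchmark evaluation; within each group, weights collapse onto at most 2^n distinct grid values.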
Related Pages
Knowledge Sources
- Paper|AWQ|https://arxiv.org/abs/2306.00978
Domains
- Quantization
- Evaluation