Principle:Mit han lab Llm awq W8A8 Quantized Linear
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Architecture |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
This principle covers implementing linear layers with INT8 weights and INT8 activations (W8A8), using dynamic input scaling and FP16 output for efficient inference.
Description
W8A8 quantized linear layers store weights in INT8 and quantize activations to INT8 at runtime. The dynamic input scaling variant computes one scale per token, so tokens with very different activation magnitudes each use the full INT8 range. The INT8 products are accumulated as integers (typically INT32 on INT8 tensor cores), and the result is dequantized to FP16 so that downstream layers receive FP16 inputs. A factory method (from_linear) converts a pre-trained FP16 linear layer, and a separate from_qkv method fuses the Q, K, and V projections into a single quantized layer.
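The idea behind fusing Q, K, and V can be illustrated with a small dependency-free Python sketch. It assumes the fusion stacks the three weight matrices (and biases) along the output dimension; the helper names `linear`, `fuse_qkv`, and `split_qkv` are hypothetical and are not taken from the library's from_qkv implementation. The sketch works on unquantized values only, to show that one fused projection reproduces the three separate ones.

```python
def linear(x, w, b):
    """Plain y = x @ W^T + b on nested lists; w has one row per output feature."""
    return [
        [sum(a * c for a, c in zip(x_row, w_row)) + b_o for w_row, b_o in zip(w, b)]
        for x_row in x
    ]


def fuse_qkv(q, k, v):
    """Stack (weight, bias) pairs of Q, K, V along the output dimension (assumed layout)."""
    fused_w = q[0] + k[0] + v[0]   # list concatenation: rows of Q, then K, then V
    fused_b = q[1] + k[1] + v[1]
    return fused_w, fused_b


def split_qkv(y, sizes):
    """Slice each fused output row back into its Q, K, and V parts."""
    parts, start = [], 0
    for size in sizes:
        parts.append([row[start:start + size] for row in y])
        start += size
    return parts


# Toy projections: 2 input features, 2 output features each (illustrative values).
q_proj = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
k_proj = ([[0.5, 0.5], [1.0, -1.0]], [0.1, 0.2])
v_proj = ([[2.0, 1.0], [-1.0, 3.0]], [0.0, -0.5])
x = [[1.0, 2.0], [3.0, -1.0]]      # two tokens

fused_w, fused_b = fuse_qkv(q_proj, k_proj, v_proj)
q_out, k_out, v_out = split_qkv(linear(x, fused_w, fused_b), [2, 2, 2])
```

Under this assumption, a single GEMM replaces three, and the input only needs to be quantized once per token rather than once per projection.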
Usage
Apply this principle when replacing FP16 linear layers with INT8 versions to reduce memory footprint and leverage INT8 tensor cores.
Theoretical Basis
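Given an FP16 weight W (shape out_features × in_features, as in a standard linear layer) and an input x with one row per token:
- Static weight quantization: W_q = round(W / s_w), with s_w pre-computed offline
- Dynamic activation quantization: x_q = round(x / s_x), with s_x = max(|x_row|) / 127 computed per token at runtime
- INT8 GEMM: y_int = x_q @ W_q^T
- Dequantize: y = y_int * (s_w * s_x) + bias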
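The four steps above can be sketched in plain Python without any dependencies. This is a minimal illustration, not the library's kernel: it treats x as one row per token and W as one row per output feature, and it assumes a per-output-channel weight scale and clamping to [-127, 127], neither of which is stated in the formulas. The function names `quantize_rows` and `w8a8_linear` are hypothetical.

```python
def quantize_rows(m):
    """Symmetric INT8 quantization with one scale per row: s = max(|row|) / 127."""
    q, scales = [], []
    for row in m:
        scale = max(abs(v) for v in row) / 127
        if scale == 0.0:
            scale = 1.0                          # all-zero row: any scale works
        scales.append(scale)
        q.append([max(-127, min(127, round(v / scale))) for v in row])
    return q, scales


def w8a8_linear(x, w_q, s_w, bias):
    """Quantize x per token, do an integer matmul, then dequantize and add bias."""
    x_q, s_x = quantize_rows(x)                  # dynamic: computed at call time
    y = []
    for t, x_row in enumerate(x_q):
        out = []
        for o, w_row in enumerate(w_q):
            acc = sum(a * b for a, b in zip(x_row, w_row))   # integer accumulation
            out.append(acc * (s_w[o] * s_x[t]) + bias[o])    # dequantize
        y.append(out)
    return y


# Static step: weights are quantized once, ahead of inference (illustrative values).
w = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
bias = [0.1, -0.2]
w_q, s_w = quantize_rows(w)

# Two tokens with very different magnitudes; each gets its own scale.
x = [[1.0, 2.0, -3.0], [0.01, -0.02, 0.03]]
y = w8a8_linear(x, w_q, s_w, bias)

# Full-precision reference for comparison.
y_ref = [[sum(a * b for a, b in zip(xr, wr)) + bo for wr, bo in zip(w, bias)] for xr in x]
```

Because the second token receives its own scale, its small values are not rounded to zero, as they would be if a single scale were shared with the first token.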